A couple of weeks ago I was writing a small application to perform some ETL (Extract, Transform, Load) operations. As I was researching an efficient way to perform a large number of SQL insert statements, I found out about the BCP (Bulk CoPy) utility for MS SQL Server. This tool performs efficient data insertion (something in the order of thousands of row inserts per second), using as source CSV (Comma Separated Values) files. The mapping is performed through configuration files in a semi-structured format, or using an XML file (the latter is strongly recommended). Basically, the trick resides that unlike in regular inserts, database restrictions (primary keys, foreign keys, unique restrictions, null values, etc.) are not enforced for each row, but only at the end of the introduction, or per each block.
This was good solution for my needs at the time, which was to extract a large number of data from CSV files into a small number of tables, on a secondary database to the system in question. Of course, the mapping is limited: you cannot perform data normalization, only insert/remove columns, or switch its order. This means that either the input data is normalized, or you end up with a denormalized database. In sum, BCP does not replace an ETL tool, but provides an interesting complement in the "Loading" part.
By the way, a very nice book on the subject is Professional SQL Server 2005 Programming, by Robert Vieira, published by Wrox.
Until next time, may you enjoy bulk copying.
No comments:
Post a Comment