Why Your Data Loads Take Too Long

During twenty years of application development for clients, I have optimized process-automation data loads primarily by replacing row-by-row INSERT statements with a bulk load process. In SQL Server, an external command-line utility, bcp.exe, streams data both into and out of the database. In MySQL, the inward flow is handled by the LOAD DATA INFILE command.
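Here is a minimal sketch of the MySQL side, issued from Perl through DBI. The database name, credentials, table, and file path are all hypothetical examples, and LOCAL INFILE must be permitted on both the client and the server for this to run:

```perl
#!/usr/bin/perl
# Sketch: one LOAD DATA statement in place of thousands of INSERTs.
# DSN, credentials, table, and file path are hypothetical examples.
use strict;
use warnings;
use DBI;

# mysql_local_infile=1 lets the client send a local file to the server;
# the server must also allow LOCAL INFILE.
my $dbh = DBI->connect(
    'DBI:mysql:database=warehouse;host=localhost;mysql_local_infile=1',
    'loader', 'secret',
    { RaiseError => 1, AutoCommit => 1 },
);

my $rows = $dbh->do(q{
    LOAD DATA LOCAL INFILE '/data/staging/orders.tsv'
    INTO TABLE orders
    FIELDS TERMINATED BY '\t'
    LINES TERMINATED BY '\n'
});

print "Loaded $rows rows\n";
$dbh->disconnect;
```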

The key to using these bulk load processes is formatting the data properly in files before loading it. In my experience, the best language for parsing large streams of data is Perl. If you convert your INSERT-statement-generating PHP, Java, C#, VB, etc. into a Perl scrubbing and formatting operation and use the database's bulk load utility, I am confident you will see a significant increase in the performance of your data loads. I have used operating system shell scripting languages to tie it all together for automation.
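As an illustration, here is a sketch of such a scrubbing-and-formatting pass. The pipe-delimited input, the four-field record layout, and the MM/DD/YYYY date format are assumptions for the example; substitute your own rules:

```perl
#!/usr/bin/perl
# Sketch: scrub raw pipe-delimited input into clean tab-delimited output
# ready for LOAD DATA INFILE or bcp. Field layout and paths are assumed.
use strict;
use warnings;

open my $in,  '<', '/data/raw/orders.txt'     or die "open input: $!";
open my $out, '>', '/data/staging/orders.tsv' or die "open output: $!";

while (my $line = <$in>) {
    chomp $line;
    my @f = split /\|/, $line, -1;   # -1 keeps trailing empty fields
    next unless @f == 4;             # drop malformed rows (4 fields assumed)
    s/^\s+|\s+$//g for @f;           # trim whitespace from every field
    # Normalize an assumed MM/DD/YYYY date in the third field to YYYY-MM-DD.
    $f[2] =~ s{^(\d{2})/(\d{2})/(\d{4})$}{$3-$1-$2};
    print {$out} join("\t", @f), "\n";
}

close $in;
close $out;
```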

Another factor is the configuration of the relational database's transaction logging. Most lightweight databases do not log transactions because they do not enforce ACID principles; in fact, many databases, relational or not, are built on straightforward ISAM principles. Transactional databases, those that must be able to recover from a failed transaction (e.g., a failed INSERT), perform extra work to make that recovery possible. But what many freelance web programmers don't realize is that this extra activity at every inserted row slows the load process down significantly. In a data warehouse, this transactional logging can be turned off; data recovery is instead controlled by managing the files used during the bulk load operation.
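In SQL Server, for example, this overhead is governed by the database recovery model, and bulk operations such as bcp loads are minimally logged under the BULK_LOGGED model. The sketch below, with a hypothetical ODBC DSN and database name, switches the model around a load; moving between FULL and BULK_LOGGED does not break the log backup chain:

```perl
#!/usr/bin/perl
# Sketch: switch a SQL Server database to BULK_LOGGED around a bulk load
# so the load is minimally logged. DSN, credentials, and database name
# are hypothetical.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('DBI:ODBC:DSN=warehouse', 'loader', 'secret',
                       { RaiseError => 1, AutoCommit => 1 });

$dbh->do('ALTER DATABASE warehouse SET RECOVERY BULK_LOGGED');

# ... run bcp.exe or BULK INSERT here ...

$dbh->do('ALTER DATABASE warehouse SET RECOVERY FULL');
$dbh->disconnect;
```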

A bulk load in a transactional database treats the entire load as a single transaction, so breaking a large set of rows into separate files can make end-user support more manageable. If the files are split into reasonable pieces, the cause of a failed file load can be investigated and dealt with at a more granular level. What counts as "manageable" is determined case by case by the latency and use requirements of the data.
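A simple way to do the splitting is in the same Perl pass that formats the data. In this sketch the 100,000-row chunk size and the file names are arbitrary examples; tune them to your own requirements:

```perl
#!/usr/bin/perl
# Sketch: split one large staging file into fixed-size chunks so a failed
# load can be isolated to a single file. The 100,000-row chunk size and
# the file names are arbitrary examples.
use strict;
use warnings;

my $chunk_rows = 100_000;
my ($count, $part, $out) = (0, 0, undef);

open my $in, '<', '/data/staging/orders.tsv' or die "open: $!";
while (my $line = <$in>) {
    if ($count % $chunk_rows == 0) {        # start a new chunk file
        close $out if $out;
        $part++;
        open $out, '>', sprintf('/data/staging/orders_%04d.tsv', $part)
            or die "open chunk: $!";
    }
    print {$out} $line;
    $count++;
}
close $out if $out;
close $in;

print "Wrote $count rows across $part files\n";
```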
