At Locus Energy, there are a few different types of databases, each containing large quantities of data that are moved around frequently for caching or ease of access. This process comes with the unique challenge of determining an efficient and reliable method for inter-database data transfer. Some tables participate in Locus' external API, making long-lasting table locks unattractive. Additionally, with most of the data being updated regularly, enforcing data integrity requires atomic interactions with the database.
In concrete terms, Locus' goal is to run a query against one database and insert the results into a table in a different database, causing as little disruption as possible while maintaining the accuracy of the data. In many cases, the source and destination are entirely different RDBMSs.
A Naive Approach
The most straightforward way to tackle this problem is to simply insert into the destination table as the query's result set is iterated over. For example:
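A minimal sketch of this pattern, using Python's sqlite3 module to stand in for both databases (the `readings` table and its schema are invented for illustration):

```python
import sqlite3

# sqlite3 stands in for two separate databases; the `readings` table and
# its schema are invented for illustration.
source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
source.execute("CREATE TABLE readings (site_id INTEGER, value REAL)")
source.executemany("INSERT INTO readings VALUES (?, ?)",
                   [(1, 10.0), (2, 20.0), (3, 30.0)])
dest.execute("CREATE TABLE readings (site_id INTEGER, value REAL)")

# Naive transfer: one round-trip insert per row of the result set.
for row in source.execute("SELECT site_id, value FROM readings"):
    dest.execute("INSERT INTO readings VALUES (?, ?)", row)
dest.commit()
```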
This works nicely in terms of memory and disk usage, assuming that the database does not buffer the entire result set upon query execution. However, there are serious issues when the result set contains a significant number of rows. Each “execute()” is a relatively expensive operation, requiring a command to be sent to the database and a response to be read. Because of this overhead, Locus will hold a lock on the queried table(s) for significantly longer than necessary. Furthermore, the insertion will hold row locks, meaning a majority of the table contents will be blocked throughout a potentially lengthy operation.
Staging and Batching
Locus can make a couple of simple adjustments to the previous approach that will help to deal with the aforementioned concerns. The first is to introduce some additional SQL setup. Rather than lock our external facing tables, a stage, swap, and drop approach is attempted. The idea here is to utilize constant time SQL operations to isolate the transaction and increase fault tolerance. The syntax is some variation of the following:
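A sketch of the stage, swap, and drop pattern, again with sqlite3 standing in for the destination database and invented table names. MySQL can perform the swap atomically with a single multi-table `RENAME TABLE`; sqlite requires one rename per statement:

```python
import sqlite3

db = sqlite3.connect(":memory:")  # stands in for the destination database
db.execute("CREATE TABLE readings (site_id INTEGER, value REAL)")
db.execute("INSERT INTO readings VALUES (1, 1.0)")  # stale data being replaced

# Stage: create an identical side table (constant time).
db.execute("CREATE TABLE readings_stage (site_id INTEGER, value REAL)")

# Load: the long-running insertion touches only the staging table, so
# readers of `readings` are never blocked. On failure, abort here and
# skip the swap below.
db.executemany("INSERT INTO readings_stage VALUES (?, ?)",
               [(1, 10.0), (2, 20.0)])

# Swap and drop (constant time). MySQL can swap in one atomic statement:
#   RENAME TABLE readings TO readings_old, readings_stage TO readings
db.execute("ALTER TABLE readings RENAME TO readings_old")
db.execute("ALTER TABLE readings_stage RENAME TO readings")
db.execute("DROP TABLE readings_old")
db.commit()
```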
Note that “create,” “rename,” and “drop” are all constant-time operations. For an insignificant time cost, and a doubled size requirement for each staged table, the row locks on the destination table are eliminated entirely. Additionally, a runtime failure during the longest-running operation, the insertion, has minimal consequences: the operation can simply be aborted and the table swap skipped.
This can then be combined with the batch insertion feature, which most RDBMS provide.
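A sketch of the batched variant under the same illustrative schema, where each batch of fetched rows becomes a single multi-row insert statement:

```python
import sqlite3

source = sqlite3.connect(":memory:")
dest = sqlite3.connect(":memory:")
source.execute("CREATE TABLE readings (site_id INTEGER, value REAL)")
source.executemany("INSERT INTO readings VALUES (?, ?)",
                   [(i, float(i)) for i in range(10)])
dest.execute("CREATE TABLE readings_stage (site_id INTEGER, value REAL)")

BATCH_SIZE = 4  # larger batches run faster but hold more rows in memory

cursor = source.execute("SELECT site_id, value FROM readings")
while True:
    batch = cursor.fetchmany(BATCH_SIZE)
    if not batch:
        break
    # Build one multi-row INSERT per batch instead of one INSERT per row.
    placeholders = ", ".join(["(?, ?)"] * len(batch))
    params = [value for row in batch for value in row]
    dest.execute("INSERT INTO readings_stage VALUES " + placeholders, params)
dest.commit()
```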
This will greatly decrease the number of insertions. Nonetheless, a few caveats remain:
- A trade-off is introduced. Larger batch sizes will increase speed, but also increase memory usage
- Databases generally have limits on the maximum size of a batch insert command
- A lock is still held on the queried table(s) for longer than necessary while the results are fetched
Database developers have already considered the need to import large chunks of data into a table. As a result, the “load data infile” (MySQL) and “copy” (PostgreSQL) commands are available. These commands will consistently outperform “insert” for larger sets of data. They require CSV-formatted results, contained in a file or file-like object.
To utilize these commands, an adjustment can be made to the staging and batching technique:
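A sketch of the adjusted technique: the query results are dumped to a CSV file first, and the bulk-load command, which requires a live MySQL or PostgreSQL server, is shown only as a comment:

```python
import csv
import os
import sqlite3
import tempfile

# Illustrative source; in practice this is the remote database being queried.
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE readings (site_id INTEGER, value REAL)")
source.executemany("INSERT INTO readings VALUES (?, ?)",
                   [(1, 10.0), (2, 20.0)])

# Step 1: fetch the results as quickly as possible and dump them to CSV,
# so locks on the source are released as soon as the cursor is drained.
fd, csv_path = tempfile.mkstemp(suffix=".csv")
os.close(fd)
with open(csv_path, "w", newline="") as f:
    csv.writer(f).writerows(
        source.execute("SELECT site_id, value FROM readings"))

# Step 2: bulk-load the file into the staging table. The command is
# database-specific and needs a live server, so it is shown only as text:
#   MySQL:      LOAD DATA INFILE '<csv_path>' INTO TABLE readings_stage
#               FIELDS TERMINATED BY ','
#   PostgreSQL: COPY readings_stage FROM '<csv_path>' WITH (FORMAT csv)
```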
The query and insertion operations have been separated, meaning that the query results will be fetched as quickly as possible and locks will be held for the minimum amount of time. The major drawback here is that disk I/O has been introduced by writing the CSV to a file. This could potentially be avoided by using STDIN/STDOUT to send the data around, but the implementation varies between databases and may require tricks like providing the STDIN path in the file system. Instead, a file-like structure designed for this type of situation can be leveraged.
Unix systems offer a "first-in-first-out" structure via "mkfifo()". The file created by "mkfifo()" has a few important characteristics:
- Opening the file for reading will block until another process opens it for writing, and vice versa
- The reader will only receive an EOF once the last writer has closed the file
- Data is removed from the file as it is read
Windows offers named-pipes as well, but the implementation is slightly different.
The final result:
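A sketch of the full pipeline on a Unix system, with the same invented schema: a writer thread streams CSV into the fifo while the main thread reads it back. In production, the reading side would be a "load data infile" or "copy" command pointed at the fifo path rather than a plain read:

```python
import csv
import os
import sqlite3
import tempfile
import threading

# Illustrative source data; check_same_thread=False lets the writer thread
# use a connection created here (accesses never overlap in this sketch).
source = sqlite3.connect(":memory:", check_same_thread=False)
source.execute("CREATE TABLE readings (site_id INTEGER, value REAL)")
source.executemany("INSERT INTO readings VALUES (?, ?)",
                   [(1, 10.0), (2, 20.0)])

fifo_path = os.path.join(tempfile.mkdtemp(), "transfer.fifo")
os.mkfifo(fifo_path)  # Unix only

def produce():
    # Opening the fifo for writing blocks until a reader opens it; then
    # CSV-formatted rows are streamed straight from the query cursor.
    with open(fifo_path, "w", newline="") as f:
        csv.writer(f).writerows(
            source.execute("SELECT site_id, value FROM readings"))

writer = threading.Thread(target=produce)
writer.start()

# In production, LOAD DATA INFILE / COPY would read from fifo_path; a plain
# read simulates the bulk load here. EOF arrives once the writer closes.
with open(fifo_path) as f:
    loaded = f.read()
writer.join()
os.remove(fifo_path)
```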
As CSV-formatted results are flushed to the pipe, they are simultaneously loaded into the database. This operation has a negligible disk and memory cost, while also utilizing the fastest method for bulk imports. Locking on the queried tables is minimized, and the destination table is never locked at all.
DB I/O package
DB I/O is a Python package that implements a safer variation of the “mkfifo()” code snippet by providing more sophisticated process management. The package also abstracts away all of the SQL syntax required for the supported databases. For example, using DB I/O, a PostgreSQL database can be queried and the results streamed into a MySQL table in two lines of code.
The package also features a command-line interface, with support for some convenient methods for exporting or importing custom formatted CSV files.
Currently, the package supports MySQL, PostgreSQL, Vertica, and SQLite databases, but can easily be extended to support any database that has a SQLAlchemy dialect.