By using the right compression techniques, we can accelerate database queries by orders of magnitude. Unsurprisingly, data warehouse vendors compete over compression ratios. Alas, we often have to choose between more compression or more speed. By using lightweight compression techniques, we can get both: reduced storage and improved speed. To maximize compression, it helps to organize the data in columns (at least within disk pages). Sorted tables are much more compressible. We can surpass the lexicographical sort (by 50% or more) if we apply row-reordering heuristics inspired by the Traveling Salesman Problem (TSP). Effectively, we seek an optimal tour through the rows of a table. We adapt the TSP to our compression technique: e.g., run-length encoding corresponds to the Hamming distance. Many TSP heuristics can scale to billions or rows. They can be made even more scalable if we adapt them specifically for our context (database compression) and for the chosen distance measures. Moreover, we want the compressibility of the columns to reflect the expected workload. For example, perharps we want several columns to have uniformly excellent compression, at the expense of other columns. This problem is closely related to the problem of finding "balanced Gray codes". While it is natural to view a table as a list of rows, we can also organize the rows in a graph (such as a spanning tree) if the storage of the graph structure is sufficiently inexpensive. Instead of seeking the best tour through the rows (as in the TSP), we seek the best spanning tree. The row-reordering problem then becomes a special case where the graph is constrained to a list. Using such a graph approach, preliminary results show that we can surpass the best row-reordering heuristics in improving compressibility with a small overhead at decompression time. We conjecture that this may provide a superior storage strategy for column-oriented databases. Our interest is not limited to the compression of relational databases: it extends to document-oriented databases. We conjecture that they could be just as efficient with respect to storage as relational databases, under mild assumptions.

Chercheur principal

Daniel Lemire

Organisme subventionnaire

CRSNG (Conseil de recherches en sciences naturelles et en génie du Canada)

Programme

Subvention à la découverte - individuelle

Secteur de recherche

Informatique cognitive

Années

2012 - 2017

Montant accordé

140 000,00 $