Parquet

First learned about this in CS451.

Apache Parquet stores data in a column-oriented, binary format, rather than row-by-row, which allows for highly efficient compression and faster analytical queries.

It organizes data into hierarchical structures:

  1. files are divided into Row Groups
  2. Files are broken down into Column Chunks
  3. Column chunks are broken down into Pages, with embedded metadata.

Data Compression

  • Dictionary Encoding: Parquet has an automatic dictionary encoding enabled dynamically for columns with a cardinality below 105 (i.e. number of unique values)
  • Bit Packing: Parquet packs multiple integers into the same space.
  • Run-Length Encoding