Parquet
First learned about this in CS451.
Apache Parquet stores data in a column-oriented, binary format, rather than row-by-row, which allows for highly efficient compression and faster analytical queries.
It organizes data into hierarchical structures:
- files are divided into Row Groups
- Files are broken down into Column Chunks
- Column chunks are broken down into Pages, with embedded metadata.
Data Compression
- Dictionary Encoding: Parquet has an automatic dictionary encoding enabled dynamically for columns with a cardinality below 105 (i.e. number of unique values)
- Bit Packing: Parquet packs multiple integers into the same space.
- Run-Length Encoding