Data Partitioning

Learned in SE464.

Data partitioning divides data by rows. When it is done manually at the application layer, this is called database sharding.

Data partitioning is a more general approach than functional partitioning: instead of splitting the data by tables, it distributes the rows of a table across nodes.

Pros

Because each row is stored once:

  • ✓ Capacity scales
  • ✓ Data is consistent

If the sharding key is chosen carefully:

  • ✓ Data will be balanced across nodes
  • ✓ Many queries will involve only one or a few shards (no central bottleneck)
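
As a rough illustration of routing by sharding key, here is a minimal sketch. It assumes a fixed number of shards, uses in-memory dicts as stand-ins for database nodes, and hashes a hypothetical `user_id` key. The names `NUM_SHARDS`, `shard_for`, `put`, and `get` are made up for this example and do not come from any particular library.

```python
import hashlib

NUM_SHARDS = 4  # assumed cluster size, for illustration only

def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a sharding key to a shard index using a stable hash."""
    # sha256 is stable across processes, unlike Python's built-in hash() for str.
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards

# Each shard is modeled as a dict standing in for one database node.
shards = [dict() for _ in range(NUM_SHARDS)]

def put(user_id: str, row: dict) -> None:
    """Store the row on exactly one shard, chosen by its sharding key."""
    shards[shard_for(user_id)][user_id] = row

def get(user_id: str):
    """A lookup by sharding key touches only the one shard that owns the key."""
    return shards[shard_for(user_id)].get(user_id)
```

Because each row lives on exactly one shard, a lookup by key never needs to consult the other nodes. A plain modulo scheme like this one would move most keys if the number of shards changed, which the sketch does not address.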

Cons

  • ✘ Cannot use plain SQL; queries must be manually adapted to match the sharding scheme
  • ✘ If the sharding key is chosen poorly, shard load will be imbalanced (by capacity or traffic)
  • ✘ Some queries will involve all shards, so their throughput does not scale with the number of shards: it is limited to what a single machine can handle
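
The difference between a single-shard query and an all-shard query can be sketched as follows. This is an illustrative example only: the `order_id` sharding key, the modulo routing rule, and the function names are assumptions, and the shards are plain dicts rather than real database nodes.

```python
from typing import Callable, Dict, List, Optional

Row = Dict[str, object]
Shard = Dict[int, Row]  # maps order_id -> row

def shard_index(order_id: int, num_shards: int) -> int:
    # Hypothetical routing rule: shard by order_id modulo the shard count.
    return order_id % num_shards

def build_shards(rows: List[Row], num_shards: int) -> List[Shard]:
    """Distribute rows across shards; each row is stored exactly once."""
    shards: List[Shard] = [{} for _ in range(num_shards)]
    for row in rows:
        order_id = int(row["order_id"])
        shards[shard_index(order_id, num_shards)][order_id] = row
    return shards

def get_order(shards: List[Shard], order_id: int) -> Optional[Row]:
    """Query on the sharding key: only one shard does any work."""
    return shards[shard_index(order_id, len(shards))].get(order_id)

def count_orders(shards: List[Shard], predicate: Callable[[Row], bool]) -> int:
    """Query on a non-key column: every shard must scan its rows (scatter),
    and the partial counts are then combined (gather)."""
    partial_counts = [
        sum(1 for row in shard.values() if predicate(row)) for shard in shards
    ]
    return sum(partial_counts)
```

In `count_orders`, every shard participates in every call, so adding more shards does not increase how many such queries the system can serve per second.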

Partitioning should minimize references between partitions. It can be treated as a graph partitioning problem.
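
To make the graph view concrete, the sketch below treats rows as vertices and references between rows as edges, then counts how many edges cross a partition boundary (the "cut size"). The example graph and the function name `cut_size` are hypothetical; this only evaluates a given assignment and does not implement a partitioning algorithm.

```python
from typing import Dict, Hashable, Iterable, Tuple

def cut_size(
    edges: Iterable[Tuple[Hashable, Hashable]],
    assignment: Dict[Hashable, int],
) -> int:
    """Count references whose two endpoints live on different partitions."""
    return sum(1 for a, b in edges if assignment[a] != assignment[b])

# Hypothetical reference graph: two tightly linked groups joined by one edge.
edges = [
    ("a", "b"), ("b", "c"), ("a", "c"),  # group 1
    ("d", "e"), ("e", "f"), ("d", "f"),  # group 2
    ("c", "d"),                          # the only link between the groups
]

# Assignment that keeps each group together on one partition.
grouped = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}

# Assignment that alternates rows between partitions, ignoring references.
alternating = {"a": 0, "b": 1, "c": 0, "d": 1, "e": 0, "f": 1}

print(cut_size(edges, grouped))      # 1 cross-partition reference
print(cut_size(edges, alternating))  # 5 cross-partition references
```

Both assignments put three rows on each partition, so they are equally balanced by capacity, but the grouped one leaves far fewer references that must be resolved across nodes.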