NoSQL

Learned in SE464.

Review of sharding: splits data across machines, accepts writes on all machines, but partitioning is manual and works well only for queries handled within a single shard. If we keep the relational model with normalized data, many queries still involve all nodes, so scaling is limited.

NoSQL databases solve this by denormalizing data: duplicating it so that queries can be answered by a single node.

NoSQL

Removing the ability to create references gives us a NoSQL database. Instead of following references with JOINs, we store denormalized data with copies of referenced data.

NoSQL DBs are key-value stores. Hashing is the basis of distributed NoSQL; see Distributed Hash Table.

Downsides

  • Just one indexed column (the key), because the index is built with hash-based partitioning
  • Denormalized data is duplicated
    • Wastes space
    • Cannot be edited in one place (e.g., “Greater Seattle Area” repeated in many user profiles instead of region: 91)
  • References are possible, but:
    • Following the reference requires another query, probably to another node
    • No constraint checking (refs can become invalid after delete)

Recap

  • Data partitioning is necessary to divide write load among nodes
  • SQL sharding was a special case of data partitioning, done in app code
  • NoSQL databases make partitioning easy by eliminating references
  • Without references, data becomes denormalized; duplicated data consumes more space and can become inconsistent
  • Very scalable, but provides only a simple key-value abstraction