NoSQL
Learned in SE464.
Review of sharding: splits data across machines, accepts writes on all machines, but partitioning is manual and works well only for queries handled within a single shard. If we keep the relational model with normalized data, many queries still involve all nodes, so scaling is limited.
NoSQL databases solve this by denormalizing data: duplicating it so that queries can be answered by a single node.
NoSQL
Removing the ability to create references gives us a NoSQL database. Instead of following references with JOINs, we store denormalized data with copies of referenced data.
NoSQL DBs are key-value stores. Hashing is the basis of distributed NoSQL; see Distributed Hash Table.
Downsides
- Just one indexed column (the key), because the index is built with hash-based partitioning
- Denormalized data is duplicated
- Wastes space
- Cannot be edited in one place (e.g., “Greater Seattle Area” repeated in many user profiles instead of
region: 91)
- References are possible, but:
- Following the reference requires another query, probably to another node
- No constraint checking (refs can become invalid after delete)
Recap
- Data partitioning is necessary to divide write load among nodes
- SQL sharding was a special case of data partitioning, done in app code
- NoSQL databases make partitioning easy by eliminating references
- Without references, data becomes denormalized; duplicated data consumes more space and can become inconsistent
- Very scalable, but provides only a simple key-value abstraction