Query Cost

Query cost is the DBMS’s estimate of how much work a query plan will take. Introduced in ECE459 L19. The dominant term is disk I/O; CPU time is nonzero but ignored for simplicity, as in [SKS11].

Why estimate at all?

The optimizer needs a number to compare alternative plans. Actually running a query to measure it defeats the point of planning, so we guess from metadata and statistics.

Two measurables of interest:

  • Block transfers: data moved in or out of memory
  • Disk seeks: repositioning on the disk

[SKS11] estimates assume the worst case: only one block per table fits in memory at a time. If reality is better, the actual cost is lower than estimated, which is the less dangerous direction to be wrong.

Things the estimate doesn’t capture

  • System load (concurrent operations may block)
  • Buffer contents (data already in memory skips disk ops)
  • Data layout (packed data means fewer seeks, distributed data enables parallel reads)

The lowest-cost plan is not necessarily the fastest in wall-clock terms. The DB optimizer is a “hypermiler”: it minimizes disk reads, even if a path with more reads but more parallelism would finish sooner.

Metadata used in estimates [SKS11, EN11]

  • : rows in table
  • : blocks containing
  • : bytes in , equal to blocks times block size
  • : rows per block, equal to
  • : distinct values of attribute in . Equals when is a key
  • : height of index on

Histograms divide a value range into buckets with counts. Distributions are rarely even. For “salary ≥ 150k), interpolate within the bucket.

Stats can go stale between updates, but “close enough, more often than not” suffices. The optimization only needs to beat doing nothing.