Ranked Retrieval

This builds upon Boolean Retrieval.

Similar to Boolean Retrieval, you can do this 2 ways:

  1. Document-at-a-time
  2. Term-at-a-time

Document-At-A-Time For each document:

  1. Score each query term, and add them all up
  2. Accumulate best k hits
  3. A min-heap is good for this

PRO: time is O(n log k), memory is O(k) – k probably a constant CON: can’t terminate early, must look at whole document collection

Term-At-A-Time

  1. Collect hits and ranking for rarest term into accumulator
  2. For each other term in the query:
  3. If a document does NOT have that term, remove from accumulator
  4. Otherwise, add next term’s ranking to overall ranking

PRO: Can have early termination heuristics, will not normally need to traverse all documents

CON: uses a lot of memory