Ranked Retrieval
This builds upon Boolean Retrieval.
Similar to Boolean Retrieval, you can do this 2 ways:
- Document-at-a-time
- Term-at-a-time
Document-At-A-Time For each document:
- Score each query term, and add them all up
- Accumulate best k hits
- A min-heap is good for this
PRO: time is O(n log k), memory is O(k) – k probably a constant CON: can’t terminate early, must look at whole document collection
Term-At-A-Time
- Collect hits and ranking for rarest term into accumulator
- For each other term in the query:
- If a document does NOT have that term, remove from accumulator
- Otherwise, add next term’s ranking to overall ranking
PRO: Can have early termination heuristics, will not normally need to traverse all documents
CON: uses a lot of memory