Best Match 25 (BM25)

BM25 (Best Match 25) is one of the most widely used ranking functions in information retrieval, for example in search engines. It belongs to the probabilistic retrieval framework and can be seen as a refinement of TF-IDF.

It is the ranking function behind modern search engines such as Elasticsearch.

  • Term Frequency (TF): A document with more occurrences of a query term should be more relevant, but only up to a point.
  • Inverse Document Frequency (IDF): Rare terms are more informative than very common ones (such as "the" or "and").
  • Length Normalization: Longer documents naturally contain more words, so scores are normalized by length.
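
To make the "only up to a point" behaviour of term frequency concrete, here is a minimal sketch of the saturating TF factor commonly used in BM25, for a document of exactly average length (so length normalization drops out). The function name `tf_component` and the default `k1 = 1.2` are illustrative assumptions, not something fixed by the text above.

```python
def tf_component(f, k1=1.2):
    """Saturating term-frequency factor for a document of average length.

    f  : number of times the query term occurs in the document
    k1 : saturation parameter (assumed default, chosen for illustration)
    """
    # Grows with f, but can never exceed k1 + 1.
    return f * (k1 + 1) / (f + k1)


if __name__ == "__main__":
    for f in (1, 2, 5, 20, 100):
        # Each extra occurrence adds less than the previous one.
        print(f, round(tf_component(f), 3))
```

Going from 1 to 2 occurrences raises the factor noticeably, while going from 20 to 100 barely moves it; the value approaches but never reaches `k1 + 1`.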

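The BM25 score of a document $D$ for a query $Q = (q_1, \dots, q_m)$ is:

$$\text{score}(D, Q) = \sum_{i=1}^{m} \text{IDF}(q_i) \cdot \frac{f(q_i, D)\,(k_1 + 1)}{f(q_i, D) + k_1 \left(1 - b + b \cdot \frac{|D|}{\text{avgdl}}\right)}$$

Where:

  • $f(q_i, D)$ = frequency of term $q_i$ in document $D$
  • $|D|$ = document length (in words)
  • $\text{avgdl}$ = average document length in the collection
  • $k_1$ = saturation parameter (≈ 1.2 to 2.0)
  • $b$ = length normalization parameter (≈ 0.75)
  • $\text{IDF}(q_i)$ = inverse document frequency: $\text{IDF}(q_i) = \ln \frac{N - n(q_i) + 0.5}{n(q_i) + 0.5}$
  • with $N$ = total number of documents, and $n(q_i)$ = number of documents containing $q_i$.
  • This comes from the Robertson-Spärck Jones (RSJ) weight, where we don’t know $R$ (the number of relevant documents) and $r$ (the number of relevant documents containing the term), so both are set to 0.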
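
The pieces above can be put together in a short sketch. It assumes documents are already tokenized into lists of words, that the corpus is non-empty, and that the IDF is the classic RSJ-style form log((N - n + 0.5) / (n + 0.5)). The function name `bm25_scores`, the toy corpus, and the defaults `k1 = 1.2`, `b = 0.75` are illustrative choices, not a reference implementation.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.2, b=0.75):
    """Return one BM25 score per document.

    query : list of query terms
    docs  : non-empty list of documents, each a list of tokens
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # n[t] = number of documents that contain term t at least once
    n = Counter(t for d in docs for t in set(d))

    scores = []
    for d in docs:
        tf = Counter(d)
        # Length-normalized saturation term: k1 * (1 - b + b * |D| / avgdl)
        norm = k1 * (1 - b + b * len(d) / avgdl)
        score = 0.0
        for t in query:
            f = tf[t]
            if f == 0:
                continue  # absent terms contribute nothing
            # Classic RSJ-style IDF; it becomes negative when a term occurs
            # in more than half of the documents.
            idf = math.log((N - n[t] + 0.5) / (n[t] + 0.5))
            score += idf * f * (k1 + 1) / (f + norm)
        scores.append(score)
    return scores


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat".split(),
        "the cat chased the other cat around the house all day".split(),
        "dogs are loyal pets".split(),
        "the quick brown fox".split(),
        "birds sing at dawn".split(),
    ]
    for doc, s in zip(docs, bm25_scores(["cat"], docs)):
        print(round(s, 3), " ".join(doc))
```

Because this IDF form can go negative for very common terms, some implementations add 1 inside the logarithm (Lucene's BM25 similarity does this); that variant is not used in the sketch.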