What currently determines which field scores best? BM25 scoring.
BM25 is the default scoring algorithm used by Apache Solr and Lucene.
This algorithm uses the document frequency of a term to estimate how important that term is relative to the other query terms.
So the rarer a query term is in the corpus, the more important it is considered for the overall query.
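The relationship between rarity and importance can be sketched in a few lines of Python. This is a minimal illustration, assuming the inverse document frequency formula that Lucene's BM25 similarity uses, ln(1 + (N - n + 0.5) / (n + 0.5)), where n is the number of documents containing the term in the field and N is the number of documents that have the field. The function name and the sample numbers are hypothetical.

```python
import math

def bm25_idf(doc_freq: int, doc_count: int) -> float:
    """Inverse document frequency: ln(1 + (N - n + 0.5) / (n + 0.5)).

    doc_freq  (n): documents that contain the term in the field
    doc_count (N): documents that have a value for the field
    """
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical corpus size, chosen only for illustration.
corpus_size = 10_000

# The rarer the term, the larger its IDF and so its weight in the score.
for doc_freq in (10, 1_000, 9_000):
    print(doc_freq, bm25_idf(doc_freq, corpus_size))
```

The printed weights decrease as the document frequency grows, which is the behaviour described above.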
This is fine for a single field search.
But using it in disjunction queries has a nasty effect:
Document Frequency is used to select the best field that matches a term (even a single term).
And it does that in a counter-intuitive way: the field where the term appears least often is considered the best.
The consequences are evident in the following example:
Corpus
Document: Comics Issue
Fields: id, title, heroes, villains
Query
Text: batman
Query Fields: heroes, villains
This results in a disjunction query: (heroes:batman | villains:batman)
So the score of each document will be the highest score among the potential heroes:batman match and the potential villains:batman match.
Let’s assume all other factors (term frequency, average document length, …) are tied: Document Frequency will then be the discriminant in deciding the best field match for each candidate document that needs to be ranked.
And currently, it works the opposite of what a user would intuitively expect: comics issues where batman was the villain of the story will skyrocket to the top.
Inverse Document Frequency is used to select the best field for a term rather than the most important term in a multi-term query.
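The effect can be reproduced with a small Python sketch. It assumes a simplified BM25 term score (IDF multiplied by a saturated term frequency), holds term frequency and field length equal for every match, and uses made-up document frequencies in which batman appears as a hero in many issues and as a villain in only a few. All names and numbers here are hypothetical.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # Inverse document frequency: rarer terms get a larger weight.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

def bm25_term_score(doc_freq, doc_count, tf=1.0, k1=1.2, b=0.75, length_ratio=1.0):
    # Simplified BM25: IDF times a saturated term frequency.
    # length_ratio is field length divided by average field length;
    # 1.0 means every field has average length (the "tie" assumed above).
    norm = k1 * (1 - b + b * length_ratio)
    return bm25_idf(doc_freq, doc_count) * tf / (tf + norm)

# Hypothetical statistics for the term "batman" in each query field.
DOC_COUNT = 10_000
DOC_FREQ = {"heroes": 2_000, "villains": 50}

def dismax_score(matching_fields):
    # Disjunction query: the document score is the highest field score.
    return max(bm25_term_score(DOC_FREQ[f], DOC_COUNT) for f in matching_fields)

hero_issue = dismax_score(["heroes"])        # batman is the hero
villain_issue = dismax_score(["villains"])   # batman is the villain

print("hero issue score:   ", hero_issue)
print("villain issue score:", villain_issue)
print("villain issue ranks first:", villain_issue > hero_issue)
```

With everything else tied, the only difference between the two scores is the document frequency of batman in each field, so the issue where batman is the villain ranks first.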
Rashmeet
May 19, 2022
Hi,
I’m trying to experiment with the sow param but am experiencing something weird.
For my queries that are processed by Solr with sow=false, I am able to set it to true and see the results change. But for those that end up in a “term centric” search, setting sow=false is not working for me. Even if I set sow=false, I still see term centric search happening in the parsed query and no changes in the result set. Any idea why this might be happening or how I can make it work, please?
Thanks
Alessandro Benedetti
May 30, 2022
Hi Rashmeet, thanks for your comment,
“for those that end up in a “term centric” search” can you give an example?