Apache Solr sow Parameter (Split on Whitespace) and Multi-Field Full-Text Search

The sow (split on whitespace) parameter of the eDismax query parser [1] regulates aspects of query-time text analysis that affect how the user query is parsed and how the internal Lucene query is built.
It is particularly relevant in multi-term and multi-field search.
If sow=true:

    • first the user query text is tokenized by splitting on whitespace
    • then, for each term in the token stream, N disjunction clauses are built, one for each of the query fields involved in the search
    • query-time text analysis happens (so the text analysis receives a single term per field as input)

e.g.

    • field1 has multi-term synonyms configured: uk, united kingdom, england, london, british, britain
    • field2 has English stemming configured for query time text analysis
sow = true
qf = field1 field2
q = united kingdom
defType = edismax

"parsedquery_toString":"+((field1:united | field2:unit) (field1:kingdom | field2:kingdom))"

No multi-term synonym expansion happened for field1, which is expected because the text analysis chain was called twice, each time on a single term: field1:united and field1:kingdom.
The text analysis chain for field1 never saw the text “united kingdom”, so no synonym expansion happened.
On the other hand, text analysis for field2 produced stemmed terms (field2:united => field2:unit), which is expected because stemming is applied token by token.
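
To reproduce this kind of output yourself, you can ask Solr for debug information alongside the results. The sketch below only builds the request URL; the host and the collection name (mycollection) are assumptions for illustration, while q, defType, qf, sow and debugQuery are the request parameters discussed in this post.

```python
from urllib.parse import urlencode

# Hypothetical host and collection name: adjust both to your own installation.
SOLR_SELECT_URL = "http://localhost:8983/solr/mycollection/select"


def build_debug_url(query, query_fields, sow):
    """Build an eDismax request URL that also asks Solr for query debug info."""
    params = {
        "q": query,
        "defType": "edismax",
        "qf": " ".join(query_fields),
        "sow": "true" if sow else "false",
        "debugQuery": "true",  # adds a "debug" section to the response
        "wt": "json",
    }
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_debug_url("united kingdom", ["field1", "field2"], sow=True)
print(url)
```

In the JSON response, the parsed query shown in the examples of this post is found under debug, in the parsedquery_toString entry.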


If sow=false:

    • the user query is not tokenized
    • a disjunction clause with the entire text is built for each of the query fields involved in the search
    • query-time text analysis happens (so the text analysis, for each field, receives the full text as input)

e.g.

    • field1 has multi-term synonyms configured: uk, united kingdom, england, london, british, britain
    • field2 has English stemming configured for query time text analysis
    • both have autoGeneratePhraseQueries="true" (this is needed for expanded multi-term synonyms)
sow = false
qf = field1 field2
q = united kingdom
defType = edismax

"parsedquery_toString":
"+((field2:unit field2:kingdom) | 
((field1:uk field1:\"united kingdom\" field1:england field1:london field1:british field1:britain)))",

Multi-term synonym expansion happened for field1, which is expected because the text analysis chain was called once, with the full text: field1:(united kingdom).
The text analysis chain for field1 analysed text “united kingdom” and the analysis chain took care of synonym expansion.
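
The difference between the two modes can be sketched with a toy model. This is not how Solr is implemented: the two analyzers below are simplified stand-ins (assumptions) for the synonym and stemming chains of the example, and the function only shows which text each analyzer gets to see.

```python
# Toy stand-in for the multi-term synonym configuration of field1.
SYNONYMS = {
    ("united", "kingdom"): ["uk", "united kingdom", "england",
                            "london", "british", "britain"],
}


def analyze_field1(text):
    """Toy synonym expansion: it only fires when it sees the whole phrase."""
    tokens = tuple(text.lower().split())
    return SYNONYMS.get(tokens, list(tokens))


def analyze_field2(text):
    """Toy stemmer: strips a trailing 'ed' (a crude stand-in for English stemming)."""
    return [t[:-2] if t.endswith("ed") else t for t in text.lower().split()]


ANALYZERS = {"field1": analyze_field1, "field2": analyze_field2}


def analysis_calls(query, sow):
    """Return, for each chunk handed to analysis, the terms produced per field."""
    # sow=true: analysis sees one whitespace-separated term at a time.
    # sow=false: analysis sees the full query text.
    chunks = query.split() if sow else [query]
    return [{field: analyze(chunk) for field, analyze in ANALYZERS.items()}
            for chunk in chunks]


print(analysis_calls("united kingdom", sow=True))
print(analysis_calls("united kingdom", sow=False))
```

With sow=True the field1 analyzer is called on "united" and on "kingdom" separately and never expands anything; with sow=False it receives "united kingdom" and returns the six synonyms.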

If your query-time text analysis chain needs the full text as input to work properly, because some of its token filters transform multiple sequential tokens into new ones, you must avoid the initial whitespace tokenization, so sow=false is mandatory.
e.g.

    • synonym mapping from multiple terms: united kingdom => uk, united kingdom, england, london, british, britain
    • shingle token filtering [2]: united kingdom exploration => united kingdom, kingdom exploration
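
The shingle case can be illustrated with a few lines of Python. This is a minimal sketch, not Solr's shingle filter (which has more options, for example it can also emit the single tokens); the shingles function is a hypothetical helper that mirrors the example above.

```python
def shingles(tokens, size=2):
    """Join every run of `size` consecutive tokens into one shingle."""
    return [" ".join(tokens[i:i + size]) for i in range(len(tokens) - size + 1)]


full_text = "united kingdom exploration".split()

# sow=false: the filter sees all three tokens in sequence.
print(shingles(full_text))  # ['united kingdom', 'kingdom exploration']

# sow=true: the filter is called once per term and has nothing to combine.
print([shingles([term]) for term in full_text])  # [[], [], []]
```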

If your text analysis is not compatible with whitespace tokenization, you must use sow=false.
e.g.

<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Using sow=true will just build multiple query clauses on terms that won’t match the index:
e.g.

sow = true
qf = author_keyword
q = united kingdom
defType = edismax

"parsedquery_toString":
"+((author_keyword:united) (author_keyword:kingdom))"

Using sow=false is mandatory in this case:

sow = false
qf = author_keyword
q = united kingdom
defType = edismax

"parsedquery_toString":
"+(author_keyword:united kingdom)",

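The mismatch for keyword-tokenized fields can be reproduced with a toy model. It is only a sketch: keyword_analyze is a hypothetical stand-in for the KeywordTokenizerFactory plus LowerCaseFilterFactory chain, and the index is reduced to a set of terms.

```python
def keyword_analyze(text):
    """Toy keyword analysis: keep the whole input as one lowercased token."""
    return [text.lower()]


# What the index holds for a document whose author value is "United Kingdom".
indexed_terms = set(keyword_analyze("United Kingdom"))


def matching_terms(query, sow):
    """Return the query terms that exist in the toy index."""
    chunks = query.split() if sow else [query]
    query_terms = [term for chunk in chunks for term in keyword_analyze(chunk)]
    return [term for term in query_terms if term in indexed_terms]


print(matching_terms("united kingdom", sow=True))   # []
print(matching_terms("united kingdom", sow=False))  # ['united kingdom']
```
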
If your field type is string (which is not analyzed), sow=false won’t help.
This is a current bug [3].
e.g.

<field name="author_s" type="string" indexed="true" stored="true" multiValued="false" />

Using sow=true will just build multiple query clauses on terms that won’t match the index:
e.g.

sow = true
qf = author_s
q = united kingdom
defType = edismax

"parsedquery_toString":
"+((author_s:united) (author_s:kingdom))"

Using sow=false also fails:

sow = false
qf = author_s
q = united kingdom
defType = edismax

"parsedquery_toString":
"+((author_s:united) (author_s:kingdom))"

Given the evident advantages of setting sow=false, why does this parameter exist? In other words, why and when should you set it to true?

The following considerations assume the tie parameter [4] is set to the default 0.
Further considerations on the topic follow.

    • If your text analysis produces exactly the same number of tokens for all query fields, you won’t see any difference between sow=true and sow=false.
    • If your text analysis produces a different number of tokens per field from the input text, the difference kicks in: if you want documents containing more query terms (not necessarily in the same field) to generally score higher, you should use sow=true.
      With sow=true, each term contributes to the score once (if it matches in more than one field, the field that scores best is taken).
      Does this mean that matching more terms will always win over matching fewer terms? No, not necessarily:
      a match on a single term in a field where that term is very rare may still dominate, because there’s no coord factor anymore [5].
      The coord factor used to scale a document’s score by the fraction of query clauses it matched, so documents matching more query terms were rewarded.
      With sow=false, only one field contributes to the score (the one that scores best).
    • You should also use sow=true if you want to search for multiple values at the same time in keyword-tokenized fields.
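
The two scoring strategies described above can be written down as a small sketch. It is a simplified model under the stated assumption tie=0, not Solr's actual scoring code, and the per-field, per-term scores are invented numbers for a single document.

```python
# Invented per-field, per-term scores for one document (not real BM25 output).
scores = {
    "title":   {"united": 1.0, "kingdom": 0.0},
    "subject": {"united": 0.0, "kingdom": 2.0},
}


def term_centric(scores):
    """sow=true, tie=0: every term contributes its best field score."""
    terms = {term for per_term in scores.values() for term in per_term}
    return sum(max(per_term.get(term, 0.0) for per_term in scores.values())
               for term in terms)


def field_centric(scores):
    """sow=false with differing analysis, tie=0: only the best field contributes."""
    return max(sum(per_term.values()) for per_term in scores.values())


print(term_centric(scores))   # 3.0
print(field_centric(scores))  # 2.0
```

In this toy document the two query terms match in different fields: the term-centric model rewards both matches, while the field-centric model keeps only the best field.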

What currently determines which field scores best? BM25 scoring.
BM25 is the default scoring algorithm used by Apache Solr and Lucene.
This algorithm uses the document frequency of a term to estimate how important that term is among the other query terms.
So the rarer a query term is in the corpus, the more important it is considered for the overall query.
This is fine for a single-field search.
But using it in disjunction queries has a nasty effect:
document frequency is used to select the best field that matches a term (even a single term).
And it does that in a counter-intuitive way: the field where the term appears least often is considered the best.
The consequences are evident in the following example:
Corpus
Document: Comics Issue
Fields: id, title, heroes, villains

Query
Text: batman
Query Fields: heroes, villains

This results in a disjunction query: (heroes:batman | villains:batman)
So the score of each document will be the higher of the potential heroes:batman match and the potential villains:batman match.
Assuming all other factors (term frequency, average document length, …) are tied, document frequency will be the discriminating factor in deciding the best field match for each candidate document that needs to be ranked.
And currently it works the opposite of what a user would intuitively expect: assuming batman appears as a villain in far fewer documents than as a hero, the villains:batman match gets the higher score, and comics issues where batman was the villain of the story will sky-rocket to the top.
Inverse Document Frequency ends up selecting the best field for a term rather than the most important term in a multi-term query.
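
The effect can be checked with the inverse document frequency formula used by Lucene's BM25 similarity. The corpus size and the document frequencies below are invented for illustration, and a single document count is assumed for both fields.

```python
import math


def bm25_idf(doc_freq, doc_count):
    """Inverse document frequency as computed by Lucene's BM25 similarity."""
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))


doc_count = 1000                          # invented corpus size
doc_freq = {"heroes": 400, "villains": 5}  # invented: batman is rarely a villain

idf = {field: bm25_idf(n, doc_count) for field, n in doc_freq.items()}
best_field = max(idf, key=idf.get)
print(idf)
print(best_field)  # villains
```

With everything else tied, the rarer field wins the disjunction, which is exactly the counter-intuitive behavior described above.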

The tie parameter [4] flattens the impact of the “best” match relative to all other matches.
In the extreme case of tie=1, where all matches simply contribute as a sum, the choice between sow=true and sow=false becomes irrelevant for scoring purposes (all the observations on multi-term token filtering still apply).
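
How tie blends the matches of a disjunction can be sketched in a few lines: the score is the best match plus tie times the sum of the other matches. The field scores are invented numbers, and dismax_score is a hypothetical helper name.

```python
def dismax_score(field_scores, tie=0.0):
    """Best field score plus `tie` times the sum of the other field scores."""
    best = max(field_scores)
    return best + tie * (sum(field_scores) - best)


field_scores = [3.0, 1.0, 0.5]  # invented scores of one term in three fields

print(dismax_score(field_scores, tie=0.0))  # 3.0  (only the best field counts)
print(dismax_score(field_scores, tie=0.5))  # 3.75
print(dismax_score(field_scores, tie=1.0))  # 4.5  (plain sum of all matches)
```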

The sow parameter also affects the mm parameter [6].
When the parsed query moves from being term-centric (sow=true) to field-centric (sow=false and different text analysis per field), mm means two different things.
With sow=true, it is the minimum number of query terms matched, regardless of the field in which they match:

sow = true
mm=2
qf = author subjects_as_same_term
q = united kingdom
defType = edismax

"parsedquery_toString":
"+(((author:united | subjects_as_same_term:united) (author:kingdom | subjects_as_same_term:kingdom))~2)"
"response":{"numFound":2,"start":0,"maxScore":7.757958,"numFoundExact":true,"docs":[
      {
        "id":"888888",
        "author":"united",
        "subjects":["kingdom"],
        "score":7.757958},
      {
        "id":"77777",
        "author":"united kingdom",
        "score":5.874222}]
  },

With sow=false, it is the minimum number of query terms matched within the same field (i.e. all the required query terms must be found in one of the fields):

sow = false
mm=2
qf = author subjects_as_same_term
q = united kingdom
defType = edismax

"parsedquery_toString":
"+(((author:united author:kingdom)~2) | 
(((subjects_as_same_term:uk subjects_as_same_term:\"united kingdom\" 
subjects_as_same_term:england subjects_as_same_term:london 
subjects_as_same_term:british subjects_as_same_term:britain))~1))"

Here (author:united author:kingdom)~2 means both clauses must match for a document to be a candidate, in disjunction with
(subjects_as_same_term:uk subjects_as_same_term:\"united kingdom\" subjects_as_same_term:england subjects_as_same_term:london subjects_as_same_term:british subjects_as_same_term:britain)~1, which means at least one clause must match (because synonym expansion turned the two original terms into a single one).

"response":{"numFound":1,"start":0,"maxScore":5.874222,"numFoundExact":true,"docs":[
      {
        "id":"77777",
        "author":"united kingdom",
        "score":5.874222}]
  }
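
The two meanings of mm can be mimicked with a toy matcher over the two documents of the example. This is a simplified sketch: fields are reduced to lists of tokens, and the synonym expansion on the subjects field is left out (doc 888888 would not match any of the expanded synonym terms either).

```python
# The two example documents, reduced to lists of tokens per field.
docs = {
    "888888": {"author": ["united"], "subjects": ["kingdom"]},
    "77777":  {"author": ["united", "kingdom"], "subjects": []},
}


def matches_term_centric(doc, terms, mm):
    """sow=true: count the query terms found in any field."""
    found = sum(any(term in tokens for tokens in doc.values()) for term in terms)
    return found >= mm


def matches_field_centric(doc, terms, mm):
    """sow=false: mm must be satisfied inside a single field."""
    return any(sum(term in tokens for term in terms) >= mm
               for tokens in doc.values())


terms = ["united", "kingdom"]
print([i for i, d in docs.items() if matches_term_centric(d, terms, mm=2)])
# ['888888', '77777']
print([i for i, d in docs.items() if matches_field_centric(d, terms, mm=2)])
# ['77777']
```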

The eDismax query parser is a tool that offers many configuration points, and it has probably become too complicated.
Let’s summarize the current outstanding problems:

    • if the query fields share the same text analysis and no multi-term token filtering is involved, sow=true and sow=false make no difference
    • since coord has been removed, sow=true won’t necessarily favor multi-term matches as strongly anymore
    • when a disjunction is involved in choosing the best field for a query term, BM25 Inverse Document Frequency means the “less popular” field match is always chosen when tie-breaking

We are going to continue working on the subject; follow our blog and research for more info!
Is a new query parser coming to the Apache Solr world?
Stay Tuned!

We are Sease, an Information Retrieval Company based in London, focused on providing R&D project guidance and implementation, Search consulting services, Training, and Search solutions using open source software like Apache Lucene/Solr, Elasticsearch, OpenSearch and Vespa.

3 Responses

  1. Hi,
    I’m trying to experiment with the sow param but am experiencing something weird.
    For my queries that are processed by Solr with sow=false, I am able to set it to true and see the results change. But for those that end up in a “term centric” search, setting sow=false is not working for me. Even if i set sow=false, I still see term centric search happening in the parsed query and no changes in the result set. Any idea why this might be happening or how I can make it work please?
    Thanks
