Apache Solr sow Parameter (Split on Whitespace) and Multi-Field Full-Text Search

How the eDismax sow parameter works

sow (split on whitespace) is an eDismax query parser parameter [1] that controls one aspect of query-time text analysis: whether the user query is split on whitespace before analysis. This affects how the user query is parsed and how the internal Lucene query is built.
It is particularly relevant in multi-term and multi-field search.
If sow=true:

- first, the user query text is tokenized by splitting on whitespace
- then, for each term in the token stream, a disjunction of N clauses is built, one clause for each of the query fields involved in the search
- finally, query-time text analysis happens (so the text analysis receives a single term per field as input)

e.g.

- field1 has multi-term synonyms configured: uk, united kingdom, england, london, british, britain
- field2 has English stemming configured for query time text analysis

sow = true
qf = field1 field2
q = united kingdom
defType = edismax

"parsedquery_toString":"+((field1:united | field2:unit) (field1:kingdom | field2:kingdom))"

No multi-term synonym expansion happened for field1, which is expected because the text analysis chain was called two separate times on single terms: field1:united and field1:kingdom.
The text analysis chain for field1 has never seen the text “united kingdom”, so no synonym expansion happened.
On the other hand, text analysis for field2 has produced stemmed terms (field2:united => field2:unit), which is expected as stemming is applied token by token.
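
The three steps above can be sketched with a toy model. This is not Solr's implementation: the two analyzer functions are hypothetical stand-ins for the field1 and field2 analysis chains, and the stemmer is reduced to a one-entry lookup table, just enough to reproduce the parsed query shown above.

```python
# Toy model of term-centric (sow=true) query building. Not Solr code.

def analyze_field1(text):
    # Hypothetical stand-in for field1's chain: lowercase + whitespace tokens.
    # A multi-term synonym rule for "united kingdom" could never fire here,
    # because this function only ever receives one term at a time.
    return text.lower().split()

def analyze_field2(text):
    # Hypothetical stand-in for English stemming: a tiny lookup table.
    stems = {"united": "unit"}
    return [stems.get(token, token) for token in text.lower().split()]

ANALYZERS = {"field1": analyze_field1, "field2": analyze_field2}

def build_term_centric(q, qf):
    clauses = []
    for term in q.split():                        # step 1: split on whitespace
        per_field = []
        for field in qf:                          # step 2: one clause per query field
            for token in ANALYZERS[field](term):  # step 3: analysis sees a single term
                per_field.append(f"{field}:{token}")
        clauses.append("(" + " | ".join(per_field) + ")")
    return "+(" + " ".join(clauses) + ")"

parsed = build_term_centric("united kingdom", ["field1", "field2"])
print(parsed)
```

Each term becomes its own disjunction across the fields, which is why this shape is usually called term-centric.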

If sow=false:

- the user query is not tokenized up front
- a disjunction clause with the entire text is built for each of the query fields involved in the search
- query-time text analysis happens (so the text analysis, for each field, receives the full text as input)

e.g.

- field1 has multi-term synonyms configured: uk, united kingdom, england, london, british, britain
- field2 has English stemming configured for query time text analysis
- both have autoGeneratePhraseQueries="true" (this is needed for expanded multi-term synonyms)

sow = false
qf = field1 field2
q = united kingdom
defType = edismax

"parsedquery_toString":
"+((field2:unit field2:kingdom) | 
((field1:uk field1:\"united kingdom\" field1:england field1:london field1:british field1:britain)))",

Multi-term synonym expansion happened for field1, which is expected because the text analysis chain was called once, with the full text: field1:(united kingdom).
The text analysis chain for field1 analysed the text “united kingdom” and the analysis chain took care of synonym expansion.
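
A minimal sketch of why the synonym rule only fires when the analysis chain sees the full text. The `SYNONYMS` table and `analyze_with_synonyms` function are hypothetical simplifications (the rule matches only when the whole input equals the multi-term expression), not Solr's synonym filter.

```python
# Toy model: a multi-term synonym rule needs to see all of its tokens at once.

SYNONYMS = {
    ("united", "kingdom"): ["uk", "united kingdom", "england",
                            "london", "british", "britain"],
}

def analyze_with_synonyms(text):
    tokens = tuple(text.lower().split())
    # The rule is keyed on the full token sequence; a lone "united" or
    # "kingdom" does not match it and passes through unchanged.
    return SYNONYMS.get(tokens, list(tokens))

def clause(field, token):
    # Multi-word synonyms are rendered as phrase clauses.
    return f'{field}:"{token}"' if " " in token else f"{field}:{token}"

# sow=false: the chain is called once with the full text.
full = analyze_with_synonyms("united kingdom")

# sow=true: the chain is called once per whitespace-separated term.
split = [tok for term in "united kingdom".split()
         for tok in analyze_with_synonyms(term)]

field1_clause = "(" + " ".join(clause("field1", t) for t in full) + ")"
print(field1_clause)
print(split)
```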

When is sow=false strictly necessary?

If you have text analysis chains configured at query time that need the full text as input to work properly, because they contain token filters that transform multiple sequential tokens into new ones, you need to avoid the initial whitespace tokenization, so sow=false is mandatory, e.g.:

- synonym mappings from multi-term expressions: united kingdom => uk, united kingdom, england, london, british, britain
- shingle token filtering [2]: united kingdom exploration => united kingdom, kingdom exploration
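
The shingle case can be illustrated with a small sketch. The `shingles` helper is hypothetical and emits only the two-token shingles from the example above (Solr's shingle filter also emits the single tokens by default); the point is only that a shingle cannot be formed from a single token.

```python
def shingles(tokens, size=2):
    # Join every run of `size` consecutive tokens into one shingle.
    return [" ".join(tokens[i:i + size])
            for i in range(len(tokens) - size + 1)]

text = "united kingdom exploration"

# Full text in input (what sow=false gives the analysis chain).
full = shingles(text.split())

# One term at a time (what sow=true gives the analysis chain).
per_term = [s for term in text.split() for s in shingles([term])]

print(full)      # two shingles
print(per_term)  # nothing: a single token has no neighbour to pair with
```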

If your text analysis is not compatible with whitespace tokenization, you must use sow=false, e.g.:

<fieldType name="text_keyword" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

Using sow=true will just build multiple query clauses on terms that won’t match the index, e.g.:

sow = true
qf = author_keyword
q = united kingdom
defType = edismax

"parsedquery_toString":
"+((author_keyword:united) (author_keyword:kingdom))"

Using sow=false is mandatory in this case:

sow = false
qf = author_keyword
q = united kingdom
defType = edismax

"parsedquery_toString":
"+(author_keyword:united kingdom)",

When sow=false is not enough

If your field type is a string (which is not analyzed), sow=false won’t help.
This is a current bug [3], e.g.:

<field name="author_s" type="string" indexed="true" stored="true" multiValued="false" />

Using sow=true will just build multiple query clauses on terms that won’t match the index, e.g.:

sow = true
qf = author_s
q = united kingdom
defType = edismax

"parsedquery_toString":
"+((author_s:united) (author_s:kingdom))"

Using sow=false also fails:

sow = false
qf = author_s
q = united kingdom
defType = edismax

"parsedquery_toString":
"+((author_s:united) (author_s:kingdom))"

Why should you set sow=true?

Given the evident advantages of setting sow=false, why does this parameter exist? In other words, why and when should you set it to true?

The following considerations assume the tie parameter [4] is set to the default 0.
Further considerations on the topic follow.
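
To experiment with these considerations, it may help to send the same query twice, once per sow value, with tie made explicit and debugQuery enabled so the response includes the parsed query. The sketch below only builds the request URLs; the host, port and the collection name `mycollection` are assumptions, and the field names are the placeholders used earlier in this post.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint; adjust host, port and collection to your setup.
BASE_URL = "http://localhost:8983/solr/mycollection/select"

def build_url(sow):
    params = {
        "defType": "edismax",
        "q": "united kingdom",
        "qf": "field1 field2",
        "sow": "true" if sow else "false",
        "tie": "0.0",          # the default, made explicit
        "debugQuery": "true",  # asks Solr to return debug info, incl. the parsed query
    }
    return BASE_URL + "?" + urlencode(params)

url_sow_true = build_url(True)
url_sow_false = build_url(False)
print(url_sow_true)
print(url_sow_false)
```

Comparing the parsed query of the two responses shows whether the query ended up term-centric or field-centric for your own field types.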

- If your text analysis produces exactly the same number of tokens for all query fields, you won’t see any difference between sow=true and sow=false.
- If your text analysis produces a different number of tokens per field from the input text, the difference kicks in:
  With sow=true, each term contributes to the score once (if it matches in more than one field, the field that scores best is taken).
  You should prefer this strategy if you generally want documents that match more query terms (not necessarily in the same field) to score higher.
  Does this mean that matching more terms will always win over matching fewer terms? No, not necessarily: a match on a single term in a field where that term is very rare may still dominate, because there’s no coord factor anymore [5].
  The coord factor used to scale a document's score by the fraction of query clauses it matched (m/n for m matching clauses out of n), so documents matching more terms were rewarded.
  With sow=false, only one field contributes to the score (the one that scores best).
- If you want to search for multiple values at the same time in keyword-tokenized fields, sow=true gives each whitespace-separated value its own clause.
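
The two scoring shapes above can be compared on made-up numbers. This is a toy model with tie=0 and hypothetical per-(field, term) scores, not real BM25 output: `term_centric` sums the best field per term, while `field_centric` takes the best field after summing its terms.

```python
TERMS = ["united", "kingdom"]
FIELDS = ["author", "subjects"]

def term_centric(scores):
    # sow=true, tie=0: each term contributes once, through its best field.
    return sum(max(scores.get((f, t), 0.0) for f in FIELDS) for t in TERMS)

def field_centric(scores):
    # sow=false, tie=0: only the best-scoring field contributes.
    return max(sum(scores.get((f, t), 0.0) for t in TERMS) for f in FIELDS)

# Hypothetical scores, keyed by (field, term).
doc_a = {("author", "united"): 2.0, ("subjects", "kingdom"): 3.0}  # two terms, two fields
doc_b = {("author", "united"): 4.0}                                # one term only
doc_c = {("author", "united"): 6.0}                                # one very rare term

for name, doc in [("doc_a", doc_a), ("doc_b", doc_b), ("doc_c", doc_c)]:
    print(name, term_centric(doc), field_centric(doc))
```

With these numbers doc_a beats doc_b under the term-centric shape but loses under the field-centric one, while doc_c wins in both cases: without coord, a single high-scoring match can still outrank a document matching more terms.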

Document Frequency

What currently affects which field scores best? BM25 scoring.
BM25 is the default scoring algorithm used by Apache Solr and Lucene. This algorithm uses the document frequency of a term to estimate how important that term is among the other query terms: the rarer a query term is in the corpus, the more important it is considered for the overall query.
This is fine for a single-field search.
But using it in disjunction queries has a nasty effect: document frequency is also used to select the best field that matches a term (even a single term). And it does that in a counter-intuitive way: the field in which the term appears the least is considered the best.
The consequences are evident:

Corpus
Document: Comics Issue
Fields: id, title, heroes, villains

Query
Text: batman
Query Fields: heroes, villains

This results in a disjunction query: (heroes:batman | villains:batman)
So the score of each document will be the highest score between the potential heroes:batman match and the potential villains:batman match.
Assuming all other factors (term frequency, average document length, …) are tied, document frequency will be the discriminant in deciding the best field match for each candidate document that needs to be ranked. And currently, it works the opposite of what a user would intuitively expect: comics issues where batman was the villain of the story will skyrocket to the top.
Inverse Document Frequency ends up being used to select the best field for a term, rather than the most important term in a multi-term query.
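
The effect can be checked with the IDF component of BM25 as implemented in Lucene's BM25Similarity, ln(1 + (N - df + 0.5) / (df + 0.5)). The corpus size and document frequencies below are invented for illustration only.

```python
import math

def bm25_idf(doc_freq, doc_count):
    # IDF component of BM25 as used by Lucene's BM25Similarity.
    return math.log(1 + (doc_count - doc_freq + 0.5) / (doc_freq + 0.5))

# Hypothetical corpus: 1000 comics issues.
DOC_COUNT = 1000
DF_HEROES = 400   # issues where batman appears in the heroes field (assumed)
DF_VILLAINS = 5   # issues where batman appears in the villains field (assumed)

idf_heroes = bm25_idf(DF_HEROES, DOC_COUNT)
idf_villains = bm25_idf(DF_VILLAINS, DOC_COUNT)

print(f"heroes:batman   idf = {idf_heroes:.3f}")
print(f"villains:batman idf = {idf_villains:.3f}")
```

Because batman is rarer in the villains field, villains:batman gets the larger IDF and, with everything else tied, wins the disjunction.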

Tie parameter

The tie parameter [4] controls how much the matches other than the “best” one contribute to the score, flattening the dominance of the best match.
In the extreme case tie=1, all matches simply contribute as a sum, so the choice between sow=true and sow=false no longer influences scoring (all the observations on multi-term token filtering still apply).
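
A small worked example of the tie behaviour, using the disjunction-max formula (best score plus tie times the sum of the other scores). The per-(field, term) scores are hypothetical, and the toy assumes both fields produce the same tokens, which is what makes the two shapes coincide at tie=1.

```python
def dismax(scores, tie):
    # Disjunction-max: the best score, plus tie times the sum of the others.
    best = max(scores)
    return best + tie * (sum(scores) - best)

TERMS = ["united", "kingdom"]
FIELDS = ["author", "subjects"]

# Hypothetical scores for one document, keyed by (field, term).
S = {
    ("author", "united"): 2.0,
    ("author", "kingdom"): 1.0,
    ("subjects", "united"): 0.5,
    ("subjects", "kingdom"): 3.0,
}

def term_centric(tie):
    # sow=true shape: one disjunction per term, summed over the terms.
    return sum(dismax([S[(f, t)] for f in FIELDS], tie) for t in TERMS)

def field_centric(tie):
    # sow=false shape: one disjunction over the per-field sums.
    return dismax([sum(S[(f, t)] for t in TERMS) for f in FIELDS], tie)

for tie in (0.0, 0.5, 1.0):
    print(tie, term_centric(tie), field_centric(tie))
```

At tie=0 the two shapes disagree; at tie=1 both reduce to the plain sum of all four scores.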

MM (minimum should match)

The sow parameter affects the mm parameter [6].
When the parsed query moves from being term-centric (sow=true) to field-centric (sow=false and different text analysis per field), mm means two different things.
With sow=true, it is the minimum number of query terms matched, regardless of which field they match in:

sow = true
mm=2
qf = author subjects_as_same_term
q = united kingdom
defType = edismax

"parsedquery_toString":
"+(((author:united | subjects_as_same_term:united) (author:kingdom | subjects_as_same_term:kingdom))~2)"

"response":{"numFound":2,"start":0,"maxScore":7.757958,"numFoundExact":true,"docs":[
    {
      "id":"888888",
      "author":"united",
      "subjects":["kingdom"],
      "score":7.757958},
    {
      "id":"77777",
      "author":"united kingdom",
      "score":5.874222}]
},

With sow=false, it is the minimum number of query terms matched within the same field (i.e. all the required query terms must be found in one of the fields):

sow = false
mm=2
qf = author subjects_as_same_term
q = united kingdom
defType = edismax

"parsedquery_toString":
"+(((author:united author:kingdom)~2) | 
(((subjects_as_same_term:uk subjects_as_same_term:\"united kingdom\" 
subjects_as_same_term:england subjects_as_same_term:london 
subjects_as_same_term:british subjects_as_same_term:britain))~1))"

Here (author:united author:kingdom)~2 means both clauses must match for a document to be a candidate. It is in disjunction with
(subjects_as_same_term:uk subjects_as_same_term:\"united kingdom\" subjects_as_same_term:england subjects_as_same_term:london subjects_as_same_term:british subjects_as_same_term:britain)~1, which means at least one clause must match (because synonym expansion collapsed the two original terms into a single one).

"response":{"numFound":1,"start":0,"maxScore":5.874222,"numFoundExact":true,"docs":[
    {
      "id":"77777",
      "author":"united kingdom",
      "score":5.874222}]
}
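
The two meanings of mm can be replayed on the two documents from the responses above. This is a toy matcher, not Solr's logic: documents are reduced to token lists, and it is assumed that subjects_as_same_term holds the same content as subjects.

```python
# Documents from the responses above, as token lists per field.
DOCS = {
    "888888": {"author": ["united"], "subjects_as_same_term": ["kingdom"]},
    "77777": {"author": ["united", "kingdom"], "subjects_as_same_term": []},
}

def contains(tokens, clause):
    # True if the clause (a single term or a phrase) occurs in the token list.
    words = clause.split()
    n = len(words)
    return any(tokens[i:i + n] == words for i in range(len(tokens) - n + 1))

def term_centric_match(doc, terms, fields, mm):
    # sow=true: count the query terms found in at least one field.
    matched = sum(1 for t in terms if any(contains(doc[f], t) for f in fields))
    return matched >= mm

# sow=false: per-field clauses and per-field mm, as in the parsed query above.
FIELD_CLAUSES = {
    "author": (["united", "kingdom"], 2),
    "subjects_as_same_term": (
        ["uk", "united kingdom", "england", "london", "british", "britain"], 1),
}

def field_centric_match(doc):
    # A document matches if at least one field satisfies its own mm.
    return any(sum(contains(doc[f], c) for c in clauses) >= mm
               for f, (clauses, mm) in FIELD_CLAUSES.items())

fields = ["author", "subjects_as_same_term"]
term_hits = [d for d, doc in DOCS.items()
             if term_centric_match(doc, ["united", "kingdom"], fields, 2)]
field_hits = [d for d, doc in DOCS.items() if field_centric_match(doc)]
print(term_hits)
print(field_hits)
```

Document 888888 only satisfies mm when the two terms may come from different fields, which matches the drop from two results to one.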

Outstanding problems

The eDismax query parser offers many configuration points, and it has probably become too complicated.
Let’s summarize the current outstanding problems:

- if the query fields share the same text analysis and no multi-term token filtering is involved, sow=true or sow=false makes no difference
- since coord has been removed, sow=true won’t necessarily favor multi-term matches that strongly anymore
- when a disjunction is involved in choosing the best field for a query term, BM25 Inverse Document Frequency means the “less popular” field match is always chosen when all other factors are tied

Next steps

We are going to continue working on the subject. Follow our blog and research for more info!
Is a new query parser coming to the Apache Solr world?
Stay Tuned!

Need Help With This Topic?

If you’re struggling with the split on whitespace (sow) parameter, don’t worry – we’re here to help! Our team offers expert services and training to help you optimize your Solr search engine and get the most out of your system. Contact us today to learn more!


apache lucene, apache solr, bm25, edismax, query, query parsers


3 Responses

Rashmeet says:

May 19, 2022 at 8:33 pm

Hi,
I’m trying to experiment with the sow param but am experiencing something weird.
For my queries that are processed by Solr with sow=false, I am able to set it to true and see the results change. But for those that end up in a “term centric” search, setting sow=false is not working for me. Even if i set sow=false, I still see term centric search happening in the parsed query and no changes in the result set. Any idea why this might be happening or how I can make it work please?
Thanks

1. Alessandro Benedetti says:
  
  May 30, 2022 at 12:58 pm
  
  Hi Rashmeet, thanks for your comment,
  “for those that end up in a “term centric” search” can you give an example?
  
Pingback: Resolved: Which analyzer is used while using fuzzy operator with query_string clause? - Daily Developer Blog

Alessandro Benedetti
