Why Solr More Like This Ignores Your copyField – How to Deal With It

When using CloudMLTQParser (the default More Like This (MLT) query parser in SolrCloud), fields that are populated only as copyField destinations are not taken into account when the More Like This query is constructed.

This issue arises because the CloudMLTQParser relies on a RealTime Get request to fetch the source document by ID. In a distributed environment the MLT request might be handled by a shard that does not host the target document, prompting a RealTime Get call to fetch it.
However, the document returned by RealTime Get includes only the original fields and excludes any content added via copyField, since such fields are not part of the original SolrInputDocument. As a result, the CloudMLTQParser silently ignores these fields, and the generated MLT query ends up empty.

This behaviour differs from SimpleMLTQParser (used in Solr standalone), which does not rely on RealTime Get but works directly with the indexed document (Index Reader).

The standalone parser first attempts to retrieve term vectors; if they are not available, it falls back to extracting the stored field content and reapplying the analysis chain to compute term frequencies [code].
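The fallback described above can be pictured with a small sketch. It is only a toy model: the function names and the lowercase/regex tokenizer are assumptions made for illustration, standing in for the field's configured analysis chain, and none of this is Lucene's or Solr's actual API.

```python
import re
from collections import Counter

def analyze(text):
    # Hypothetical stand-in for an analysis chain: lowercase, split on non-word characters.
    return re.findall(r"\w+", text.lower())

def term_frequencies(term_vector, stored_values):
    """Prefer a precomputed term vector; otherwise re-analyze the stored content."""
    if term_vector is not None:
        return Counter(term_vector)
    freqs = Counter()
    for value in stored_values:
        freqs.update(analyze(value))
    return freqs

# No term vector available, so the stored values are re-analyzed.
tf = term_frequencies(None, ["Solr search", "search engine"])
```

Either path reads the indexed document, which is why a copyField destination that is stored or has term vectors is visible to the standalone parser.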

Bug Description

Jira Issue

Suppose you are using the description field in a More Like This query, but you realize you need to adjust the text analysis to better fit your current needs. To do this, you decide to create a new field with a custom field type.

You use Solr’s copyField functionality [1] to populate the new field descriptionMLT from description.

Schema:

<field name="description" type="text_general" indexed="true" stored="true"/>
<field name="descriptionMLT" type="text_general_mlt" indexed="true" stored="true"/>

<copyField source="description" dest="descriptionMLT"/>

Once the schema changes are applied, including the reindexing of your documents, you are ready to run a More Like This (MLT) query using the new field to take advantage of the updated text analysis:

/select?q={!mlt qf=descriptionMLT}doc_id

The resulting parsed query will be empty:

"parsedquery": "+() -id:{doc_id}"

This happens because, when retrieving the document with a RealTime Get request, the descriptionMLT field is not present in the document. As a result, the More Like This component has no terms to build the query with, leading to no results being returned.
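To make the failure mode concrete, here is a minimal toy model of it. The functions `realtime_get` and `build_mlt_query` are hypothetical and only mimic the behaviour described above; they are not Solr classes, and the query string format is simplified.

```python
def realtime_get(input_docs, doc_id):
    # Toy model: the document comes back as it was submitted, without copyField targets.
    return dict(input_docs[doc_id])

def build_mlt_query(doc, doc_id, qf):
    # Collect terms only from requested fields that are present in the document.
    terms = []
    for field in qf:
        for value in doc.get(field, []):
            terms.extend(f"{field}:{t}" for t in value.lower().split())
    return f"+({' '.join(terms)}) -id:{doc_id}"

# The submitted document has "description" only; "descriptionMLT" exists just in the index.
input_docs = {"doc_id": {"description": ["fast search engine"]}}
doc = realtime_get(input_docs, "doc_id")

empty_query = build_mlt_query(doc, "doc_id", ["descriptionMLT"])
full_query = build_mlt_query(doc, "doc_id", ["description"])
```

In this model, asking for descriptionMLT produces the same empty clause as the parsed query shown earlier, while asking for description yields one term per token.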

Immediate Workaround

If upgrading to the latest Solr version — where this issue has been resolved — is not an option, a temporary workaround is to avoid relying on copyField for any fields intended to be used in MLT queries.
Instead, these fields should be explicitly populated at indexing time to ensure they contain the necessary terms for query generation.
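One way to apply this workaround is to fill the target field on the client before sending documents to Solr. The sketch below assumes the field names from the schema above and a hypothetical helper, `with_explicit_mlt_field`. It also assumes the copyField rule for this pair is removed from the schema, since keeping both would presumably add the value twice.

```python
import json

def with_explicit_mlt_field(doc, source="description", target="descriptionMLT"):
    # Copy the source value into the target field client-side, replacing the copyField rule.
    enriched = dict(doc)
    if source in enriched:
        enriched[target] = enriched[source]
    return enriched

docs = [{"id": "1", "description": "fast search engine"}]

# JSON body that could be sent to the collection's update handler.
payload = json.dumps([with_explicit_mlt_field(d) for d in docs])
```

Because descriptionMLT is now part of the submitted document, it should also be part of what RealTime Get returns.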

Bug Fix

To address this bug, there are two possible solutions:

The first is to work around the current behaviour of RealTime Get: when a requested field is a copyField destination, take its values directly from the corresponding source fields of the returned document.
The second is a more structural fix that changes how RealTime Get works, for example by having it return copyField values as well.

In agreement with some committers, the first solution was chosen. This avoids, for the time being, changing the behaviour of the RealTime Get component, which explicitly excludes copyField targets.

This choice was made because copyField targets are effectively just duplicated data.
We could consider modifying the RealTime Get behaviour in the future by introducing a parameter like includeCopyFields (defaulting to false), to preserve the current behaviour while still offering flexibility when those fields are explicitly needed. However, such a change would be evaluated and addressed in a separate pull request.

Workaround Implementation

The issue was addressed by introducing a helper method, getFieldValuesIncludingCopyField, that retrieves field values from the source document returned by RealTime Get.

If the requested field (passed in the qf parameter of the MLT query) is present in the document returned by RealTime Get, its values are returned directly. If the field is not present, for example because it is populated only via copyField, the method inspects the schema to find all source fields that copy into the target field, then collects and aggregates their values.
These values are then used to construct the input for MoreLikeThis.like, ensuring that all content is considered when building the MLT query.
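The lookup logic described above can be sketched as follows. This is a Python approximation under stated assumptions, not the actual Solr code: the document is modelled as a dict of field name to value list, and the schema's copyField rules as a plain list of (source, dest) pairs.

```python
def get_field_values_including_copy_field(doc, field, copy_fields):
    """Return the values for `field`, falling back to its copyField sources.

    doc: field name -> list of values, as returned by RealTime Get (toy model).
    copy_fields: list of (source, dest) pairs standing in for the schema rules.
    """
    if field in doc:
        # The field was part of the submitted document: use it directly.
        return list(doc[field])
    values = []
    for source, dest in copy_fields:
        if dest == field:
            # Aggregate the values of every source that copies into the target.
            values.extend(doc.get(source, []))
    return values

rules = [("title", "text_mlt"), ("description", "text_mlt")]
rtg_doc = {"id": ["1"], "title": ["Solr MLT"], "description": ["copyField bug"]}

aggregated = get_field_values_including_copy_field(rtg_doc, "text_mlt", rules)
direct = get_field_values_including_copy_field(rtg_doc, "title", rules)
missing = get_field_values_including_copy_field({"id": ["2"]}, "text_mlt", rules)
```

An empty result for a document with no source values matches the behaviour where a missing source field produces no matches.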

Validation and Testing

To ensure the correctness and robustness of the workaround, a set of focused tests was introduced to validate the behaviour of the MoreLikeThis (MLT) query when copyField destinations are used in the qf parameter.

The tests cover the following key scenarios:

CopyField destination in qf retrieves values from source fields and returns expected results:
Verified that when the qf parameter references a field populated via copyField, the field values are correctly retrieved from the source field and the MLT query successfully returns the expected results.
The generated MLT query is non-empty and correctly built:
Ensured that the underlying parsed query includes the expected tokens from the source fields and avoids falling back to an empty or invalid query structure.
Missing source field results in no matches:
Confirmed that when the source field of a copyField destination is missing from the document, no results are returned.
Multiple copyField sources are correctly aggregated:
Validated that when a destination field is populated from multiple source fields, all source values are combined and used in the generated MLT query.

General MLT Bug Identified for Future Fix

While investigating this issue, a more general limitation of the MoreLikeThis (MLT) feature emerged.
Specifically, MLT relies on term statistics such as document frequency (DF), which indicates how many documents contain a given term within the index.

In SolrCloud, however, when requesting similar documents by ID, the query may be received by a shard that doesn’t host the target document for similarity comparison. Because the document is fetched via RealTime Get, its term frequencies (TF) can still be calculated accurately, but the document frequency (DF) used in the query is local to the node handling the request, not global across the entire index.

This results in incomplete and potentially misleading similarity scoring, as Lucene does not natively support distributed document frequency computation.
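A small numeric sketch shows how much a shard-local DF can diverge from the global one. The per-shard statistics are made up, and the smoothed IDF formula is an assumption chosen for illustration, not a claim about the exact weighting Solr applies.

```python
import math

def idf(doc_freq, num_docs):
    # Classic smoothed inverse document frequency, used here only to illustrate the effect.
    return 1.0 + math.log((num_docs + 1) / (doc_freq + 1))

# Hypothetical statistics for one term on each shard: (docFreq, numDocs).
shards = {"shard1": (1, 1000), "shard2": (400, 1000)}

# What each node would compute from its own index only.
local_idf = {name: idf(df, n) for name, (df, n) in shards.items()}

# What a globally aggregated view of the index would give.
global_df = sum(df for df, _ in shards.values())
global_n = sum(n for _, n in shards.values())
global_idf = idf(global_df, global_n)
```

With these numbers the term looks rare on shard1 and common on shard2, so the weight it receives depends on which node happens to handle the request.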


Ilaria Petreti
