When using CloudMLTQParser (the default More Like This (MLT) query parser in SolrCloud), fields that are populated as copyField destinations are not taken into account when constructing the More Like This query.
This issue arises because the CloudMLTQParser relies on a RealTime Get request to fetch the source document by ID. In a distributed environment the MLT request might be handled by a shard that does not host the target document, prompting a RealTime Get call to fetch it.
However, the document returned by RealTime Get includes only the original fields and excludes any content added via copyField, since such fields are not part of the original SolrInputDocument. As a result, the CloudMLTQParser silently ignores these fields, and the generated MLT query ends up empty.
This behaviour differs from SimpleMLTQParser (used in Solr standalone), which does not rely on RealTime Get but works directly with the indexed document (Index Reader).
SimpleMLTQParser first attempts to retrieve term vectors; if they are not available, it falls back to extracting the stored field content and reapplying the analysis chain to compute term frequencies [code].
Bug Description
Suppose you are using the description field in a More Like This query, but you realize you need to adjust the text analysis to better fit your current needs. To do this, you decide to create a new field with a custom field type.
You use Solr’s copyField functionality [1] to populate the new field descriptionMLT from description.
Schema:
Once the schema changes are applied, including the reindexing of your documents, you are ready to run a More Like This (MLT) query using the new field to take advantage of the updated text analysis:
/select?q={!mlt qf=descriptionMLT}doc_id
The resulting parsed query will be empty:
"parsedquery": "+() -id:{doc_id}"
This happens because, when retrieving the document with a RealTime Get request, the descriptionMLT field is not present in the document. As a result, the More Like This component has no terms to build the query with, leading to no results being returned.
Immediate Workaround
If upgrading to the latest Solr version — where this issue has been resolved — is not an option, a temporary workaround is to avoid relying on copyField for any fields intended to be used in MLT queries.
Instead, these fields should be explicitly populated at indexing time to ensure they contain the necessary terms for query generation.
Bug Fix
To address this bug, there are potentially two solutions:
- One approach is to work around the current behaviour of RealTime Get by ensuring the necessary field values are retrieved directly from the source document, rather than relying on copyField.
- Alternatively, a more structural fix could involve changing how RealTime Get works, for example by having it return copyField values as well.
In agreement with some committers, it was decided to proceed with the first solution to avoid, for the time being, changing the behaviour of the RealTime Get Component, where copyField targets are explicitly excluded, as shown here.
This choice was made because copyField targets are effectively just duplicated data.
We could consider modifying the RealTime Get behaviour in the future by introducing a parameter like includeCopyFields (defaulting to false), to preserve the current behaviour while still offering flexibility when those fields are explicitly needed. However, such a change would be evaluated and addressed in a separate pull request.
Workaround Implementation
The issue was addressed by introducing a helper method, getFieldValuesIncludingCopyField, that retrieves field values from the source document returned by RealTime Get.
If the requested field (passed in the qf parameter of the MLT query) is present in the document returned by the RTG, its values are returned directly. However, in cases where the field is not explicitly present—such as when it is populated only via copyField—the method inspects the schema to retrieve all source fields that copy into the target field. It then collects and aggregates their values.
These values are then used to construct the input for MoreLikeThis.like, ensuring that all content is properly considered when building the MLT query.
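The fallback logic described above can be sketched roughly as follows. This is a simplified model, not the actual Solr code: the document returned by RealTime Get is represented as a plain map, and the schema lookup is replaced by a hypothetical `copyFieldSources` map (destination field to source fields). Only the method name comes from the fix; the class name, parameters, and sample data are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified, hypothetical model of the fallback; not the actual Solr implementation.
class CopyFieldFallbackSketch {

    /**
     * @param rtgDoc           field name -> values, standing in for the document
     *                         returned by RealTime Get (no copyField destinations)
     * @param copyFieldSources destination field -> source fields, standing in for
     *                         the schema's copyField rules (assumed structure)
     * @param field            a field requested in the qf parameter
     */
    static List<Object> getFieldValuesIncludingCopyField(
            Map<String, List<Object>> rtgDoc,
            Map<String, List<String>> copyFieldSources,
            String field) {
        // Case 1: the field is present in the RTG document, so use it directly.
        List<Object> direct = rtgDoc.get(field);
        if (direct != null) {
            return direct;
        }
        // Case 2: the field is only a copyField destination, so collect the
        // values of every source field that copies into it.
        List<Object> aggregated = new ArrayList<>();
        for (String source : copyFieldSources.getOrDefault(field, Collections.emptyList())) {
            List<Object> values = rtgDoc.get(source);
            if (values != null) {
                aggregated.addAll(values);
            }
        }
        return aggregated;
    }

    public static void main(String[] args) {
        Map<String, List<Object>> doc = new HashMap<>();
        doc.put("id", List.<Object>of("doc_id"));
        doc.put("description", List.<Object>of("a sample description"));

        Map<String, List<String>> rules = new HashMap<>();
        rules.put("descriptionMLT", List.of("description"));

        // descriptionMLT is absent from the RTG document, so the values come from description.
        System.out.println(getFieldValuesIncludingCopyField(doc, rules, "descriptionMLT"));
    }
}
```

In this model, a destination whose source fields are all missing from the document yields an empty list, which matches the "no matches" scenario covered by the tests.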
Validation and Testing
To ensure the correctness and robustness of the workaround, a set of focused tests was introduced to validate the behaviour of the MoreLikeThis (MLT) query when copyField destinations are used in the qf parameter.
The tests cover the following key scenarios:
- CopyField destination in qf retrieves values from source fields and returns expected results:
Verified that when the qf parameter references a field populated via copyField, the field values are correctly retrieved from the source field and the MLT query successfully returns the expected results.
- The generated MLT query is non-empty and correctly built:
Ensured that the underlying parsed query includes the expected tokens from the source fields and avoids falling back to an empty or invalid query structure.
- Missing source field results in no matches:
Confirmed that when the source field of a copyField destination is missing from the document, no results are returned.
- Multiple copyField sources are correctly aggregated:
Validated that when a destination field is populated from multiple source fields, all source values are combined and used in the generated MLT query.
General MLT Bug Identified for Future Fix
While investigating this issue, a more general limitation of the MoreLikeThis (MLT) feature emerged.
Specifically, MLT relies on term statistics such as document frequency (DF), which indicates how many documents contain a given term within the index.
In SolrCloud, however, when requesting similar documents by ID, the query may be received by a shard that doesn’t host the target document for similarity comparison. The document is then fetched via RealTime Get, which allows accurate calculation of term frequency (TF) for the document, but the document frequency (DF) used in the query is local to the shard handling the request, not global across the entire index.
This results in incomplete and potentially misleading similarity scoring, as Lucene does not natively support distributed document frequency computation.
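To make the effect concrete, here is a small sketch comparing an inverse document frequency computed from shard-local statistics with one computed from global statistics. The formula follows the classic TF-IDF form, log((docCount + 1) / (docFreq + 1)) + 1; whether a given deployment scores with exactly this formula is an assumption, and all the numbers are invented for illustration.

```java
// Illustrative only: the figures are made up, and the idf formula is the
// classic TF-IDF variant, assumed here rather than taken from the article.
class LocalVsGlobalDfSketch {

    // idf = ln((docCount + 1) / (docFreq + 1)) + 1
    static double idf(long docFreq, long docCount) {
        return Math.log((docCount + 1) / (double) (docFreq + 1)) + 1.0;
    }

    public static void main(String[] args) {
        // Hypothetical shard handling the request: the term looks rare here.
        long localDocFreq = 2;
        long localDocCount = 1_000;

        // Hypothetical whole collection: the same term is actually common.
        long globalDocFreq = 5_000;
        long globalDocCount = 10_000;

        System.out.println("local idf:  " + idf(localDocFreq, localDocCount));
        System.out.println("global idf: " + idf(globalDocFreq, globalDocCount));
    }
}
```

With these invented figures, the handling shard treats the term as far more distinctive than it is across the collection, so it could be weighted too heavily when MLT selects and boosts its query terms.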
Need Help with this topic?
If you’re struggling with Solr More Like This, don’t worry – we’re here to help!
Our team offers expert services and training to help you optimize your Solr search engine and get the most out of your system. Contact us today to learn more!





