Hi everyone! In this blog post, I would like to talk about the current usage of the Feature Vector Cache in Solr.
You can find a brief introduction to Learning To Rank in Solr in this blog post!
(This blog post has been written looking at the code that is likely to be in Apache Solr 9.0 and at the commit id=cfc953b6b90 in the main branch of the Solr git repository)
TL;DR;
At the moment the feature vector cache is only used when you enable the feature transformer in the fl parameter (for both insertions and lookups). It would be interesting to use the feature vector cache also at reranking time, independently of the feature transformer.
We are planning a contribution.
LTR query performance
FEATURE VECTOR CACHE
The benefits of the Feature Vector Cache are not very intuitive from the Apache Solr documentation alone. Let's see how it works directly from the Solr code.
Insertions and lookups are done in a specific class: org.apache.solr.ltr.FeatureLogger.
Insertions are done in the org.apache.solr.ltr.FeatureLogger#log method.
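A paraphrased sketch of this method (the field and helper names below approximate the Solr 9 LTR code; this is not the verbatim source) looks like this:

// Paraphrased sketch of FeatureLogger#log (approximate, not the verbatim Solr source)
public boolean log(int docid, LTRScoringQuery scoringQuery,
    SolrIndexSearcher searcher, LTRScoringQuery.FeatureInfo[] featuresInfo) {
  // build the per-document feature vector from the extracted feature values
  final Object featureVector = makeFeatureVector(featuresInfo);
  if (featureVector == null) {
    return false;
  }
  // insertion: put the feature vector into the feature vector cache configured
  // in solrconfig.xml, keyed by the scoring query and the document id
  return searcher.cacheInsert(fvCacheName, fvCacheKey(scoringQuery, docid), featureVector) != null;
}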
Currently, the condition in the org.apache.solr.ltr.LTRRescorer#scoreSingleHit method is always satisfied. We suspect it ended up this way due to code maintenance problems; we are investigating this further and will update this post with more details later on.
Let’s see why.
Here is the called method:
org.apache.solr.ltr.LTRRescorer#scoreSingleHit
...
if (hitUpto < topN) {
  reranked[hitUpto] = hit;
  // if the heap is not full, maybe I want to log the features for this
  // document
  logHit = true;
} else if (hitUpto == topN) {
  // collected topN document, I create the heap
  heapify(reranked, topN);
}
...
This piece of code runs during the reranking phase of the topN documents (topN corresponds to the reRankDocs value chosen at query time inside the rq parameter). Since we are reranking exactly these topN documents, the if clause hitUpto < topN is always satisfied for them (we are iterating over them) and therefore the logHit variable is always set to true.
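As a reminder, a reranking query parameter of this shape (the model name and the reRankDocs value here are just illustrative) is what sets that topN:

rq={!ltr model=my_model reRankDocs=100}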
Then a second condition arises.
In org.apache.solr.ltr.LTRRescorer#logSingleHit both a FeatureLogger and a SolrIndexSearcher are required.
The SolrIndexSearcher condition is always satisfied, so let's focus on the FeatureLogger one.
A FeatureLogger is set if both 1 AND 2 are satisfied:
the feature transformer is used in the fl query parameter. In this case the extractFeatures variable in the org.apache.solr.ltr.search.LTRQParserPlugin.LTRQParser#parse method is set to true at line 164.
the feature store defined in the feature transformer is the same as the feature store defined in the model OR just the feature transformer [features] component has been set, without specifying the store to be used.
Therefore, to have an insertion, these are the two conditions to satisfy:
the feature transformer in the fl query parameter is used.
the feature store defined in the fl parameter is the same as the feature store defined in the model OR just the feature transformer [features] component has been set, without specifying the store to be used.
Only for the curious ones
If you want to go more in-depth into these conditions you can take a look at the org.apache.solr.ltr.search.LTRQParserPlugin.LTRQParser#parse method. Here the extractFeatures variable is set at line 164:
final boolean extractFeatures = SolrQueryRequestContextUtils.isExtractingFeatures(req);
The computed extractFeatures is then used at line 183, where the condition on the feature store is implemented, and the resulting featuresRequestedFromSameStore is used to set the FeatureLogger at line 198.
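Putting it together, a paraphrased sketch of this flow (simplified, with variable and helper names that approximate the Solr 9 code rather than quoting it verbatim) would be:

// Paraphrased sketch of LTRQParserPlugin.LTRQParser#parse (not the verbatim source)
// line 164: are features requested through the [features] transformer?
final boolean extractFeatures = SolrQueryRequestContextUtils.isExtractingFeatures(req);
// store requested by the transformer; null when no store= argument was given
// (helper name approximated here)
final String transformerStoreName = SolrQueryRequestContextUtils.getFvStoreName(req);
// line 183: features must be requested AND come from the model's own store
// (or from no explicitly named store at all)
final boolean featuresRequestedFromSameStore =
    (transformerStoreName == null || modelFeatureStoreName.equals(transformerStoreName))
        ? extractFeatures
        : false;
// line 198: only in that case a FeatureLogger is attached to the scoring query,
// and only then insertions into the feature vector cache can happen
if (featuresRequestedFromSameStore) {
  scoringQuery.setFeatureLogger(SolrQueryRequestContextUtils.getFeatureLogger(req));
}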
Lookups
As mentioned before, lookups are part of the transformer process. To enable the transformer, just pass the <transformer name> (defined in solrconfig.xml) in the fl parameter, e.g. fl=[features].
The first condition, on OriginalRankingLTRScoringQuery, is always true when interleaving is not running, and therefore we always access the cache for lookups. Interleaving gives you the possibility of comparing a learned model with the original Solr score. The original Solr score doesn't require features to be extracted, so unless hasExplicitFeatureStore is set, the lookup wouldn't be necessary (a paraphrased sketch of this check is shown after the two conditions below).
Here we can see that the cache is only used by the feature logger transformer and not during the reranking phase done through the rq parameter: we are not making the reranking faster.
Therefore, to have a lookup, these are the two conditions to satisfy:
the feature transformer in the fl query parameter is used.
the LTR model is a learned model (and not the original Solr score pre-reranking) OR a feature store has been explicitly defined in the transformer.
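Here is that paraphrased sketch of the check performed by the feature logger transformer (simplified; the getFeatureVector call and the surrounding names are approximations of the Solr 9 code, not a verbatim quote):

// Paraphrased sketch of the lookup condition in the feature transformer
// (names are approximate, not the verbatim Solr source)
if (!(scoringQuery instanceof OriginalRankingLTRScoringQuery) || hasExplicitFeatureStore) {
  // either a learned model is reranking, or a feature store was explicitly
  // requested in the transformer: look the feature vector up in the cache
  // (falling back to re-extraction, and a new insertion, on a cache miss)
  featureVector = featureLogger.getFeatureVector(docid, scoringQuery, searcher);
}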
EXAMPLES
Let’s make some example queries to see how the Solr cache behaves.
First query
This is our first query. It has the [features] component defined in the fl query parameter and also the rq query parameter has been used.
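Such a query could look like this (the q and reRankDocs values are illustrative; the model and store names are the ones referenced below):

q=test
rq={!ltr model=first_model reRankDocs=100}
fl=id,score,[features store=first_model_store]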
For insertions we can see that:
Condition 1: the transformer is defined in fl ('[features=…]'). Verified: YES.
Condition 2: the transformer store defined is the same as the model store (TRUE) OR the transformer [features] has been set, without specifying the store to be used (FALSE). Verified: YES.
All the conditions for insertions are satisfied and therefore insertions happen.
For lookups we can see that:
Condition 1: the transformer is defined in fl ('[features=…]'). Verified: YES.
Condition 2: the model is a learned model ('first_model') (TRUE) OR an explicit store has been defined in the transformer ('first_model_store') (TRUE). Verified: YES.
All the conditions for lookups are satisfied and therefore lookups happen.
Second query
This is our second query. It has the [features] component defined in the fl query parameter and also the rq query parameter has been used.
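A query of this shape could be (again, q and reRankDocs are illustrative; note that the transformer now asks for a different store than the model's):

q=test
rq={!ltr model=first_model reRankDocs=100}
fl=id,score,[features store=second_model_store]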
For insertions we can see that:
Condition 1: the transformer is defined in fl ('[features=…]'). Verified: YES.
Condition 2: the transformer store defined is the same as the model store (FALSE) OR the transformer [features] has been set, without specifying the store to be used (FALSE). Verified: NO (the model store is 'first_model_store').
Not all the conditions for insertions are satisfied and therefore no insertion is made.
For lookups we can see that:
Condition 1: the transformer is defined in fl ('[features=…]'). Verified: YES.
Condition 2: the model is a learned model ('first_model') (TRUE) OR an explicit store has been defined in the transformer ('second_model_store') (TRUE). Verified: YES (N.B. the model and transformer stores are different).
All the conditions for lookups are satisfied and therefore lookups happen.
Third query
This is our third query. It has the [features] component defined in the fl query parameter but no rq query parameter has been used.
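For example (q is illustrative; note that there is no rq parameter this time):

q=test
fl=id,score,[features store=second_model_store]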
For insertions we can see that:
Condition 1: the transformer is defined in fl ('[features=…]'). Verified: YES.
Condition 2: the transformer store defined is the same as the model store (FALSE) OR the transformer [features] has been set, without specifying the store to be used (FALSE). Verified: NO (there is no rq parameter).
Not all the conditions for insertions are satisfied and therefore no insertion is made.
For lookups we can see that:
Condition 1: the transformer is defined in fl ('[features=…]'). Verified: YES.
Condition 2: the model is a learned model (FALSE) OR an explicit store has been defined in the transformer ('second_model_store') (TRUE). Verified: YES.
Both conditions for lookups are satisfied and therefore lookups happen.
Fourth query
This is our fourth query. It has no [features] component defined in the fl query parameter but the rq query parameter has been used.
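For example (q, the model name and reRankDocs are illustrative; note that there is no [features] transformer in fl):

q=test
rq={!ltr model=first_model reRankDocs=100}
fl=id,score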
For insertions we can see that:
Condition 1: the transformer is defined in fl ('[features=…]'). Verified: NO.
Condition 2: the transformer store defined is the same as the model store (FALSE) OR the transformer [features] has been set, without specifying the store to be used (FALSE). Verified: NO.
Not all the conditions for insertions are satisfied and therefore no insertion is made.
For lookups we can see that:
Condition 1: the transformer is defined in fl ('[features=…]'). Verified: NO.
Condition 2: the model is a learned model (TRUE) OR an explicit store has been defined in the transformer (FALSE). Verified: YES.
Not all the conditions for lookups are satisfied and therefore no lookup is made.
Future work
We will soon open a Jira issue to integrate the Feature Vector Cache into the reranking phase, independently of the feature transformer.
Did you like this post about how the Feature Vector Cache is used in Apache Solr? Don't forget to subscribe to our newsletter to stay up to date with the Information Retrieval world!
Author
Anna Ruggero
Anna Ruggero is a software engineer passionate about Information Retrieval and Data Mining.
She loves to find new solutions to problems, suggesting and testing new ideas, especially those that concern the integration of machine learning techniques into information retrieval systems.