This blog post is about the Solr classification module and the way Lucene classification has been integrated at indexing time.
N.B. This is supported from Solr 6.1.
Classification can be integrated into Solr at two points:
- Indexing time: through an Update Request Processor
- Query time: through a Request Handler (similar to More Like This)
Classification Update Request Processor
Each Update Request Processor in a chain takes the added Document (to be indexed) as input; the Document is returned after all the processors in the chain have processed it in sequence.
This is important because the processor has a view of the entire Document, so it can operate on all the fields the Document is composed of.
The Classification Update Request Processor works as follows:
- The latest Index Reader is retrieved from the latest opened Searcher
- A Lucene Document Classifier is instantiated with the config parameters from solrconfig.xml
- A class is assigned by the classifier, taking into consideration all the relevant fields of the input Document
- A new field containing the assigned class is added to the original Document
- The Document goes through the next processing step
e.g. K Nearest Neighbours Classifier
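A minimal sketch of what a kNN chain could look like in solrconfig.xml. The parameter names (inputFields, classField, algorithm, knn.k, knn.minTf, knn.minDf) are the ones read by Solr's ClassificationUpdateProcessorFactory, so check them against your Solr version. The chain name classification, the field names title, content, author and cat, and all the values are placeholders chosen for illustration.

```xml
<!-- Hypothetical chain name; any name works as long as the update handler points to it -->
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <!-- Assumed field names; title is boosted with the field^boost syntax -->
    <str name="inputFields">title^1.5,content,author</str>
    <!-- Assumed name of the field that holds the class; must be stored for knn -->
    <str name="classField">cat</str>
    <str name="algorithm">knn</str>
    <!-- Illustrative values for the advanced kNN settings -->
    <str name="knn.k">20</str>
    <str name="knn.minTf">1</str>
    <str name="knn.minDf">5</str>
  </processor>
  <!-- The classifier must run before this processor, which performs the actual indexing -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

Placing the classification processor before RunUpdateProcessorFactory is what lets the class field be added to the Document before it reaches the index.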
e.g. Simple Naive Bayes Classifier
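A minimal sketch of a Naive Bayes chain in solrconfig.xml, under the same assumptions: the parameter names are those read by Solr's ClassificationUpdateProcessorFactory (worth verifying for your version), while the chain name classification and the field names title, content, author and cat are placeholders.

```xml
<!-- Hypothetical chain name; any name works as long as the update handler points to it -->
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <!-- Assumed field names; title is boosted with the field^boost syntax -->
    <str name="inputFields">title^1.5,content,author</str>
    <!-- Assumed class field; for bayes it must be indexed and ideally lightly analysed -->
    <str name="classField">cat</str>
    <str name="algorithm">bayes</str>
  </processor>
  <!-- The classifier must run before this processor, which performs the actual indexing -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
```

The kNN-specific settings are simply left out here, since they only apply to the knn algorithm.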
e.g. Update Handler Configuration
<requestHandler name="/update">
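A minimal sketch of a complete handler definition that routes updates through a classification chain. It assumes the chain is named classification (a placeholder) and declares the handler class explicitly; update.chain is the standard Solr parameter for selecting an update chain.

```xml
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <!-- "classification" is an assumed chain name; use the name of your own chain -->
    <str name="update.chain">classification</str>
  </lst>
</requestHandler>
```

With the chain set as a default, every document sent to /update is classified unless a request selects a different chain.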
|Default|Description|
|---|---|
|This config param is mandatory|The list of fields (comma separated) to be taken into consideration for the classification. Boosting syntax is supported for each field.|
|This config param is mandatory|The field that contains the class of the document. It must appear in the indexed documents. With the knn algorithm it must be stored. With the bayes algorithm it must be indexed and ideally not heavily analysed.|
|knn|The algorithm to use for the classification: knn (K Nearest Neighbours) or bayes (Simple Naive Bayes).|
|10|Advanced: the number of top docs to select in the MLT results to find the nearest neighbours.|
|1|Advanced: a term (from the input text) is taken into consideration by the algorithm only if it appears in at least this minimum number of docs in the index.|
|1|Advanced: a term (from the input text) is taken into consideration by the algorithm only if it appears at least this minimum number of times in the input.|
The possible uses for this Update Request Processor are countless.
In any scenario where we have documents with a class or category manually assigned in our Search System, automatic classification can be a perfect fit.
For example, in an e-commerce Search System, category assignment will require little human interaction once a valid initial corpus of products has been indexed with manually assigned categories.
Because it leverages the existing index, the overhead of the classification processing will be minimal.
After an initial human effort to build a good corpus of classified Documents, the Search System will be able to assign and index the class of incoming Documents automatically.
Of course, we must remember that for advanced classification scenarios that require in-depth tuning, this solution may not be optimal.
Alessandro Benedetti is the founder of Sease Ltd.
Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.