Solr Document Classification – Part 1 – Indexing Time
Introduction
This blog post is about the Solr classification module and the way Lucene classification has been integrated at indexing time.
N.B. This is supported from Solr 6.1 onwards.
Solr Document Classification
- Indexing time – through an Update Request Processor
- Query time – through a Request Handler (similar to More Like This)
Classification Update Request Processor
An Update Request Processor chain takes as input the added Document (to be indexed) and returns it after it has been processed by all the processors in the chain, in sequence.
This is important because each processor has a view of the entire Document, so it can operate on all the fields the Document is composed of.
Description
- The latest Index Reader is retrieved from the latest opened Searcher
- A Lucene Document Classifier is instantiated with the config parameters in the solrconfig.xml
- A Class is assigned by the classifier, taking into consideration all the relevant fields from the input document
- A new field is added to the original Document, with the class
- The Document goes through the next processing step
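The steps above can be sketched with a toy example. This is only an illustration of the flow, not the Lucene implementation: the index is a plain list of dictionaries, similarity is a simple count of shared terms, and the names `tokenize`, `classify_knn` and `process` are hypothetical. It also assumes that a document arriving with a class already set is left untouched.

```python
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real Solr analyzer."""
    return text.lower().split()


def classify_knn(indexed_docs, new_doc, input_fields, class_field, k=10):
    """Return the majority class among the k indexed docs most similar to new_doc."""
    # Collect the terms of the incoming document across the configured input fields.
    query_terms = set()
    for field in input_fields:
        query_terms.update(tokenize(new_doc.get(field, "")))

    # Score every already-classified document by the number of shared terms.
    scored = []
    for doc in indexed_docs:
        if not doc.get(class_field):
            continue
        doc_terms = set()
        for field in input_fields:
            doc_terms.update(tokenize(doc.get(field, "")))
        overlap = len(query_terms & doc_terms)
        if overlap > 0:
            scored.append((overlap, doc[class_field]))

    # Keep the k best neighbours and let them vote.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0] if votes else None


def process(indexed_docs, new_doc, input_fields, class_field, k=10):
    """Mimic the processor step: add the class field when a class can be assigned."""
    if new_doc.get(class_field):
        # Assumption: a class that is already present is left untouched.
        return new_doc
    assigned = classify_knn(indexed_docs, new_doc, input_fields, class_field, k)
    if assigned is None:
        return new_doc
    # Return a copy of the document with the new class field added.
    return dict(new_doc, **{class_field: assigned})
```

In the real processor the neighbours come from the index behind the latest opened Searcher, and the analysis of each field is the one defined in the schema.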
Configuration
K Nearest Neighbours Classifier
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title^1.5,content,author</str>
    <str name="classField">cat</str>
    <str name="algorithm">knn</str>
    <str name="knn.k">20</str>
    <str name="knn.minTf">1</str>
    <str name="knn.minDf">5</str>
  </processor>
  <!-- Last step of a custom chain: actually writes the document to the index -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Simple Naive Bayes Classifier
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title^1.5,content,author</str>
    <str name="classField">cat</str>
    <str name="algorithm">bayes</str>
  </processor>
  <!-- Last step of a custom chain: actually writes the document to the index -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
Update Handler Configuration
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">classification</str>
  </lst>
</requestHandler>
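As a rough sketch of how a client could send documents through this chain, the snippet below builds a JSON update request with Python's standard library. The base URL, the collection name `demo` and the helper name `build_update_request` are assumptions for illustration only; `update.chain` and `commit` are standard Solr request parameters. Passing `update.chain` explicitly is only needed when the chain is not set as a default on the handler.

```python
import json
import urllib.parse
import urllib.request


def build_update_request(base_url, collection, docs, chain=None):
    """Build (but do not send) a JSON update request for a Solr collection."""
    # commit=true makes the documents visible to searches straight away.
    params = {"commit": "true"}
    if chain:
        # Select the update chain per request instead of relying on handler defaults.
        params["update.chain"] = chain
    url = f"{base_url}/{collection}/update?{urllib.parse.urlencode(params)}"
    body = json.dumps(docs).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document without a class: the chain is expected to add one.
request = build_update_request(
    "http://localhost:8983/solr",
    "demo",
    [{"id": "news-1", "title": "Central bank raises rates", "content": "..."}],
    chain="classification",
)
# To actually send it against a running Solr instance:
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```

Once the update is committed, fetching the document back by id should show whether the class field was populated.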
Parameter | Default | Description |
---|---|---|
inputFields | None (mandatory) | Comma-separated list of the fields to take into consideration for the classification. Boosting syntax is supported for each field (e.g. title^1.5). |
classField | None (mandatory) | The field that contains the class of the document. It must appear in the indexed documents. With the knn algorithm it must be stored. With the bayes algorithm it must be indexed and ideally not heavily analysed. |
algorithm | knn | The algorithm to use for the classification: knn (K Nearest Neighbours) or bayes (Simple Naive Bayes). |
knn.k | 10 | Advanced – the number of top docs to select in the MLT results to find the nearest neighbours. |
knn.minDf | 1 | Advanced – a term (from the input text) is taken into consideration by the algorithm only if it appears in at least this number of docs in the index. |
knn.minTf | 1 | Advanced – a term (from the input text) is taken into consideration by the algorithm only if it appears at least this number of times in the input. |
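To make the parameters more concrete, here is a small sketch of how the `inputFields` boost syntax and the `knn.minTf` / `knn.minDf` thresholds could be interpreted. It is not the Solr parsing code: the function names are hypothetical, and the default boost of 1.0 for fields without an explicit boost is an assumption.

```python
def parse_input_fields(spec):
    """Turn 'title^1.5,content,author' into a field -> boost mapping."""
    boosts = {}
    for entry in spec.split(","):
        entry = entry.strip()
        if not entry:
            continue
        # 'title^1.5' -> ('title', '^', '1.5'); 'content' -> ('content', '', '')
        name, _, boost = entry.partition("^")
        # Assumption: a field without an explicit boost gets 1.0.
        boosts[name] = float(boost) if boost else 1.0
    return boosts


def select_terms(input_term_freqs, index_doc_freqs, min_tf=1, min_df=1):
    """Keep terms seen at least min_tf times in the input and in at least min_df indexed docs."""
    return sorted(
        term
        for term, tf in input_term_freqs.items()
        if tf >= min_tf and index_doc_freqs.get(term, 0) >= min_df
    )
```

For example, with `min_tf=2` and `min_df=5`, a term that occurs three times in the input but in only two indexed documents is dropped, because it is too rare in the index to be a reliable signal.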
Usage
Indexing news documents: we can use the already indexed news with a category to automatically tag upcoming stories with no human intervention.
E-commerce search system: category assignment will require little human interaction once a valid initial corpus of products has been indexed with manually assigned categories.
The possible uses for this Update Request Processor are countless.
In any scenario where we have documents with a class or category manually assigned in our search system, automatic classification can be a good fit.
Because it leverages the existing index, the overhead of the classification processing will be minimal.
After an initial human effort to build a good corpus of classified documents, the search system will be able to automatically index the class for upcoming documents.
Of course, we must remember that for advanced classification scenarios that require in-depth tuning, this solution may not be optimal.
Code
Author
Alessandro Benedetti
Alessandro Benedetti is the founder of Sease Ltd. Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.
Comments (14)
tonnebrre
May 3, 2016
Thanks for your article.
tonnebrre
May 10, 2016
How can I apply the patch?
Alessandro Benedetti
May 10, 2016
You don't need to, this is already part of official Solr 🙂
Cheers
tonnebrre
May 12, 2016
Please tell me which Solr version you use?
Alessandro Benedetti
May 12, 2016
It is already in the Solr code (trunk).
Not sure it is in any release yet!
Cheers
tonnebrre
May 16, 2016
Thanks a lot.
Tomas Ramanauskas
May 27, 2016
Hi, Alessandro, can you share some example on how to use this feature?
I have never used Solr before, but today I downloaded solr-6.1.0-68, which I think already contains your modifications.
I then created a demo core:
./solr create -c demo
And also modified the solr/demo/conf/solrconfig.xml file and added:
<str name="inputFields">title_t^1.5,author_s</str>
<str name="classField">cat_s</str>
<str name="algorithm">bayes</str>
I loaded a few documents:
curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book2",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book3",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
But what query shall I use to see the auto classification results?
Tomas Ramanauskas
May 27, 2016
From slide 28 in the http://www.slideshare.net/teofili/text-categorization-with-lucene-and-solr presentation I see that the category is automatically assigned if it doesn't exist in the category field, but I don't get anything assigned when I add new documents:
curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book4",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book5",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'
manohar c
June 30, 2016
Hi Alessandro,
I added classification in Solr, but it is showing an error like "Load error: Error loading class 'ClassificationUpdateProcessorFactory'".
Here is my solrconfig.xml:
<str name="inputFields">case_title^1.5,case_history</str>
<str name="classField">Issue_Group</str>
<str name="algorithm">bayes</str>
<str name="update.chain">classification</str>
Does Solr come with the classification algorithm, or do I need to add a jar file to the Solr path?
Please help me to do it.
Thanks and Regards.
Alessandro Benedetti
June 30, 2016
Hi Manohar,
Which version of Solr are you using?
Without applying any patch you need Solr 6.1.
Following the blog documentation should be enough to get it working; in case you need some help, let me know!
Cheers
manohar c
June 30, 2016
Hi Alessandro,
Thanks for the quick response.
I am using Solr 5.4.1, and it is showing an error like: "SolrException: Error loading class 'ClassificationUpdateProcessorFactory'"
How can I add the ClassificationUpdateProcessorFactory algorithm in Solr 5.4.1?
I have different types of categories like Battery failure, Fan failure, HDD, MEMORY, Power Supply etc., in a separate file.
I am pulling documents from SQL Server, and I need to assign those categories to my documents.
Let's say:
{"id": "5463789",
"case_history": "fan related issue",
"Issue_Group": "Fan failure"
}
{"id": "5463789",
"case_history": "memory related issues",
"Issue_Group": "MEMORY"
}
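Here is my solrconfig.xml:
<str name="inputFields">case_history</str>
<str name="classField">Issue_Group</str>
<str name="algorithm">knn</str>
<str name="knn.k">10</str>
<str name="knn.minTf">1</str>
<str name="knn.minDf">5</str>
<str name="update.chain">classification</str>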
Thanks in advance.
manohar c
July 4, 2016
I am using Solr 5.4.1, and it is showing an error like: "SolrException: Error loading class 'ClassificationUpdateProcessorFactory'".
Can you please tell me how I can do "document classification" in Solr 5.4.1?
Rizwaan Adil
August 17, 2016
I am facing the same problem. My classification field Issue_Group is not getting populated after the data pull is over.
No error messages are noticed in the logs either.
lehuyen
October 14, 2016
Could you please give me an example of how to do classification in Solr?