
Solr Document Classification – Part 1 – Indexing Time

Introduction to Solr Document Classification

This blog post describes the Solr classification module and how Lucene classification has been integrated at indexing time.

In previous posts we explored Lucene Classification and its extension to Document Classification.

The natural next step is to integrate the Classification module with Solr, so that Solr users can manage classification out of the box.

N.B. This is supported from Solr 6.1 onwards.

Solr Document Classification

Taking inspiration from the work of a dear friend [2], classification can be integrated in Solr on two sides:

Indexing time – through an Update Request Processor
Query time – through a Request Handler (similar to More Like This)

This article explores the indexing time integration: the Classification Update Request Processor.

Classification Update Request Processor

First of all, let's describe some basic concepts:

An Update Request Processor Chain, associated with an Update Handler, is a pipeline of Update Request Processors that are executed in sequence.
It takes as input the added Document (to be indexed) and returns the Document after it has been processed by every processor in the chain.

Finally, the Document is indexed.

An Update Request Processor is the unit of processing of a chain: it takes a Document as input and processes it before passing it to the next processor in the chain, if any.

The main purpose of an Update Request Processor is to add intermediate processing steps that can enrich, modify and possibly filter Documents before they are indexed.
This is important because the processor has a view of the entire Document, so it can operate on all the fields the Document is composed of.

For further details, follow the official documentation [3].
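To make the pipeline idea concrete, here is a minimal Python sketch of a chain of processors. It is only an illustration, not Solr code: `run_chain` and the three toy processors are hypothetical names invented for this example, whereas a real Solr chain is made of Java Update Request Processors configured in solrconfig.xml.

```python
from typing import Callable, Dict, List, Optional

Document = Dict[str, object]
# A processor returns the (possibly modified) document, or None to drop it.
Processor = Callable[[Document], Optional[Document]]


def run_chain(processors: List[Processor], doc: Document) -> Optional[Document]:
    """Pass the document through each processor, in order."""
    for processor in processors:
        doc = processor(doc)
        if doc is None:  # a processor filtered the document out
            return None
    return doc  # at this point Solr would index the document


def trim_title(doc: Document) -> Document:
    """Modify: strip surrounding whitespace from the title."""
    doc["title"] = str(doc.get("title", "")).strip()
    return doc


def add_title_length(doc: Document) -> Document:
    """Enrich: add a field derived from another field of the same document."""
    doc["title_length"] = len(str(doc["title"]))
    return doc


def drop_empty_titles(doc: Document) -> Optional[Document]:
    """Filter: discard documents that have no title."""
    return doc if doc["title"] else None


chain = [trim_title, add_title_length, drop_empty_titles]
indexed = run_chain(chain, {"id": "1", "title": "  Solr in Action "})
dropped = run_chain(chain, {"id": "2", "title": "   "})
```

Each processor sees the whole document, which is why `add_title_length` can derive a new field from an existing one.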

Description

The Classification Update Request Processor is a simple processor that automatically classifies a document (based on the latest index available) by adding a new field containing the class before the document is indexed.

Once an initial, good-quality index has been built from documents with human-assigned labels, this Update Request Processor makes it possible to ingest documents whose classes are assigned automatically.

The processing steps are quite simple.

When a document to be indexed enters the Update Request Processor Chain and arrives at the Classification step, this sequence of operations is executed:

The latest Index Reader is retrieved from the latest opened Searcher
A Lucene Document Classifier is instantiated with the config parameters from solrconfig.xml
A class is assigned by the classifier, taking into consideration all the relevant fields of the input document
A new field containing the class is added to the original Document
The Document moves on to the next processing step
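The steps above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the Solr implementation: the index is a plain list of dictionaries, `majority_class` is a placeholder classifier invented for the example, and leaving an already labelled document untouched is an assumption of this sketch.

```python
from collections import Counter
from typing import Callable, Dict, List, Optional

Document = Dict[str, str]
Classifier = Callable[[Document, List[Document]], Optional[str]]


def majority_class(doc: Document, index: List[Document]) -> Optional[str]:
    """Placeholder classifier: most frequent value of 'cat' in the index."""
    classes = Counter(d["cat"] for d in index if d.get("cat"))
    return classes.most_common(1)[0][0] if classes else None


def classification_step(
    doc: Document,
    index: List[Document],
    classify: Classifier,
    class_field: str = "cat",
) -> Document:
    """Add class_field to doc, using the indexed documents as training data."""
    if doc.get(class_field):         # assumption: keep a label that is already there
        return doc
    assigned = classify(doc, index)  # the classifier looks at the current index
    if assigned is not None:
        doc[class_field] = assigned  # new field holding the assigned class
    return doc                       # handed over to the next processor


index = [
    {"id": "1", "title": "The Way of Kings", "cat": "fantasy"},
    {"id": "2", "title": "Words of Radiance", "cat": "fantasy"},
    {"id": "3", "title": "Dune", "cat": "scifi"},
]
classified = classification_step({"id": "4", "title": "Oathbringer"}, index, majority_class)
```

With an empty index there is nothing to learn from, so the document passes through without a class, which matches the need for an initial labelled corpus.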

Configuration

K Nearest Neighbours Classifier

				
<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title^1.5,content,author</str>
    <str name="classField">cat</str>
    <str name="algorithm">knn</str>
    <str name="knn.k">20</str>
    <str name="knn.minTf">1</str>
    <str name="knn.minDf">5</str>
  </processor>
  <!-- A custom chain must end with RunUpdateProcessorFactory, otherwise the documents are not actually indexed -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>
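As a rough intuition for what the knn algorithm does with `knn.k`, here is a small Python sketch. It is deliberately simplified: the Lucene classifier finds neighbours through a More Like This query scored by the index, while this example assumes a raw count of shared terms as similarity and an unweighted vote. The `knn_classify` function and the sample data are hypothetical.

```python
from collections import Counter
from typing import List, Optional, Tuple


def tokens(text: str) -> List[str]:
    """Very naive analysis: lowercase and split on whitespace."""
    return text.lower().split()


def knn_classify(text: str, labelled: List[Tuple[str, str]], k: int = 3) -> Optional[str]:
    """Vote among the k labelled documents sharing the most terms with text."""
    query = Counter(tokens(text))
    scored = []
    for doc_text, doc_class in labelled:
        # Multiset intersection: number of term occurrences in common.
        overlap = sum((query & Counter(tokens(doc_text))).values())
        if overlap > 0:
            scored.append((overlap, doc_class))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(doc_class for _, doc_class in scored[:k])
    return votes.most_common(1)[0][0] if votes else None


labelled = [
    ("the dragon and the wizard go to war", "fantasy"),
    ("the wizard and the dragon king", "fantasy"),
    ("starship crew explores distant galaxy", "scifi"),
    ("robots on the distant starship", "scifi"),
]
```

Very common terms such as "the" match almost every document in this toy version, which hints at why term filtering parameters exist in the real configuration.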

Simple Naive Bayes Classifier

				
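<updateRequestProcessorChain name="classification">
  <processor class="solr.ClassificationUpdateProcessorFactory">
    <str name="inputFields">title^1.5,content,author</str>
    <str name="classField">cat</str>
    <str name="algorithm">bayes</str>
  </processor>
  <!-- A custom chain must end with RunUpdateProcessorFactory, otherwise the documents are not actually indexed -->
  <processor class="solr.RunUpdateProcessorFactory"/>
</updateRequestProcessorChain>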
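For readers less familiar with Naive Bayes, the following Python sketch shows the general technique: a multinomial model with add-one smoothing, trained on labelled text. It is an assumption-laden illustration, not the Lucene Simple Naive Bayes code, which reads its statistics from the index; `train` and `bayes_classify` are hypothetical names.

```python
import math
from collections import Counter, defaultdict
from typing import List, Optional, Tuple


def train(labelled: List[Tuple[str, str]]):
    """Collect per-class document counts and term counts."""
    class_docs = Counter()
    term_counts = defaultdict(Counter)
    vocabulary = set()
    for text, label in labelled:
        class_docs[label] += 1
        for term in text.lower().split():
            term_counts[label][term] += 1
            vocabulary.add(term)
    return class_docs, term_counts, vocabulary


def bayes_classify(text: str, model) -> Optional[str]:
    """Return the class maximising log prior + sum of log likelihoods."""
    class_docs, term_counts, vocabulary = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label, doc_count in class_docs.items():
        score = math.log(doc_count / total_docs)  # log prior
        total_terms = sum(term_counts[label].values())
        for term in text.lower().split():
            # Add-one smoothing, so unseen terms do not zero out the class.
            likelihood = (term_counts[label][term] + 1) / (total_terms + len(vocabulary))
            score += math.log(likelihood)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train([
    ("the dragon and the wizard go to war", "fantasy"),
    ("the wizard and the dragon king", "fantasy"),
    ("starship crew explores distant galaxy", "scifi"),
    ("robots on the distant starship", "scifi"),
])
```

Because the model counts exact term occurrences per class, the class labels themselves need to stay recognisable, which is in line with the advice to keep the class field lightly analysed.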

Update Handler Configuration

				
<requestHandler name="/update" class="solr.UpdateRequestHandler">
  <lst name="defaults">
    <str name="update.chain">classification</str>
  </lst>
</requestHandler>
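Instead of making the chain the default for the handler, Solr also accepts the `update.chain` parameter on a single update request. The Python sketch below only builds such a request with the standard library and does not send it. The host, port, core name `demo` and the sample document are assumptions for illustration.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request


def build_update_request(base_url: str, core: str, docs: list,
                         chain: str = "classification") -> Request:
    """Build a JSON update request that selects the given processor chain."""
    params = urlencode({"update.chain": chain, "commit": "true"})
    return Request(
        url=f"{base_url}/solr/{core}/update?{params}",
        data=json.dumps(docs).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical local instance and core; the document has no class field yet.
request = build_update_request(
    "http://localhost:8983",
    "demo",
    [{"id": "book4", "title": "The Way of Kings", "author": "Brandon Sanderson"}],
)
# To actually send it: urllib.request.urlopen(request)
```

After a commit, querying for the document id and inspecting the class field is a simple way to check whether a class was assigned.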

Parameter	Default	Description
inputFields	(mandatory)	Comma-separated list of the fields to take into consideration for the classification. Boosting syntax is supported for each field.
classField	(mandatory)	The field that contains the class of the document. It must appear in the indexed documents. With the knn algorithm it must be stored; with the bayes algorithm it must be indexed and ideally not heavily analysed.
algorithm	knn	The algorithm to use for the classification: knn (K Nearest Neighbours) or bayes (Simple Naive Bayes).
knn.k	10	Advanced – the number of top documents to select from the MLT results to find the nearest neighbours.
knn.minDf	1	Advanced – a term (from the input text) is taken into consideration by the algorithm only if it appears in at least this number of documents in the index.
knn.minTf	1	Advanced – a term (from the input text) is taken into consideration by the algorithm only if it appears at least this number of times in the input.
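The following Python sketch illustrates how the `inputFields` boosting syntax and the `knn.minTf` / `knn.minDf` thresholds can be read. It is a hedged illustration of the semantics described in the table, not the Solr parsing code: `parse_input_fields`, `select_terms` and the document frequencies are made up for the example.

```python
from collections import Counter
from typing import Dict, List


def parse_input_fields(spec: str) -> Dict[str, float]:
    """Turn 'title^1.5,content' into {'title': 1.5, 'content': 1.0}."""
    boosts = {}
    for entry in spec.split(","):
        name, _, boost = entry.strip().partition("^")
        boosts[name] = float(boost) if boost else 1.0  # no boost means 1.0
    return boosts


def select_terms(input_terms: List[str], doc_freq: Dict[str, int],
                 min_tf: int = 1, min_df: int = 1) -> List[str]:
    """Keep terms frequent enough in the input (tf) and in the index (df)."""
    term_freq = Counter(input_terms)
    return [
        term for term, count in term_freq.items()
        if count >= min_tf and doc_freq.get(term, 0) >= min_df
    ]


boosts = parse_input_fields("title^1.5,content,author")
# Hypothetical number of indexed documents containing each term.
doc_freq = {"solr": 10, "lucene": 7, "rare": 2}
kept = select_terms(["solr", "solr", "lucene", "rare"], doc_freq, min_tf=1, min_df=5)
```

Raising `min_df` drops terms that are too rare in the index to be reliable, while raising `min_tf` keeps only terms that are prominent in the incoming document.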

Usage

Indexing news documents? We can use the news already indexed with a category to automatically tag upcoming stories with no human intervention.
E-commerce search system? Category assignment will require little human interaction once a valid initial corpus of products has been indexed with manually assigned categories.
The possible uses for this Update Request Processor are countless.
In any scenario where the documents in our search system have a manually assigned class or category, automatic classification can be a good fit.
Because it leverages the existing index, the overhead of the classification processing is small.
After an initial human effort to build a good corpus of classified documents, the search system will be able to automatically index the class of upcoming documents.
Of course, we must remember that for advanced classification scenarios that require deep tuning, this solution may not be optimal.

Code

The patch is attached to this Jira issue:

SOLR-7739

It has been officially merged into Apache Solr, starting with version 6.1.

Need Help With This Topic?

If you’re struggling with document classification in Apache Solr, don’t worry – we’re here to help! Our team offers expert services and training to help you optimize your Solr search engine and get the most out of your system. Contact us today to learn more!


apache lucene, apache solr, classification, indexing, machine learning, search, Update request processor

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease Ltd. Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.


We are Sease, an Information Retrieval Company based in London, focused on providing R&D project guidance and implementation, Search consulting services, Training, and Search solutions using open source software like Apache Lucene/Solr, Elasticsearch, OpenSearch and Vespa.


14 Responses

tonnebrre says:

May 3, 2016 at 2:30 pm

thanks for your article

tonnebrre says:

May 10, 2016 at 3:07 pm

how can i apply the patch

Alessandro Benedetti says:

May 10, 2016 at 4:01 pm

You don't need , this is part of the official Solr already 🙂

Cheers

tonnebrre says:

May 12, 2016 at 10:40 am

please tell me which solr version you use ?

Alessandro Benedetti says:

May 12, 2016 at 11:40 am

It is already in the Solr code (trunk) .
Not sure it is in any release yet !

Cheers

tonnebrre says:

May 16, 2016 at 9:28 am

thanks a lot

Tomas Ramanauskas says:

May 27, 2016 at 1:47 pm

Hi, Alessandro, can you share some example on how to use this feature?

I never used Solr before, but today I downloaded solr-6.1.0-68 which I think already contains your modifications.

I then created a demo core:

./solr create -c demo

And also modified solr/demo/conf/solrconfig.xml file and added:

title_t^1.5,author_s
cat_s
bayes

I loaded few documents:

curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book1",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'

curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book2",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'

curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book3",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"fantasy",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'

But what query shall I use to see the auto classification results?

Tomas Ramanauskas says:

May 27, 2016 at 2:08 pm

From the slide 28 in http://www.slideshare.net/teofili/text-categorization-with-lucene-and-solr presentation I see that the category is automatically assigned if it doesn't exist in the category field, but I don't get anything assigned if I add new documents:

curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book4",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'

curl http://localhost:8984/solr/demo/update -d '
[
{"id" : "book5",
"title_t":["The Way of Kings"],
"author_s":"Brandon Sanderson",
"cat_s":"",
"pubyear_i":2010,
"ISBN_s":"978-0-7653-2635-5"
}
]'

manohar c says:

June 30, 2016 at 1:28 pm

Hi Alessandro,

i added classification in solr , but it is showing error like "Load error: Error loading class 'ClassificationUpdateProcessorFactory'",

here is my solrconfig.xml.

case_title^1.5,case_history
Issue_Group
bayes

classification

is solr comes with classification algorithm? or should i need to add jar file in solr path?

Please help me to do it.

Thanks and Regards.

Alessandro Benedetti says:

June 30, 2016 at 1:30 pm

Hi Manohar,
Which version of Solr are you using ?
Without applying any patch you need Solr 6.1 .
Following the blog documentation should be enough to have it working, in the case you need some help, let me know!

Cheers

manohar c says:

June 30, 2016 at 4:24 pm

Hi Alessandro,

Thanks for quick response.

I am using solr 5.4.1, it is showing error like : “SolrException: Error loading class 'ClassificationUpdateProcessorFactory'”

How can i add ClassificationUpdateProcessorFactory algorithm in solr 5.4.1

I have different types of categories like Battery failure, Fan failure, HDD, MEMORY , Power Supply etc., in a seperate file.

I am pulling documents from sql server, i need to assign those categories to my documents.

let say

{"id" : "5463789",
"case_history": "fan related issue",
"Issue_Group":"Fan failure"
}

{"id" : "5463789",
"case_history": "memory related issues",
"Issue_Group":"MEMORY"
}

Here my solrconfig.xml :

case_history
Issue_Group
knn
10
1
5

classification

Thanks in advance.

manohar c says:

July 4, 2016 at 7:41 am

I am using Solr 5.4.1 , it is showing error like : “SolrException: Error loading class 'ClassificationUpdateProcessorFactory'” .

Can you please tell me , how can i do “document classification” in solr 5.4.1

Rizwaan Adil says:

August 17, 2016 at 10:54 am

I am facing the same problem. My classification field Issue_Group is not getting populated after the data pull is over.

No error messages are noticed in the logs either.

lehuyen says:

October 14, 2016 at 3:02 am

Could you please give me an example how to classification in SOLR???

