Lucene Document Classification
Introduction
This blog post describes the approach used in the Lucene classification module to extend text classification to document (multi-field) classification.
Machine Learning and Search have always been closely associated.
Machine Learning can improve the search experience in many ways: extracting more information from the corpus of documents, automatically classifying them, and clustering them.
On the other hand, search-related data structures are quite similar in content to the ML models used to solve these problems, so we can use them to build our own algorithms.
But let's proceed in order…
In this article we explore how valuable automatic classification can be for the search user experience, and how we can get easy, out-of-the-box, painless classification directly from Lucene, using our existing index, without any external dependency or separately trained model.
Classification
From Wikipedia :
In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into “spam” or “non-spam“ classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).
Classification is a very well-known problem, and it is easy to see the utility of classifiers on a daily basis.
When dealing with search engines, our documents very commonly have a category (generally assigned by a human). Wouldn't it be useful to automatically extract the category for the documents that are missing it? Wouldn't it be interesting to automate that processing step so that documents are categorised based on experience (which means using all the documents already categorised in our search system)?
Wouldn't it be amazing to do that without any external tool and without any additional, expensive model-training activity?
We already have a lot of information in our search system; let's use it!
Lucene Classification Module
To provide a classification capability, a system generally needs a trained model.
Classification is a problem solved by supervised machine-learning algorithms, which means humans need to provide a training set.
A classification training set is a set of documents manually tagged with the correct class.
It is the "experience" that the classification system will use to classify upcoming unseen documents.
Building a trained model is expensive, and the model needs to be stored in specific data structures.
Is it really necessary to build an external trained model when our index already contains millions of documents, classified by humans and ready to be searched?
A Lucene index already has very interesting data structures that relate terms to documents (the inverted index), fields to terms (term vectors), and all the information we need about term frequency and document frequency.
A corpus of documents, already tagged with the proper class in a specific field, can be a really useful resource for building classification inside our search engine, without any external tool or data structure.
Based on these observations, the Apache Lucene [1] open-source project introduced in version 4.2 a classification module that manages text classification using the index data structures.
The facade for this module is the text Classifier, a simple interface (with three implementations available):
```java
public interface Classifier<T> {

  /**
   * Assign a class (with score) to the given text String
   *
   * @param text a String containing text to be classified
   * @return a {@link ClassificationResult} holding assigned class of type T and score
   */
  ClassificationResult<T> assignClass(String text) throws IOException;

  /**
   * Get all the classes (sorted by score, descending) assigned to the given text String.
   *
   * @param text a String containing text to be classified
   * @return the whole list of {@link ClassificationResult}, the classes and scores.
   *         Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(String text) throws IOException;

  /**
   * Get the first max classes (sorted by score, descending) assigned to the given text String.
   *
   * @param text a String containing text to be classified
   * @param max the number of return list elements
   * @return the whole list of {@link ClassificationResult}, the classes and scores.
   *         Cut for "max" number of elements. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(String text, int max) throws IOException;
}
```
The available implementations are:
- KNearestNeighborClassifier – A k-Nearest Neighbor classifier [2] based on MoreLikeThis
- SimpleNaiveBayesClassifier – A simplistic Lucene based NaiveBayes [3] classifier
- BooleanPerceptronClassifier – A perceptron [4] based Boolean Classifier. The weights are calculated using TermsEnum.totalTermFreq both on a per field and a per document basis and then a corresponding FST is used for class assignment.
Let's look at the first two in detail:
KNearestNeighborClassifier
| Parameter | Description |
|---|---|
| leafReader | The index reader that will be used to read the index data structures and classify the document |
| analyzer | The Analyzer used to analyse the unseen input text |
| query | A filter to apply to the indexed documents: only the ones that satisfy the query will be used for the classification |
| k | The number of top docs to select from the MoreLikeThis results to find the nearest neighbours |
| minDocsFreq | A term (from the input text) is taken into consideration by the algorithm only if it appears in at least this minimum number of docs in the index |
| minTermFreq | A term (from the input text) is taken into consideration by the algorithm only if it appears at least this minimum number of times in the input |
| classFieldName | The field that contains the class of the document. It must appear in the indexed documents. MUST BE STORED |
| textFieldNames | The list of fields taken into consideration for the classification. They must appear in the indexed documents |
Note: MoreLikeThis constructs a Lucene query based on the terms in a document. It does this by pulling terms from the defined list of fields (the textFieldNames parameter above).
You usually read that, for best results, these fields should have stored term vectors.
In our case, however, the text to classify is unseen: it is not an indexed document.
So the term vectors are not used at all.
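Putting these parameters together, instantiating and querying the k-NN classifier looks roughly like the sketch below. The constructor signature follows the Lucene 5.x/6.x era of the module (check the Javadoc of your version), and the index path and the `category`/`text` field names are hypothetical:

```java
import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.KNearestNeighborClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class KnnClassifierSketch {
  public static void main(String[] args) throws Exception {
    try (DirectoryReader reader =
             DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")))) {
      // The classifier reads index statistics through a LeafReader.
      // Taking the first leaf is only correct for a single-segment index;
      // a multi-segment index needs a composite wrapper.
      LeafReader leafReader = reader.leaves().get(0).reader();

      KNearestNeighborClassifier classifier = new KNearestNeighborClassifier(
          leafReader,
          null,                   // similarity: null falls back to the default
          new StandardAnalyzer(), // analyzer for the unseen input text
          null,                   // query: no filter, use all indexed documents
          10,                     // k: top MLT results used as neighbours
          1,                      // minDocsFreq
          1,                      // minTermFreq
          "category",             // classFieldName (must be stored)
          "text");                // textFieldNames

      ClassificationResult<BytesRef> result =
          classifier.assignClass("some unseen text to classify");
      System.out.println(result.getAssignedClass().utf8ToString()
          + " score=" + result.getScore());
    }
  }
}
```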
SimpleNaiveBayesClassifier
| Parameter | Description |
|---|---|
| leafReader | The index reader that will be used to read the index data structures and classify the document |
| analyzer | The Analyzer used to analyse the unseen input text |
| query | A filter to apply to the indexed documents: only the ones that satisfy the query will be used for the classification |
| classFieldName | The field that contains the class of the document. It must appear in the indexed documents |
| textFieldNames | The list of fields taken into consideration for the classification. They must appear in the indexed documents |
Note: The NaiveBayes classifier works on terms from the index. This means it pulls from the index the tokens of the class field; each token is considered a class and gets a score associated with it.
This means you must be careful with the analysis you choose for the class field, and ideally use a non-tokenized field containing the class (a copyField if necessary).
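To make that advice concrete, here is a hedged sketch of both sides: the class field is indexed as a StringField, which is not analysed, so each class value survives as exactly one token; the classifier is then built over that field. Field names are hypothetical and the constructor shape should be checked against your Lucene version's Javadoc:

```java
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

// Index side: a StringField is not analysed, so "sci-fi" stays one token
// and therefore one class for the NaiveBayes classifier.
Document doc = new Document();
doc.add(new TextField("text", "a space opera with sentient ships", Field.Store.YES));
doc.add(new StringField("category", "sci-fi", Field.Store.YES));

// Classification side (leafReader obtained as in the k-NN case):
SimpleNaiveBayesClassifier classifier = new SimpleNaiveBayesClassifier(
    leafReader,
    new StandardAnalyzer(), // analyses the unseen input text
    null,                   // no filtering query
    "category",             // classFieldName: one token per class
    "text");                // textFieldNames
ClassificationResult<BytesRef> result = classifier.assignClass("an unseen synopsis");
```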
Document Classification Extension
Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value.
This data structure is a perfect input for a new generation of classifiers that benefit from the augmented information to assign the relevant class(es).
Whenever plain text is ambiguous, including other fields in the analysis dramatically improves the precision of the classification.
In detail:
- the field content from the input document is compared only with the index data structures related to that field
- each input field is analysed according to its own analysis chain, allowing greater flexibility and precision
- an input field can be boosted to weigh more in the classification; in this way different portions of the input document have different relevance in discovering the class of the document
```java
public interface DocumentClassifier<T> {

  /**
   * Assign a class (with score) to the given {@link org.apache.lucene.document.Document}
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified.
   *                 Fields are considered features for the classification.
   * @return a {@link org.apache.lucene.classification.ClassificationResult} holding assigned class of type T and score
   */
  ClassificationResult<T> assignClass(Document document) throws IOException;

  /**
   * Get all the classes (sorted by score, descending) assigned to the given {@link org.apache.lucene.document.Document}.
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified.
   *                 Fields are considered features for the classification.
   * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores.
   *         Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(Document document) throws IOException;

  /**
   * Get the first max classes (sorted by score, descending) assigned to the given {@link org.apache.lucene.document.Document}.
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified.
   *                 Fields are considered features for the classification.
   * @param max the number of return list elements
   * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores.
   *         Cut for "max" number of elements. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(Document document, int max) throws IOException;
}
```
And two classifiers have been extended to provide the new functionality:
- KNearestNeighborDocumentClassifier
- SimpleNaiveBayesDocumentClassifier
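A usage sketch for the document flavour might look like the following. The per-field analyzer map is what lets each input field use its own analysis chain; the constructor shape mirrors the Lucene 6.x module (verify against your version's Javadoc), and all field names are hypothetical. The module also supports boosting input fields, as described above, though the exact syntax for expressing the boost depends on the version:

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.document.KNearestNeighborDocumentClassifier;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

// Each input field is analysed with its own chain.
Map<String, Analyzer> field2analyzer = new HashMap<>();
field2analyzer.put("title", new StandardAnalyzer());
field2analyzer.put("body", new EnglishAnalyzer());

KNearestNeighborDocumentClassifier classifier = new KNearestNeighborDocumentClassifier(
    leafReader,      // index reader, as for the text classifiers
    null, null,      // default similarity, no filtering query
    10, 1, 1,        // k, minDocsFreq, minTermFreq
    "category",      // classFieldName
    field2analyzer,
    "title", "body"); // textFieldNames

// The unseen document to classify: its fields are the features.
Document unseen = new Document();
unseen.add(new TextField("title", "A tutorial on inverted indexes", Field.Store.NO));
unseen.add(new TextField("body", "How terms map to documents in Lucene", Field.Store.NO));

ClassificationResult<BytesRef> result = classifier.assignClass(unseen);
```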
The first implementation is available as a contribution for this Jira issue :
Now that a new interface is available, it will be much easier to integrate it with Apache Solr.
Stay tuned for the Solr Classification Integration.
Shameless plug for our training and services!
Did I mention we do Apache Solr Beginner and Elasticsearch Beginner training?
We also provide consulting on these topics, get in touch if you want to bring your search engine to the next level!
Subscribe to our newsletter
Did you like this post about Lucene Document Classification? Don't forget to subscribe to our newsletter to stay up to date with the Information Retrieval world!
Author
Alessandro Benedetti
Alessandro Benedetti is the founder of Sease Ltd. Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.
Comment (1)
Mike Monette
November 30, 2016
Hi, I'm trying to implement classification with Lucene 6.3. Looking at the code from your patch, I see that you used SlowCompositeReaderWrapper to get a LeafReader to feed to the Classifier constructors. It looks like SlowCompositeReaderWrapper is no longer in Lucene (it's been moved into Solr). Could you tell me how to create a LeafReader instance? Or do you know of any examples which use purely Lucene 6+ for classification?
Thanks