Lucene Classification

Introduction

This blog post describes the approach used in the Lucene Classification module to adapt text classification to document (multi-field) classification.

Machine Learning and Search have always been closely related.
Machine Learning can improve the Search experience in many ways: extracting more information from the corpus of documents, automatically classifying them and clustering them.
On the other hand, Search-related data structures are quite similar in content to the ML models used to solve these problems, so we can use them to build up our own algorithms.
But let’s go in order…
In this article we are going to explore how important Auto Classification can be in easing the user experience in Search, and how it is possible to get an easy, out-of-the-box, painless classification directly from Lucene, from our already existing index, without any external dependency or trained model.

Classification

From Wikipedia :

In machine learning and statistics, classification is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations (or instances) whose category membership is known. An example would be assigning a given email into “spam” or “non-spam” classes or assigning a diagnosis to a given patient as described by observed characteristics of the patient (gender, blood pressure, presence or absence of certain symptoms, etc.).

Classification is a very well known problem, and it is really easy to verify the utility of classifiers on a daily basis.
When dealing with Search Engines, it is very common for our documents to have a category (generally assigned by a human). Wouldn’t it be really useful to be able to automatically extract the category for the documents that are missing it? Wouldn’t it be really interesting to automate that processing step and have documents automatically categorised based on experience, which means using all the documents already categorised in our Search System?
Wouldn’t it be amazing to do that without any external tool and without any additional, expensive model-training activity?
We already have a lot of information in our Search System, so let’s use it!

Lucene Classification Module

To provide a Classification capability, a system generally needs a trained model.
Classification is a problem solved by supervised Machine Learning algorithms, which means humans need to provide a training set.
A classification training set is a set of documents manually tagged with the correct class.
It is the “experience” that the classification system will use to classify upcoming unseen documents.
Building a trained model is expensive, and the model must be stored in specific data structures.
Is it really necessary to build an external trained model when we already have millions of documents in our index, already classified by humans and ready to be searched?
A Lucene index already contains really interesting data structures that relate terms to documents (the Inverted Index), fields to terms (Term Vectors), and all the information we need about Term Frequency and Document Frequency.
A corpus of documents, already tagged with the proper class in a specific field, can be a really useful resource to build our classification inside our search engine, without any external tool or data structure.
Based on these observations, the Apache Lucene [1] Open Source project introduced in version 4.2 a classification module that manages text classification using the index data structures.
The facade for this module is the text Classifier, a simple interface (with 3 implementations available):

public interface Classifier<T> {

  /**
   * Assign a class (with score) to the given text String
   *
   * @param text a String containing text to be classified
   * @return a {@link ClassificationResult} holding assigned class of type T and score
   */
  ClassificationResult<T> assignClass(String text) throws IOException;

  /**
   * Get all the classes (sorted by score, descending) assigned to the given text String.
   *
   * @param text a String containing text to be classified
   * @return the whole list of {@link ClassificationResult}, the classes and scores. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(String text) throws IOException;

  /**
   * Get the first max classes (sorted by score, descending) assigned to the given text String.
   *
   * @param text a String containing text to be classified
   * @param max the number of return list elements
   * @return the whole list of {@link ClassificationResult}, the classes and scores. Cut for "max" number of elements. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(String text, int max) throws IOException;
}

The available implementations are:

  • KNearestNeighborClassifier – A k-Nearest Neighbor classifier [2] based on MoreLikeThis
  • SimpleNaiveBayesClassifier – A simplistic Lucene based NaiveBayes [3] classifier 
  • BooleanPerceptronClassifier – A perceptron [4] based Boolean Classifier. The weights are calculated using TermsEnum.totalTermFreq both on a per field and a per document basis and then a corresponding FST is used for class assignment.

Let’s see the first two in detail:

KNearestNeighborClassifier
This classifier is based on Lucene MoreLikeThis [5], a feature able to retrieve documents similar to a seed document (or a seed text), calculating the similarity between the docs on a per-field basis.
It takes in input the list of fields to take into consideration (with a relative boost factor to increase the importance of some fields over the others).

The idea behind the algorithm is simple:
– given a set of relevant fields for our classification
– given a field containing the class of the document
it retrieves all the documents similar to the input text using the MLT.

Only the documents with a value in the class field are taken into consideration.
The top k documents in the result are evaluated, and the class is extracted from each of them.
Then a ranking of the retrieved classes is built, based on the frequency of each class in the top k documents.
Currently the algorithm takes into consideration only the frequency of the class to calculate its score.
One limitation is that it does not take into consideration the rank at which each class appears.
This means that if in the top k we have the first k/2 documents of class C1 and the second k/2 documents of class C2, both classes will get the same score (LUCENE-6654) [6].
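To make the limitation concrete, here is an illustrative sketch (not Lucene’s actual code) of the frequency-only voting, where topKClasses is a hypothetical list holding the class extracted from each of the top k MLT hits:

import java.util.HashMap;
import java.util.List;
import java.util.Map;

// illustrative sketch of the frequency-only scoring: the rank of each hit
// inside the top k is ignored, only how often a class occurs counts
static Map<String, Integer> countVotes(List<String> topKClasses) {
  Map<String, Integer> votes = new HashMap<>();
  for (String assignedClass : topKClasses) {
    votes.merge(assignedClass, 1, Integer::sum);
  }
  // with k = 10, a class at ranks 1..5 and a class at ranks 6..10 both end up
  // with 5 votes, hence the same score (LUCENE-6654)
  return votes;
}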
 
This classifier works on top of the index; let’s quickly have an overview of the constructor parameters:

  • leafReader – the index reader that will be used to read the index data structures and classify the document
  • analyzer – the Analyzer to be used to analyse the unseen input text
  • query – a filter to apply to the indexed documents: only the ones that satisfy the query will be used for the classification
  • k – the number of top docs to select from the MLT results to find the nearest neighbours
  • minDocsFreq – a term (from the input text) will be taken into consideration by the algorithm only if it appears in at least this minimum number of docs in the index
  • minTermFreq – a term (from the input text) will be taken into consideration by the algorithm only if it appears at least this minimum number of times in the input
  • classFieldName – the field that contains the class of the document; it must appear in the indexed documents and MUST BE STORED
  • textFieldNames – the list of fields to be taken into consideration for the classification; they must appear in the indexed documents
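Putting it together, here is a minimal usage sketch. It assumes the Lucene 5.x-era API, where the classifier takes a LeafReader (obtained here through SlowCompositeReaderWrapper, as in the original patches); the index path and the field names category, title and text are hypothetical:

import java.nio.file.Paths;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.Classifier;
import org.apache.lucene.classification.KNearestNeighborClassifier;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SlowCompositeReaderWrapper;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class KnnClassificationExample {
  public static void main(String[] args) throws Exception {
    // open the existing index: it already contains the human-classified documents
    DirectoryReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("/path/to/index")));
    // the classifier works on a single LeafReader view over the index
    LeafReader leafReader = SlowCompositeReaderWrapper.wrap(reader);

    Classifier<BytesRef> classifier = new KNearestNeighborClassifier(
        leafReader,
        new StandardAnalyzer(),  // analyzer for the unseen input text
        null,                    // query: no filter, use all classified documents
        10,                      // k: evaluate the top 10 MLT results
        1,                       // minDocsFreq
        1,                       // minTermFreq
        "category",              // classFieldName (must be stored)
        "title", "text");        // textFieldNames

    ClassificationResult<BytesRef> result = classifier.assignClass("unseen text to classify");
    System.out.println(result.getAssignedClass().utf8ToString() + " score: " + result.getScore());
    reader.close();
  }
}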

Note: MoreLikeThis constructs a Lucene query based on the terms in a document. It does this by pulling terms from the defined list of fields (the textFieldNames parameter above).

You usually read that, for best results, the fields should have stored Term Vectors.
In our case the text to classify is unseen: it is not an indexed document.
So the Term Vectors are not used at all.

SimpleNaiveBayesClassifier
This classifier is based on a simplistic implementation of a Naive Bayes classifier [3].
It uses the index to get the term frequencies of the terms, the doc frequencies and the unique terms.
First of all it extracts all the possible classes; this is obtained by getting all the terms of the class field.
Then each class is scored based on:
– how frequent the class is among all the classified documents in the index
– how likely the tokenized input text is to belong to the class
The details of the algorithm go beyond the scope of this blog post.
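For reference, the description above corresponds to the standard Naive Bayes decision rule (a general sketch, not Lucene’s exact smoothing and log-space implementation):

\[ \mathrm{score}(c \mid d) \;\propto\; P(c) \prod_{i=1}^{n} P(t_i \mid c) \]

where \(P(c)\) is estimated from the fraction of classified documents in the index tagged with class \(c\), and \(P(t_i \mid c)\) from the frequency of term \(t_i\) within the documents of class \(c\).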
 
This classifier works on top of the index; let’s quickly have an overview of the constructor parameters:

  • leafReader – the index reader that will be used to read the index data structures and classify the document
  • analyzer – the Analyzer to be used to analyse the unseen input text
  • query – a filter to apply to the indexed documents: only the ones that satisfy the query will be used for the classification
  • classFieldName – the field that contains the class of the document; it must appear in the indexed documents
  • textFieldNames – the list of fields to be taken into consideration for the classification; they must appear in the indexed documents

Note: the Naive Bayes classifier works on terms from the index. This means it pulls from the index the tokens of the class field. Each token will be considered a class and will get a score associated with it.

This means that you must be careful with the analysis you choose for the classField, and ideally use a non-tokenized field containing the class (a copyField if necessary).
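A minimal sketch, reusing the leafReader and the hypothetical field names from the earlier example (with category assumed to be a non-tokenized field, as the note above recommends):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.Classifier;
import org.apache.lucene.classification.SimpleNaiveBayesClassifier;
import org.apache.lucene.util.BytesRef;

// leafReader is the same LeafReader built in the previous sketch
Classifier<BytesRef> naiveBayes = new SimpleNaiveBayesClassifier(
    leafReader,
    new StandardAnalyzer(),  // analyzer for the unseen input text
    null,                    // query: no filter
    "category",              // classFieldName: ideally non-tokenized
    "title", "text");        // textFieldNames

ClassificationResult<BytesRef> result = naiveBayes.assignClass("unseen text to classify");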

Document Classification Extension

The original text classifiers are perfectly fine, but what about document classification?
In a lot of cases the information we want to classify is actually composed of a set of fields, each with one or more values.
Each field’s content contributes to the classification (with a different weight, to be precise).
Given a simple news article, the title is much more important for the classification than the text or the author.
But even the author, although less relevant, can play a small part in the class assignment.
Lucene’s atomic unit of information is the Document.

Documents are the unit of indexing and search. A Document is a set of fields. Each field has a name and a textual value.

This data structure is a perfect input for a new generation of classifiers that benefit from the augmented information to assign the relevant class(es).
Whenever a simple text is ambiguous, including the other fields in the analysis dramatically improves the precision of the classification.
In detail:

    • the field content from the input document will be compared only with the index data structures related to that field
    • each input field will be analysed according to its own analysis chain, allowing greater flexibility and precision
    • an input field can be boosted to affect the classification more; in this way different portions of the input document will have a different relevancy in discovering the class of the document
A new interface is provided :

public interface DocumentClassifier<T> {

  /**
   * Assign a class (with score) to the given {@link org.apache.lucene.document.Document}
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification.
   * @return a {@link org.apache.lucene.classification.ClassificationResult} holding assigned class of type T and score
   */
  ClassificationResult<T> assignClass(Document document) throws IOException;

  /**
   * Get all the classes (sorted by score, descending) assigned to the given {@link org.apache.lucene.document.Document}.
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification.
   * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(Document document) throws IOException;

  /**
   * Get the first max classes (sorted by score, descending) assigned to the given {@link org.apache.lucene.document.Document}.
   *
   * @param document a {@link org.apache.lucene.document.Document} to be classified. Fields are considered features for the classification.
   * @param max the number of return list elements
   * @return the whole list of {@link org.apache.lucene.classification.ClassificationResult}, the classes and scores. Cut for "max" number of elements. Returns null if the classifier can't make lists.
   */
  List<ClassificationResult<T>> getClasses(Document document, int max) throws IOException;
}

And two classifiers were extended to provide the new functionality:

  • KNearestNeighborDocumentClassifier 
  • SimpleNaiveBayesDocumentClassifier 
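As a rough usage sketch (the constructor shown is indicative, not authoritative: the field names are the hypothetical ones used above, and the per-field Analyzer map reflects the per-field analysis chains described in the list; check the Javadoc of your Lucene version for the exact signature):

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.core.KeywordAnalyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.classification.ClassificationResult;
import org.apache.lucene.classification.document.DocumentClassifier;
import org.apache.lucene.classification.document.KNearestNeighborDocumentClassifier;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.util.BytesRef;

// each input field is analysed with its own analysis chain
Map<String, Analyzer> field2analyzer = new HashMap<>();
field2analyzer.put("title", new StandardAnalyzer());
field2analyzer.put("text", new StandardAnalyzer());
field2analyzer.put("author", new KeywordAnalyzer());

DocumentClassifier<BytesRef> classifier = new KNearestNeighborDocumentClassifier(
    leafReader,          // same LeafReader as in the earlier sketches
    null,                // similarity: use the default
    null,                // query: no filter
    10, 1, 1,            // k, minDocsFreq, minTermFreq
    "category",          // classFieldName
    field2analyzer,
    "title", "text", "author");  // textFieldNames

// the unseen document to classify: its fields are the classification features
Document unseen = new Document();
unseen.add(new TextField("title", "A title about search", Field.Store.YES));
unseen.add(new TextField("text", "The body of the unseen article...", Field.Store.YES));
unseen.add(new TextField("author", "Jane Doe", Field.Store.YES));

ClassificationResult<BytesRef> result = classifier.assignClass(unseen);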

The first implementation is available as a contribution in this Jira issue:

LUCENE-6631

Now that a new interface is available, it will be much easier to integrate it with Apache Solr.

Stay tuned for the Solr Classification Integration.


Author

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease Ltd. A Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.

Comment (1)

  1. Mike Monette
    November 30, 2016

    Hi, I'm trying to implement classification with Lucene 6.3. Looking at the code from your patch, I see that you used SlowCompositeReaderWrapper to get a LeafReader to feed to the Classifier constructors. It looks like SlowCompositeReaderWrapper is no longer in Lucene (it's been moved into Solr). Could you tell me how to create a LeafReader instance? Or do you know of any examples which use purely Lucene 6+ for classification?
    Thanks
