Word2Vec Model To Generate Synonyms – Performance Testing

This blog post series explores our contribution to Apache Lucene, which integrates a Word2Vec model with the text analysis pipeline to generate synonyms on the fly. In particular:

Introduction to synonym expansion technique and state of the art in Apache Lucene and Solr
Word2Vec algorithm and our implementation
Our Lucene Contribution, showing some examples at both index and query time [coming soon…]

In this blog post, we give an overview of the performance of our implementation in terms of time and memory.

Test Configuration

Our aim is to create a word2vec model that will be trained using the data we want to index. Once the training is complete, the resulting output vectors for each word will be used to identify synonyms.
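To make the idea of "using the output vectors to identify synonyms" concrete, here is a minimal sketch, assuming synonyms are the words whose vectors have the highest cosine similarity to the vector of the input term. The vectors below are toy values, and the names maxSynonyms and minSimilarity are hypothetical parameters chosen for illustration; this brute-force scan is not the code used in our Lucene contribution.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class CosineSynonymSketch {

    /** Cosine similarity between two vectors of equal length. */
    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Up to maxSynonyms words with similarity >= minSimilarity to term, most similar first. */
    static List<String> synonyms(Map<String, float[]> vectors, String term,
                                 int maxSynonyms, double minSimilarity) {
        float[] query = vectors.get(term);
        if (query == null) {
            return List.of(); // unknown term: nothing to expand
        }
        List<Map.Entry<String, Double>> scored = new ArrayList<>();
        for (Map.Entry<String, float[]> e : vectors.entrySet()) {
            if (e.getKey().equals(term)) {
                continue; // a term is not its own synonym
            }
            double sim = cosine(query, e.getValue());
            if (sim >= minSimilarity) {
                scored.add(Map.entry(e.getKey(), sim));
            }
        }
        scored.sort(Map.Entry.<String, Double>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(maxSynonyms, scored.size()); i++) {
            result.add(scored.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // Toy 2-dimensional vectors, invented for this example.
        Map<String, float[]> vectors = new LinkedHashMap<>();
        vectors.put("strumento", new float[] {1.0f, 0.0f});
        vectors.put("attrezzo", new float[] {0.9f, 0.1f});
        vectors.put("tastiera", new float[] {0.0f, 1.0f});
        System.out.println(synonyms(vectors, "strumento", 5, 0.6));
    }
}
```

A linear scan like this is only practical for tiny vocabularies; a real model with many thousands of words would need an approximate nearest-neighbour structure instead.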

Our source of documents for training data was the Italian Wikipedia. We employed the Wikipedia extractor, which is a Python tool that extracts plain text from a Wikipedia database dump and eliminates any additional information or annotations such as images, tables, references, and lists present in the Wikipedia pages. A sample document is:

				
<doc id="2" url="http://it.wikipedia.org/wiki/Armonium">
Armonium. 
L'armonium (in francese, “harmonium”) è uno strumento musicale azionato con una tastiera, detta manuale. Sono stati costruiti anche alcuni armonium con due manuali. 
... 
... 
</doc>
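As an illustration of how this format can be consumed, the following is a minimal sketch, assuming each page is wrapped in a <doc ...> ... </doc> block whose first line is the title, as in the sample above. The class WikiExtractorParser and its field names are hypothetical and are not part of our tool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class WikiExtractorParser {

    /** One extracted page: attributes of the <doc> tag plus title and body text. */
    static final class WikiDoc {
        final String id;
        final String url;
        final String title;
        final String text;

        WikiDoc(String id, String url, String title, String text) {
            this.id = id;
            this.url = url;
            this.title = title;
            this.text = text;
        }
    }

    // Matches one <doc id="..." url="..."> ... </doc> block; extra attributes are tolerated.
    private static final Pattern DOC = Pattern.compile(
        "<doc id=\"([^\"]*)\" url=\"([^\"]*)\"[^>]*>\\s*(.*?)\\s*</doc>", Pattern.DOTALL);

    static List<WikiDoc> parse(String dump) {
        List<WikiDoc> docs = new ArrayList<>();
        Matcher m = DOC.matcher(dump);
        while (m.find()) {
            String body = m.group(3);
            int newline = body.indexOf('\n');
            // The first line of the body is treated as the title, the rest as the text.
            String title = (newline < 0 ? body : body.substring(0, newline)).trim();
            String text = newline < 0 ? "" : body.substring(newline + 1).trim();
            docs.add(new WikiDoc(m.group(1), m.group(2), title, text));
        }
        return docs;
    }

    public static void main(String[] args) {
        String sample = "<doc id=\"2\" url=\"http://it.wikipedia.org/wiki/Armonium\">\n"
            + "Armonium.\n"
            + "L'armonium e' uno strumento musicale azionato con una tastiera.\n"
            + "</doc>\n";
        for (WikiDoc doc : parse(sample)) {
            System.out.println(doc.id + " | " + doc.title + " | " + doc.text);
        }
    }
}
```

Each parsed text value is what would then be stored in the field used for training.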

Following that, we indexed all the downloaded documents in a Lucene index named “italian_wiki_data”, where the text of the Italian Wikipedia pages was indexed in the “text” field.

We then trained a Word2Vec model using our Java tool LuceneWord2VecModelTrainer and, finally, used the trained model to test synonym expansion at both index and query time.

All tests were run locally on a macOS laptop.

Model Training

We set up word2vec using Deeplearning4j, a deep learning library for the Java virtual machine (JVM), which ships an out-of-the-box implementation of word2vec based on the Continuous Skip-gram model.

An excerpt of the code (you can find the whole implementation here):

				
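import org.deeplearning4j.models.word2vec.Word2Vec;

public class Config {
    private final String indexPath;      // path to the Lucene index
    private final String fieldName;      // field to read the training text from
    private final String modelFilePath;  // where the trained model is written
    // constructor and getters omitted in this excerpt
} // #1

FieldValuesSentenceIterator iterator = new FieldValuesSentenceIterator(config); // #2

Word2Vec vec = new Word2Vec.Builder() // #3
    .layerSize(100)     // dimensionality of the word vectors
    .windowSize(5)      // size of the context window around each word
    .iterate(iterator)  // #4
    .build();
vec.fit(); // runs the training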

#1 This code snippet defines a Java class called Config that has three private instance variables:
indexPath: path to a Lucene index, “italian_wiki_data” in our case
fieldName: specific field to fetch the values from, “text” in our case
modelFilePath: name of the output model, i.e. wiki-ita-w2v-model.zip

These are the required parameters to pass when launching the training:

java -jar build/libs/LuceneWord2VecModelTrainer.jar -p <index_path> -f <field_name> -o <model_file>

#2 A FieldValuesSentenceIterator is built to read stored values from the Lucene index.
In this example, the text of the Italian Wikipedia pages was indexed in the “text” field, so the sentences and words used to train the word2vec model are fetched from that field.
The Config instance is passed to the FieldValuesSentenceIterator constructor, which uses its settings to iterate over the sentences stored in the configured field of the index.
#3 Creates the word2vec configuration with a set of parameters (layerSize, windowSize, etc.).
#4 Sets word2vec to iterate over the selected corpus, i.e. the FieldValuesSentenceIterator is passed to the word2vec implementation.
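To give an intuition for what the windowSize parameter controls in a skip-gram model, here is a minimal sketch that enumerates the (center word, context word) training pairs for a tokenized sentence. It is a simplified illustration written for this post, not Deeplearning4j's code: real implementations typically add refinements such as sub-sampling or a randomly reduced window, which are left out here.

```java
import java.util.ArrayList;
import java.util.List;

final class SkipGramPairs {

    /** Pairs each token with every neighbour at most windowSize positions away. */
    static List<String[]> pairs(List<String> tokens, int windowSize) {
        List<String[]> out = new ArrayList<>();
        for (int i = 0; i < tokens.size(); i++) {
            int from = Math.max(0, i - windowSize);
            int to = Math.min(tokens.size() - 1, i + windowSize);
            for (int j = from; j <= to; j++) {
                if (j != i) {
                    out.add(new String[] {tokens.get(i), tokens.get(j)});
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> sentence = List.of("uno", "strumento", "musicale", "azionato");
        for (String[] pair : pairs(sentence, 1)) {
            System.out.println(pair[0] + " -> " + pair[1]);
        }
    }
}
```

A larger window produces more pairs per word and tends to capture broader topical similarity, while a smaller one focuses on immediate neighbours.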

Once the training is finished, the tool generates as output a Word2Vec model named “wiki-ita-w2v-model.zip”, which has a size of approximately 375 MB.
The model can now be utilized for testing the Lucene indexing and searching process.
With this new feature, the search engine will be able to learn to generate synonyms from the data it handles.

For more info about the model training, you can have a look at our GitHub project here.

Index Time Results

Synonyms can be applied at both indexing and query time.

Using synonyms at indexing time has some disadvantages:

pollution of term document frequencies in the index (a word is recorded as occurring in a document even though the original text did not contain it)
need to reindex if synonyms change or get added
potential problems with multi-term synonyms and phrase queries
index size expansion
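The first point can be illustrated with a small sketch that counts in how many documents a term occurs, with and without index-time expansion. The documents and the synonym mapping below are invented for the example, and the code only mimics the effect on document frequency; it does not use Lucene.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

final class DocFreqPollution {

    /** Number of documents containing term once each token is expanded with its synonyms. */
    static int docFreq(List<List<String>> docs, String term, Map<String, List<String>> synonyms) {
        int df = 0;
        for (List<String> doc : docs) {
            Set<String> indexedTerms = new HashSet<>(doc);
            for (String token : doc) {
                // Index-time expansion: synonyms are indexed as if the document contained them.
                indexedTerms.addAll(synonyms.getOrDefault(token, List.of()));
            }
            if (indexedTerms.contains(term)) {
                df++;
            }
        }
        return df;
    }

    public static void main(String[] args) {
        List<List<String>> docs = List.of(
            List.of("armonium", "tastiera"),
            List.of("organo", "chiesa"),
            List.of("pianoforte"));
        // Hypothetical synonym mapping, for illustration only.
        Map<String, List<String>> synonyms = Map.of("armonium", List.of("organo"));

        System.out.println("df(organo) without expansion: " + docFreq(docs, "organo", Map.of()));
        System.out.println("df(organo) with expansion: " + docFreq(docs, "organo", synonyms));
    }
}
```

Since document frequency feeds into relevance scoring, inflating it in this way changes how rare, and therefore how important, a term appears to be.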

It is not an entirely bad practice and there may be situations where you want to expand synonyms at index time to manipulate document frequencies; however, doing it at query time is generally the preferred approach.

The use of synonyms at indexing time can be an expensive operation since it requires expanding the synonyms for every single word in your corpus.

In our experiments, indexing roughly 25 MB of documents (2,539 pages) took about 4 minutes, while indexing about 4 GB of documents (1,821,573 pages) would take far too many hours and is therefore not recommended.

This is the text analysis we used for the test:

				
Analyzer analyzer = CustomAnalyzer.builder()
    .withTokenizer(ClassicTokenizerFactory.NAME)
    .addTokenFilter(StopFilterFactory.NAME, "words", "stopwords_it.txt", "format", "snowball", "ignoreCase", "true")
    .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wiki-ita-w2v-model.zip")
    .addTokenFilter(FlattenGraphFilterFactory.NAME)
    .addTokenFilter(LowerCaseFilterFactory.NAME)
    .build();

We tracked the heap memory usage of our Java application with VisualVM; the accompanying graph shows how it changed over the course of the execution while indexing 25 MB of data:

Indexing the same documents without synonym expansion gives the following timings:
25 MB of data: ~5 seconds
4 GB of data: ~7 minutes
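To put these numbers side by side, the arithmetic can be sketched as follows. The slowdown factor uses only the approximate timings reported in this post (about 4 minutes versus about 5 seconds for 25 MB). The projection to 4 GB is an assumption on our part: it supposes that the time grows linearly with the data size and takes 4 GB as 4096 MB, so it is an estimate and not a measurement.

```java
final class IndexingOverhead {

    /** How many times slower indexing is with synonym expansion than without. */
    static double slowdown(double withSynonymsSeconds, double baselineSeconds) {
        return withSynonymsSeconds / baselineSeconds;
    }

    /** Linear projection of a measured time to a different data size (assumes constant throughput). */
    static double projectSeconds(double measuredSeconds, double measuredMb, double targetMb) {
        return measuredSeconds * (targetMb / measuredMb);
    }

    public static void main(String[] args) {
        double withSynonymsSeconds = 4 * 60; // ~4 minutes for 25 MB with expansion
        double baselineSeconds = 5;          // ~5 seconds for 25 MB without expansion

        System.out.printf("slowdown on 25 MB: %.0fx%n",
            slowdown(withSynonymsSeconds, baselineSeconds));

        // Assumption: 4 GB = 4096 MB and throughput stays constant as the corpus grows.
        double projected = projectSeconds(withSynonymsSeconds, 25, 4096);
        System.out.printf("projected time for 4 GB with expansion: %.1f hours%n",
            projected / 3600);
    }
}
```

Under these assumptions, expansion makes indexing about 48 times slower on the small data set, and the full 4 GB corpus would need roughly 11 hours, which is consistent with the "far too many hours" observation above.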

The memory profile for indexing 25MB of data without using synonym expansion is presented below:

Our advice is to use the Word2VecSynonymFilterFactory at index time with caution, and only if you have few documents to index or small fields.

Query Time Results

To evaluate the performance of the search functionality, we created a file of 1000 random terms to be used as queries. The objective was to assess the synonym expansion feature for each term in the file.

This is the text analysis we used for the test:

				
Analyzer analyzer = CustomAnalyzer.builder()
    .withTokenizer(ClassicTokenizerFactory.NAME)
    .addTokenFilter(StopFilterFactory.NAME, "words", "stopwords_it.txt", "format", "snowball", "ignoreCase", "true")
    .addTokenFilter(Word2VecSynonymFilterFactory.NAME, "model", "wiki-ita-w2v-model.zip")
    .addTokenFilter(LowerCaseFilterFactory.NAME)
    .build();

The Java code for this test reads input from the text file containing 1000 query terms, parses each term using a Lucene parser, expands synonyms for each term, and then executes a search using the parsed query. The search returns the top 10 documents matching the query.
The whole process took 1 minute and 30 seconds. In particular:

the custom analyzer builder was executed in ~1 minute
the average time for synonym expansion was 0.2823 ms
the average memory for synonym expansion was 458.40 MB
the average time to find the top 10 hits for a query with synonyms was 6.882 ms
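For readers who want to collect similar per-query averages, here is a minimal sketch of a timing helper built on System.nanoTime. It is a hypothetical harness, not the code we ran for these tests: it does no JIT warm-up and measures wall-clock time only, so very small values should be read with care.

```java
import java.util.List;
import java.util.function.Consumer;

final class AverageLatency {

    /** Runs operation once per input and returns the mean wall-clock time in milliseconds. */
    static <T> double averageMillis(List<T> inputs, Consumer<T> operation) {
        if (inputs.isEmpty()) {
            return 0.0;
        }
        long totalNanos = 0;
        for (T input : inputs) {
            long start = System.nanoTime();
            operation.accept(input);
            totalNanos += System.nanoTime() - start;
        }
        return totalNanos / 1_000_000.0 / inputs.size();
    }

    public static void main(String[] args) {
        List<String> terms = List.of("armonium", "tastiera", "strumento");
        // Placeholder operation; in a real test this would be the synonym expansion or the search.
        double avg = averageMillis(terms, term -> term.toUpperCase());
        System.out.println("average time per term (ms): " + avg);
    }
}
```

Timing the expansion and the search with two separate calls keeps the two averages independent, as in the figures reported above.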

We monitored our Java application using VisualVM and the following is the graph showing the heap memory usage for the duration of this application’s execution:

Keep in mind, then, that the size of the word2vec model (in our case the zip file is 375 MB) translates almost 1:1 into the memory occupied by Lucene.
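A rough way to reason about this footprint is to count the raw bytes of the word vectors: one 4-byte float per dimension per word. The sketch below does only this arithmetic; the vocabulary size in the example is a hypothetical value (we do not report the real one in this post), and the estimate ignores the strings, dictionaries and any search structures kept alongside the vectors.

```java
final class ModelMemoryEstimate {

    /** Raw size in bytes of vocabularySize float vectors with layerSize dimensions each. */
    static long vectorBytes(long vocabularySize, int layerSize) {
        return vocabularySize * layerSize * Float.BYTES;
    }

    public static void main(String[] args) {
        long vocabularySize = 1_000_000L; // hypothetical, for illustration only
        int layerSize = 100;              // same value passed to layerSize(...) during training

        long bytes = vectorBytes(vocabularySize, layerSize);
        System.out.printf("raw vector data: %.1f MiB%n", bytes / (1024.0 * 1024.0));
    }
}
```

The estimate grows linearly with both the vocabulary size and layerSize, so reducing either one is the most direct way to shrink the heap needed at query time.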

Regarding the time required to expand the synonyms of a single term and return the first 10 documents for that query, it took around 0.28 ms and 7 ms, respectively. These are reasonable and usable times.

We did a talk at Berlin Buzzwords 2022 about this!

Check out the talk about this project at Berlin Buzzwords 2022

3 Responses

Mazen Raafat says:

October 26, 2023 at 10:08 am

Hi, thanks for sharing those informative blogs,

just noticed the following section links are all broken and redirects to wp-admin edit

Introduction to synonym expansion technique and state of the art in Apache Lucene and Solr
Word2Vec algorithm and our implementation
Our Lucene Contribution, showing some examples at both index and query time

please update them thanks.

1. Lisa Biella says:
  
  October 30, 2023 at 6:54 pm
  
  Hi Mazen, thanks for your message. The links should be fixed now!
  
Ricardo Chavez says:

January 29, 2024 at 10:46 pm

Hi! Great blog post 🙂
I have been wondering if it is possible to somehow add this synonym search to an opensearch db, as a query expansion.


Ilaria Petreti
