Let’s continue our journey into this entity search thesis!
In Part 1 we described what entities and entity search are, and explained how this kind of search is implemented in the state of the art. We also introduced the new approach of this dissertation, specifying the dataset and the test collection used. Finally, we described graph embedding techniques, focusing on the specific algorithm we use: Node2Vec.
In this post we will describe the first part of the project pipeline: entity representation and clustering.
As we saw in the previous post, the input of our pipeline is the set of all the RDF triples of the subgraph we selected from the DBpedia dataset. We now want to use these triples to obtain the entity representations needed for the subsequent clustering phase.
To obtain these representations, we created two files from the two tables, nodes and edges, that we stored in the database. These files are the input for our embedding algorithm, Node2Vec. In particular, we used the Python implementation of Node2Vec, which you can find here, because it adds support for large graphs that cannot fit in memory.
The algorithm takes as input a graph, defined through its nodes and edges. Once we provided these through our two files, we ran the Node2Vec algorithm, which returns a file containing the embeddings as numerical vectors.
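As a rough sketch of this step (the file names and the whitespace-separated edge-list format below are assumptions for illustration, not the exact files of the thesis), the two files can be loaded into an in-memory adjacency list before being handed to the embedding algorithm:

```python
from collections import defaultdict

def load_graph(nodes_path, edges_path):
    """Load a graph from a node-list file and an edge-list file.

    Assumed formats (hypothetical): one node id per line in nodes_path,
    and one whitespace-separated "source target" pair per line in
    edges_path.
    """
    with open(nodes_path) as f:
        nodes = {line.strip() for line in f if line.strip()}

    adjacency = defaultdict(list)
    with open(edges_path) as f:
        for line in f:
            parts = line.split()
            if len(parts) < 2:
                continue  # skip malformed lines
            src, dst = parts[0], parts[1]
            adjacency[src].append(dst)
    return nodes, adjacency
```

The same pair of files is what a graph library (or Node2Vec itself) would consume to build the graph object for the walks.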
For the Node2Vec execution there are several parameters we can set:
- Embedding dimension: the number of features we use to represent each entity;
- Walk length: the length of each walk created in the training phase of the neural network;
- Number of walks: the number of walks created starting from each node we want to represent;
- p: the return parameter, which controls the likelihood of immediately revisiting the previous node of a walk;
- q: the in-out parameter, which controls the likelihood of visiting nodes farther away from the starting node, balancing between a BFS-like and a DFS-like exploration;
- Workers: the degree of parallelism used in the algorithm execution.
The tuning of these parameters is fundamental for the retrieval process, because it directly affects the embedding quality and therefore the document creation.
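To make the role of p and q more concrete, here is a minimal, self-contained sketch of a single Node2Vec-style biased walk. This is only an illustration of the second-order sampling logic, not the library's actual implementation, and the graph and parameter values are made up:

```python
import random

def biased_walk(adjacency, start, walk_length, p=1.0, q=1.0, seed=None):
    """One Node2Vec-style second-order random walk.

    Unnormalized transition weights from the current node, given the
    previous node `prev`:
      1/p -> step back to prev (return parameter p),
      1   -> step to a neighbor of prev (stay close to where we were),
      1/q -> step farther away (in-out parameter q).
    """
    rng = random.Random(seed)
    walk = [start]
    while len(walk) < walk_length:
        cur = walk[-1]
        neighbors = adjacency.get(cur, [])
        if not neighbors:
            break  # dead end: stop the walk early
        if len(walk) == 1:
            # First step has no "previous" node: sample uniformly.
            walk.append(rng.choice(neighbors))
            continue
        prev = walk[-2]
        weights = []
        for nxt in neighbors:
            if nxt == prev:
                weights.append(1.0 / p)
            elif nxt in adjacency.get(prev, []):
                weights.append(1.0)
            else:
                weights.append(1.0 / q)
        walk.append(rng.choices(neighbors, weights=weights, k=1)[0])
    return walk
```

With a low p the walk keeps bouncing back near its origin; with a low q it drifts outward, which is exactly the locality/exploration trade-off the tuning has to balance.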
Once we have obtained these embeddings, we want to cluster them by similarity in order to subsequently create our documents: as mentioned in Part 1, we associate a document with each cluster.
To create these sets of entity embeddings we used the K-MeansSort algorithm. It is a modified version of the classic K-Means that partially changes the way points are assigned to clusters. In more detail, K-MeansSort speeds up this assignment step by exploiting the triangle inequality together with a sorting of the means, using them to reduce the number of comparisons between the distances computed from points to means.
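As a simplified illustration of the underlying idea (not K-MeansSort itself), the triangle inequality lets an assignment step skip whole means: if the distance between a reference mean c and another mean m is at least twice the distance from the point to c, then m cannot be closer to the point than c is. Sorting the other means by their distance to c lets the scan stop at the first mean that fails the bound:

```python
import math

def assign_with_pruning(point, means):
    """Assign `point` to its nearest mean, pruning with the triangle
    inequality: if d(c, m) >= 2 * d(point, c), then
    d(point, m) >= d(point, c), so m (and, with means sorted by their
    distance to c, every later mean too) can be skipped.
    """
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    ref = 0                             # reference mean (e.g. a current guess)
    d_ref = dist(point, means[ref])
    best, best_dist = ref, d_ref

    # Other means, sorted by their distance to the reference mean.
    order = sorted((j for j in range(len(means)) if j != ref),
                   key=lambda j: dist(means[ref], means[j]))
    for j in order:
        if dist(means[ref], means[j]) >= 2 * d_ref:
            break  # this mean and all later ones are at least d_ref away
        d = dist(point, means[j])
        if d < best_dist:
            best, best_dist = j, d
    return best, best_dist
```

The real algorithm precomputes and sorts the inter-mean distances once per iteration, so the pruning cost is amortized over all points.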
For the execution of this clustering phase, we used a framework called ELKI, an open-source data mining software written in Java that provides tools for implementing data mining algorithms, in particular clustering methods. It offers data index structures that bring major performance gains, it is highly scalable, and its modular approach makes it fast, versatile and easy to use.
Since K-MeansSort is based on K-Means, it requires as input the number of clusters we want to obtain at the end of the process.
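As a sketch of what such a run can look like through ELKI's command-line interface (the jar name, input file, output directory and k value are placeholders, and the exact parameter names should be double-checked against the ELKI version in use):

```shell
# Hypothetical invocation: run the KMeansSort clustering algorithm on a
# file of embeddings, asking for 50 clusters, and write the result out.
java -jar elki-bundle.jar KDDCLIApplication \
  -dbc.in embeddings.csv \
  -algorithm clustering.kmeans.KMeansSort \
  -kmeans.k 50 \
  -resulthandler ResultWriter -out clustering-output/
```

The same configuration can also be assembled interactively through ELKI's GUI, which exposes the identical parameter names.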
Anna Ruggero is a software engineer passionate about Information Retrieval and Data Mining.
She loves to find new solutions to problems, suggesting and testing new ideas, especially those that concern the integration of machine learning techniques into information retrieval systems.