Let’s continue our journey into this entity search thesis!

In Part 1 we have described what entities and entity search are. We have explained how this search is implemented in the state-of-the-art. We have also introduced the new approach of this dissertation specifying also the dataset and the test collection used. Finally we have described graph embeddings techniques, focusing on the specific algorithm we use: Node2Vec.

In this post we will describe the first part of the project pipeline: **entities representation** and **clustering**.

## Entities representation

As we saw in the previous post, the input of our pipeline is the set of all the RDF triples of the subgraph we have selected from the DBpedia dataset. Now we want to use these triples in order to obtain those entities representations necessary for the subsequent clustering phase.

In order to obtain these representations, we created two files from the two different tables, nodes and edges, we have stored in the database. These files will be the input for our embedding algorithm Node2Vec. In particular we used the Python implementation of Node2Vec, that you can find here, because it adds support for big graphs that cannot fit in memory.

The requested input for this algorithm is a graph, defined through its nodes and edges. Once given these parameters through our two files, we executed the Node2Vec algorithm. This will return us a file containing the embeddings in a numerical vector form.

For the Node2Vec execution there are several **parameters** we can set:

**Embedding dimension**: the number of features we want to use in our entity representation;**Walk length**: the length of each walk we create in the training phase of the neural network;**Number of walks**: the number of walks we want to create starting from the node we want to represent;**p**: parameter that defines the probability of revisiting an already-seen node;**q**: parameter that defines the probability of visiting a new node;**Workers**: the degree of parallelism we want to use in the algorithm execution.

The tuning of these parameters is fundamental for the retrieval process because it directly affects the embedding quality and therefore the documents creation.

## Clustering

Once obtained these embeddings, we want to cluster them by similarity in order to subsequently create our documents. As mentioned in Part 1 we want to associate a document to each cluster.

To create these sets of entities embeddings we used the **K-MeansSort** algorithm. It is a modified version of the classic K-Means, in particular it partially changes the way in which the algorithm decides to which class assign points. Wanting to provide a more detailed description, K-MeansSort speeds up this decision process exploiting the triangle inequality and the means sorting. It uses them in order to reduce the number of comparisons between the distances calculated between points and means.

For the execution of this clustering phase, we used a framework called ELKI. This framework gives tools for the implementation of data mining algorithm, in particular for clustering methods. ELKI is an open source data mining software, written in Java, that offers data index structures to provide major performance gains. It is also highly scalable and has a modular approach which makes it fast, versatile and easy to use.

Since K-MeansSort is based on K-means, it requires in input the number of clusters we want to obtain at the end of the process.

Anna Ruggero is a software engineer passionate about Information Retrieval and Data Mining.

She loves to find new solutions to problems, suggesting and testing new ideas, especially those that concern the integration of machine learning techniques into information retrieval systems.