As we saw in the previous post, the input of our pipeline is the set of all the RDF triples of the subgraph we selected from the DBpedia dataset. Now we want to use these triples to obtain the entity representations needed for the subsequent clustering phase.
To obtain these representations, we created two files from the two tables, nodes and edges, that we had stored in the database. These files are the input for our embedding algorithm, Node2Vec. In particular, we used the Python implementation of Node2Vec, which you can find here, because it adds support for large graphs that cannot fit in memory.
The required input for this algorithm is a graph, defined by its nodes and edges. After supplying them through our two files, we executed the Node2Vec algorithm, which returns a file containing the embeddings as numerical vectors.
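As a rough illustration, the graph can be assembled from the two exported files with networkx. The file names and the plain edge-list format used below are assumptions for the sketch, not the exact export format of our pipeline.

```python
import networkx as nx

# Load edges; each line is assumed to look like: "<source_id> <target_id>"
graph = nx.read_edgelist("edges.txt", nodetype=str, create_using=nx.Graph())

# Add any isolated nodes that appear in the nodes file but not in any edge
with open("nodes.txt") as f:
    graph.add_nodes_from(line.strip() for line in f if line.strip())

print(graph.number_of_nodes(), "nodes,", graph.number_of_edges(), "edges")
```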
For the Node2Vec execution there are several parameters we can set (a sketch of an invocation follows the list):
- Embedding dimension: the number of features we want to use in our entity representation;
- Walk length: the length of each walk we create in the training phase of the neural network;
- Number of walks: the number of walks we generate starting from each node we want to represent;
- p: the return parameter, which controls the likelihood of immediately revisiting the node the walk just came from (a higher p makes backtracking less likely);
- q: the in-out parameter, which controls whether the walk stays close to the starting node (q > 1, BFS-like behavior) or explores nodes farther away (q < 1, DFS-like behavior);
- Workers: the degree of parallelism we want to use in the algorithm execution.
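Below is a minimal sketch of how these parameters map onto a Node2Vec call, assuming the node2vec Python package (which exposes a temp_folder option to spill intermediate walks to disk for graphs that do not fit in memory). The parameter values are illustrative, not the ones we actually tuned.

```python
from node2vec import Node2Vec

# Illustrative values; the graph comes from the loading step sketched above
node2vec = Node2Vec(
    graph,
    dimensions=128,         # embedding dimension: number of features per entity
    walk_length=30,         # length of each random walk
    num_walks=200,          # number of walks generated per node
    p=1.0,                  # return parameter
    q=1.0,                  # in-out parameter
    workers=4,              # degree of parallelism
    temp_folder="/tmp/n2v", # store walks on disk for graphs too large for memory
)

# Train the underlying skip-gram model and save the embeddings
# as word2vec-format numerical vectors, one line per entity
model = node2vec.fit(window=10, min_count=1)
model.wv.save_word2vec_format("embeddings.emb")
```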
The tuning of these parameters is fundamental for the retrieval process because