This ‘tips and tricks’ blog post is the result of our collaboration with the University of Padua, in which student Riccardo Forzan played a significant role, selecting the topic and contributing a major portion of the content.
Neo4j is a popular graph database management system designed to store, manage, and query a growing volume of connected data using graph-based structures.
It implements a native, “bare metal” storage engine for graph data and uses index-free adjacency, so every node directly references its adjacent nodes.
This is critical for high-performance graph queries, since each traversal step then costs the same regardless of the total size of the data.
This blog post covers optimizations for Neo4j, with a specific focus on two aspects:
- improving the efficiency of the queries issued to the database
- tuning the system on which Neo4j runs to enhance its overall performance
Optimizing queries
Query optimization is a crucial aspect of improving the performance and efficiency of a database system. Here you can find three strategies to achieve this:
1) Use functions provided by the DBMS to analyze your queries
To better understand the queries and ensure that they run efficiently, you can analyze them using two functions provided by Neo4j, Explain and Profile:
EXPLAIN: prepend it to a query to see the execution plan without executing the query.
PROFILE: prepend it to a query to execute it and track how the rows are passed through the operators.
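For example, using a hypothetical Person label:

```cypher
// Show the execution plan without running the query
EXPLAIN
MATCH (p:Person {name: 'Alice'})
RETURN p;

// Execute the query and report rows and db hits per operator
PROFILE
MATCH (p:Person {name: 'Alice'})
RETURN p;
```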
Leveraging these two Cypher (Neo4j’s query language) keywords can provide valuable insight into query execution, helping us understand what the possible performance problems are and identify areas for improvement.
2) Reduce as soon as possible the working set of a query
To speed up queries you should aim to reduce the query working set as soon as possible.
How could this be done?
- Move DISTINCT and LIMIT as early as possible in the query
- When possible, use COLLECT to aggregate rows early and reduce the number of rows processed during the rest of the execution
Consider the following example:
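The original query is not reproduced here; a query of the following shape (with a hypothetical Person label) matches the description below:

```cypher
// p.name is read for every matched node; LIMIT comes last
MATCH (p:Person)
WITH p.name AS name
RETURN name
LIMIT 20;
```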
This query accesses the property name for all the nodes found after the MATCH operation.
At the end of the query, only the top 20 results are returned in the result set.
How could this be optimized?
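A version matching the description below, again with a hypothetical Person label, could look like:

```cypher
// LIMIT is applied first, so p.name is read for only 20 nodes
MATCH (p:Person)
WITH p
LIMIT 20
RETURN p.name AS name;
```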
This second query accesses the property name only for the top 20 results. The benefits achieved through this optimization vary based on the size of the result set.
3) Use parameterized queries to leverage caching systems
When running a query in a database, if the requested data is not already in the cache, it must be loaded from the disk, making the initial execution of a new query relatively slower.
In subsequent executions of the same query, if the data is available in the cache, response times are significantly reduced, because reading from the cache is much faster than retrieving data from disk.
In addition, any other query that requires the same data as the cached query can benefit from the data already in the cache.
When the Cypher engine receives a query string, it compiles an execution plan for it and stores this plan in the query cache.
Using parameterized queries instead of literal values allows the Cypher engine to reuse the precompiled execution plan, whereas a query with different literal values needs to be parsed and planned again.
This makes parameterized queries the more efficient choice, as they take advantage of the caching system to avoid unnecessary reprocessing of queries and improve overall query performance.
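For illustration (the Person label is an assumption), here is the same lookup with a literal value and with a parameter:

```cypher
// Literal value: a new plan is compiled for each distinct name
MATCH (p:Person {name: 'Alice'})
RETURN p;

// Parameterized: one cached plan is reused for every value of $name
MATCH (p:Person {name: $name})
RETURN p;
```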
Optimizing the system
Optimizing the system on which the Database Management System (DBMS) operates involves various adjustments that improve the efficiency of the database, such as tuning the use of random access memory (RAM) and fine-tuning the storage configuration.
Optimizing the usage of the RAM
1) Optimizing the size of the page cache
As the name suggests, the page cache is a mechanism used to reduce the need for disk access by storing frequently accessed data in memory.
In an ideal scenario, the goal is to have the page cache large enough to hold the entire database, so as to significantly reduce disk reads. However, for very large databases, storing the entire dataset in memory may not be feasible due to memory constraints.
To determine an appropriate size for the page cache, you can inspect the actual memory usage of the system by running the command:
neo4j-admin server memory-recommendation
This command provides recommendations regarding the configuration of memory parameters for your Neo4j DBMS, based on the available system resources and the characteristics of the database.
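The recommended value is then applied in neo4j.conf; the size below is purely illustrative:

```
# neo4j.conf -- illustrative size; ideally large enough to hold the store files
server.memory.pagecache.size=4g
```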
2) Optimize the settings of the JVM
Neo4j (as the name may suggest) is developed in Java, a garbage-collected language: the garbage collector reclaims memory occupied by objects that are no longer used.
In the Neo4j configuration you can set two parameters that control the size of the heap:
server.memory.heap.initial_size
server.memory.heap.max_size
It’s generally recommended to set these two parameters to the same value, so the heap is never resized at runtime; heap resizing can trigger costly stop-the-world garbage collection pauses.
You can determine the value of those two parameters by looking at the history of resource usage.
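In neo4j.conf this might look like the following (8g is an illustrative value):

```
# neo4j.conf -- set initial and max heap to the same illustrative value
server.memory.heap.initial_size=8g
server.memory.heap.max_size=8g
```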
Optimizing the configuration of storage
1) Use EXT4 or XFS, and avoid NFS or NAS file systems
NFS and NAS file systems do not provide reliable file locking, which may allow concurrent accesses that can lead to data corruption.
2) Store data and transaction logs on separate drives
Storing data and transaction logs on physically different drives is a recommended practice in database management to improve performance and reduce contention.
In this way, data read and write operations, along with transaction recording, can occur simultaneously, resulting in increased I/O efficiency and system responsiveness.
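Assuming two separate physical drives mounted at hypothetical paths, the split can be configured in neo4j.conf:

```
# neo4j.conf -- hypothetical mount points on two separate physical drives
server.directories.data=/mnt/disk1/neo4j/data
server.directories.transaction.logs.root=/mnt/disk2/neo4j/transactions
```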
Do You Want To Be Published?
This blog post is part of our collaboration with the University of Padua. If you are a University student or professor and want to collaborate, contact us through e-mail.