Neo4j Optimization Tips

This ‘tips and tricks’ blog post is the result of our collaboration with the University of Padua, in which student Riccardo Forzan played a significant role by selecting the topic and contributing a major portion of the content.

Neo4j is a popular graph database management system designed to store, manage, and query a growing volume of connected data using graph-based structures.

It implements native (“bare metal”) graph storage and uses index-free adjacency, so every node directly references its adjacent nodes.

This is critical for high-performance graph queries: because traversals follow these direct references, native graph queries and processing perform at a nearly constant rate per hop, regardless of the total size of the data.

This blog post covers optimizations for Neo4j, with a specific focus on two aspects:

  1. improving the efficiency of the queries issued to the database;
  2. optimizing the system on which Neo4j is executed to enhance its overall performance.

Optimizing queries

Query optimization is a crucial aspect of improving the performance and efficiency of a database system. Here you can find three strategies to achieve this:

1) Use functions provided by the DBMS to analyze your queries

To better understand your queries and make sure they run efficiently, you can analyze them using two keywords provided by Neo4j, EXPLAIN and PROFILE:

EXPLAIN

Prepend it to a query to see the execution plan without actually executing the query.

PROFILE

Prepend it to a query to execute it and track how the rows are passed through the operators.
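
For example, a minimal sketch assuming a hypothetical Person label and KNOWS relationship type (any Cypher query can be prefixed this way):

    // Show the planned operators without running the query
    EXPLAIN
    MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(friend)
    RETURN friend.name

    // Execute the query and report rows and db hits per operator
    PROFILE
    MATCH (p:Person {name: 'Alice'})-[:KNOWS]->(friend)
    RETURN friend.name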

Leveraging these two Cypher (Neo4j’s query language) keywords can provide valuable insights into query execution, helping us understand possible performance problems and identify areas for improvement.

2) Reduce the working set of a query as soon as possible

To speed up queries you should aim to reduce the query working set as soon as possible.
How could this be done?

  • Move DISTINCT and LIMIT as early as possible in the query
  • Where possible, use collect() to aggregate intermediate results and reduce the number of rows processed during the rest of the execution (see the sketch after this list)
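
As an illustration of the second point, here is a minimal sketch assuming a hypothetical City/Person schema with LIVES_IN relationships (the labels and relationship type are illustrative assumptions):

    // Without aggregation: one row per (city, person) pair flows through
    // the rest of the query
    MATCH (c:City)<-[:LIVES_IN]-(p:Person)
    RETURN c.name AS city, p.name AS person

    // With collect(): one row per city, with inhabitants gathered into a list,
    // so later clauses have far fewer rows to process
    MATCH (c:City)<-[:LIVES_IN]-(p:Person)
    WITH c, collect(p.name) AS inhabitants
    RETURN c.name AS city, inhabitants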

Consider a query that accesses the name property for all the nodes found by the MATCH operation, and only at the end of the query returns the top 20 results in the result set.
How could this be optimized?

An optimized version accesses the name property only for the top 20 results. The benefits achieved through this optimization vary based on the size of the result set.
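
A minimal sketch of the two variants, assuming a hypothetical Person label with a name property (recent versions of the Cypher planner may already apply part of this rewrite automatically):

    // Unoptimized: the name property is read for every matched node,
    // and the LIMIT is applied only at the end
    MATCH (p:Person)
    WITH p.name AS name
    RETURN name
    LIMIT 20

    // Optimized: the LIMIT is applied first, so the property is read
    // for only 20 nodes
    MATCH (p:Person)
    WITH p
    LIMIT 20
    RETURN p.name AS name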

3) Use parameterized queries to leverage caching systems

When running a query in a database, if the requested data is not already in the cache, it must be loaded from the disk, making the initial execution of a new query relatively slower.
In subsequent executions of the same query, if data are available in the cache, response times are significantly reduced because reading from the cache is much faster than retrieving data from the disk.
In addition, any other query that requires the same data as the cached query can benefit from the data already in the cache.

When the Cypher engine receives a query string, it compiles an execution plan and stores this plan in the query cache.

Using parameterized queries instead of literal values allows the Cypher engine to reuse the precompiled execution plan, whereas with literal values each distinct value produces a new query string that has to be parsed and planned again.

This makes parameterized queries a more efficient choice, as they take advantage of the caching system to avoid unnecessary reprocessing of queries and improve overall query performance.
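
For example, a minimal sketch assuming a hypothetical Person label (in Neo4j Browser or cypher-shell, parameters can be set with :param, while drivers pass them alongside the query string):

    // Literal value: every distinct name produces a different query string,
    // so each variant is parsed and planned separately
    MATCH (p:Person {name: 'Alice'})
    RETURN p

    // Parameter: the query string never changes, so the cached plan is reused
    MATCH (p:Person {name: $name})
    RETURN p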

Optimizing the system

Optimizing the system on which the Database Management System (DBMS) runs involves various adjustments that improve the efficiency of the database, such as how random access memory (RAM) is used and how storage is configured.

Optimizing the usage of RAM

1) Optimizing the size of the page cache

As the name suggests, the page cache is a mechanism used to reduce the need for disk access by storing frequently accessed data in memory.

In an ideal scenario, the goal is to have the page cache large enough to hold the entire database, so as to significantly reduce disk reads. However, for very large databases, storing the entire dataset in memory may not be feasible due to memory constraints.

To determine the appropriate size for the page cache, you can inspect the actual memory usage of the system by running the command:

    neo4j-admin server memory-recommendation

This command provides recommendations regarding the configuration of memory parameters for your Neo4j DBMS, based on the available system resources and the characteristics of the database.
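
The recommended value can then be applied in neo4j.conf. A minimal sketch with a placeholder size (use the figure reported for your own system; setting name as in Neo4j 5.x):

    # neo4j.conf
    # Amount of memory dedicated to the page cache holding graph data and indexes
    server.memory.pagecache.size=8g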

2) Optimize the settings of the JVM

Neo4j (as the name may suggest) is developed in Java, which is a garbage-collected language. The garbage collector is responsible for freeing memory occupied by unused objects.

In the Neo4j configuration you can set two parameters that control the size of the heap:

    server.memory.heap.initial_size
    server.memory.heap.max_size

It’s generally recommended to set these two parameters to the same value, so that the heap is never resized at runtime, which would otherwise trigger stop-the-world garbage-collection pauses.

You can determine the value of those two parameters by looking at the history of resource usage.
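
A minimal sketch of the corresponding neo4j.conf entries, with placeholder values (base the actual figures on your own usage history or on the memory-recommendation output shown earlier):

    # neo4j.conf
    # Fixed-size JVM heap: initial and maximum set to the same value
    server.memory.heap.initial_size=4g
    server.memory.heap.max_size=4g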

Optimizing the configuration of storage

1) Use EXT4 or XFS, and avoid NFS or NAS file systems

NFS and NAS file systems do not provide reliable file locking, which can allow concurrent accesses to the store files and lead to corruption.

2) Store data and transaction logs on separate drives

Storing data and transaction logs on physically different drives is a recommended practice in database management to improve performance and reduce contention.
In this way, data read and write operations, along with transaction recording, can occur simultaneously, resulting in increased I/O efficiency and system responsiveness.
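
A minimal sketch of how this can be expressed in neo4j.conf, with placeholder paths (setting names as in Neo4j 5.x; check them against your version):

    # neo4j.conf
    # Keep store files and transaction logs on physically separate drives
    server.directories.data=/mnt/disk1/neo4j/data
    server.directories.transaction.logs.root=/mnt/disk2/neo4j/transactions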

Do You Want To Be Published?

This blog post is part of our collaboration with the University of Padua. If you are a University student or professor and want to collaborate, contact us through e-mail.
