
Sease at Berlin Buzzwords 2024


Berlin Buzzwords 2024

Berlin Buzzwords is Germany’s most exciting conference on storing, processing, streaming and searching large amounts of digital data, with a focus on open source software projects.

Location: Kulturbrauerei, Berlin
Date: 9th-11th June 2024

// our talk

Hybrid Search with Apache Solr Reciprocal Rank Fusion

06-10, 12:00–12:40 (Europe/Berlin), Palais Atelier

Vector-based search gained incredible popularity in the last few years: Large Language Models fine-tuned for sentence similarity proved to be quite effective in encoding text to vectors and representing some of the semantics of sentences in a numerical form.
These vectors can be used to run a K-nearest neighbour search and look for documents/paragraphs close to the query in an n-dimensional vector space, effectively mimicking a similarity search in the semantic space (Apache Solr KNN Query Parser; a short query sketch follows the list below).
Although exciting, vector-based search nowadays still presents some limitations:
– it’s very difficult to explain (e.g. why is document A returned and why at position K?)
– it doesn’t care about exact keyword matching (and users still rely on keyword searches a lot)
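
To make the kNN search concrete, here is a minimal sketch of a query against Solr's KNN Query Parser; the host, collection name, and vector field (assumed to be a DenseVectorField called "vector") are illustrative, and the query vector would normally come from your embedding model.

import requests

# Embedding of the user query, produced by a sentence-similarity model
query_vector = [0.12, -0.03, 0.55, 0.08]

params = {
    # top-K nearest neighbours in the "vector" field
    "q": "{!knn f=vector topK=10}" + str(query_vector),
    "fl": "id,score",
}
response = requests.get(
    "http://localhost:8983/solr/my_collection/select", params=params
)
print(response.json()["response"]["docs"])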

Hybrid search comes to the rescue, combining lexical (traditional keyword-based) search with neural (vector-based) search.
So, what does it mean to combine these two worlds?
It starts with the retrieval of two sets of candidates:
– one set of results coming from lexical matches with the query keywords
– a set of results coming from the K-Nearest Neighbours search with the query vector

The result sets are merged and a single ranked list of documents is returned to the user.
Reciprocal Rank Fusion (RRF) is one of the most popular algorithms for such a task (a short sketch of the formula follows this paragraph).
This talk introduces the foundational algorithms behind RRF and walks you through the work done to implement them in Apache Solr, with a focus on the difficulties of the process, the distributed support (SolrCloud), the main components affected, and the limitations faced.
The audience is expected to learn more about this interesting approach, the challenges it involves, and how the contribution process works for an Open Source search project as complex as Apache Solr.
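
As background, the core of RRF is a very small formula: each document's fused score is the sum of 1 / (k + rank) over the rankings it appears in. The sketch below is the textbook version of the algorithm in plain Python, not the Apache Solr implementation discussed in the talk.

def reciprocal_rank_fusion(ranked_lists, k=60):
    # ranked_lists: lists of document ids, best match first
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)

lexical_results = ["d3", "d1", "d7"]   # keyword (BM25) candidates
vector_results = ["d1", "d5", "d3"]    # kNN candidates
print(reciprocal_rank_fusion([lexical_results, vector_results]))
# d1 and d3 rise to the top because both retrievers agree on them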

// our speaker

Alessandro Benedetti

FOUNDER @ SEASE

APACHE LUCENE/SOLR COMMITTER
APACHE SOLR PMC MEMBER

Senior Search Software Engineer, his focus is on R&D in Information Retrieval, Information Extraction, Natural Language Processing, and Machine Learning.
He firmly believes in Open Source as a way to build a bridge between Academia and Industry and facilitate the progress of applied research.

video
slides
// our talk

From Natural Language to Structured Solr Queries using LLMs

06-10, 11:10–11:50 (Europe/Berlin), Maschinenhaus

This talk draws on experimentation to enable AI applications with Solr. One important use case is to use AI for better accessibility and discoverability of the data: while User eXperience techniques, lexical search improvements, and data harmonization can take organizations to a good level of accessibility, a structural (or “cognitive”) gap remains between the data user's needs and the data producer's constraints.
That is where AI – and, most importantly, Natural Language Processing and Large Language Model techniques – could make a difference. Such a natural-language, conversational engine could facilitate access to and usage of the data by leveraging the semantics of any data source.
The objective of the presentation is to propose a technical approach and a way forward to achieve this goal.
The key concept is to enable users to express their search queries in natural language, which the LLM then enriches, interprets, and translates into structured queries based on the Solr index’s metadata.
This approach leverages the LLM’s ability to understand the nuances of natural language and the structure of documents within Apache Solr (a simplified sketch of this flow follows below).
The LLM acts as an intermediary agent, offering a transparent experience to users automatically and potentially uncovering relevant documents that conventional search methods might overlook. The presentation will include the results of this experimental work, lessons learned, best practices, and the scope of future work that should improve the approach and make it production-ready.
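
As an illustration of the flow described above (not the speakers' actual implementation), the sketch below feeds the index's field metadata and the user's question to an LLM through a placeholder call_llm function and expects a structured Solr query back; the schema, prompt, and example output are all assumptions.

import json

# Hypothetical field metadata extracted from the Solr schema
SCHEMA_METADATA = {
    "title": "text_general",
    "author": "string",
    "published_year": "pint",
}

PROMPT_TEMPLATE = """You translate user questions into Apache Solr queries.
Available fields: {schema}
Question: {question}
Return JSON with the keys "q" and "fq" only."""

def to_solr_query(question, call_llm):
    # call_llm is a placeholder for whatever LLM client is used
    prompt = PROMPT_TEMPLATE.format(
        schema=json.dumps(SCHEMA_METADATA), question=question
    )
    return json.loads(call_llm(prompt))

# The kind of output expected from the model, e.g. for
# "papers by Rossi published after 2020":
#   {"q": "author:Rossi", "fq": "published_year:[2021 TO *]"}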

// our speakers

Anna Ruggero

R&D SOFTWARE ENGINEER @ SEASE

R&D Search Software Engineer, her focus is on the integration of Information Retrieval systems with advanced Machine Learning, Neural Search models, and Recommender Systems.
She likes to find new solutions that integrate her work as a Search Consultant with the latest academic studies.

Ilaria Petreti

R&D SOFTWARE ENGINEER @ SEASE

Ilaria is a Data Scientist passionate about the world of Artificial Intelligence. She loves applying Data Mining and Machine Learning techniques, strongly believing in the power of Big Data and Digital Transformation.

video
slides
// our talk

Blazing-Fast Serverless MapReduce Indexer for Apache Solr

06-11, 14:50–15:30 (Europe/Berlin), Palais Atelier

Indexing data from databases to Apache Solr has always been an open problem: for a while, the Data Import Handler was used even though it was not recommended for production environments. Traditional indexing processes often encounter scalability challenges, especially with large datasets.
In this talk, we explore the architecture and implementation of a serverless MapReduce indexer designed for Apache Solr but extendable to any search engine. By embracing a serverless approach, we can take advantage of the elasticity and scalability offered by cloud services like AWS Lambda, enabling efficient indexing without needing to manage infrastructure.
We dig into the principles of MapReduce, a programming model for processing large datasets, and discuss how it can be adapted for indexing documents into Apache Solr. Using AWS Step Functions to orchestrate multiple Lambdas, we demonstrate how to distribute indexing tasks across multiple resources, achieving parallel processing and significantly reducing indexing times.
Through practical examples, we address key considerations such as data partitioning, fault tolerance, concurrency, and cost.
We also cover integration points with other AWS services such as Amazon S3 for data storage and retrieval, as well as DynamoDB for distributed locking between the Lambda instances. A simplified sketch of a single map-step handler follows below.
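
To give a feel for what a single map step might look like, here is a simplified sketch of a Lambda handler that receives one S3 partition from the Step Functions workflow and sends its documents to Solr; the bucket, key, and Solr URL are illustrative, and commit and locking concerns are left to the rest of the workflow.

import json
import boto3
import requests

s3 = boto3.client("s3")
SOLR_UPDATE_URL = "http://solr-host:8983/solr/my_collection/update"

def handler(event, context):
    # Example input from Step Functions:
    # {"bucket": "my-bucket", "key": "partitions/part-0001.json"}
    obj = s3.get_object(Bucket=event["bucket"], Key=event["key"])
    docs = json.loads(obj["Body"].read())

    # Index this partition; a single commit is typically issued once,
    # at the end of the whole workflow, rather than per partition.
    resp = requests.post(SOLR_UPDATE_URL, json=docs, timeout=300)
    resp.raise_for_status()
    return {"indexed": len(docs), "key": event["key"]}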

// our speaker

Daniele Antuzi

R&D SOFTWARE ENGINEER @ SEASE

A software engineer passionate about high-performance data structures and algorithms.
He likes studying and experimenting with new technologies, trying to improve the state of the art.

video
slides


