Event

Our Berlin Buzzwords 2023

Berlin Buzzwords was BIG for Sease this year:

Alessandro had a session about Multi-valued Vectors in Apache Lucene and was invited onto the main stage as the Apache Solr representative at the “Which Search Engine? The Debate”:

Introducing Multi-valued Vector Fields in Apache Lucene
The Debate Returns (with more vectors) Which Search Engine?

Anna and Ilaria, on the other hand, had a wonderful session on how to use Elastic Kibana for online search quality evaluation:

This blog post summarises their experiences during the visit to Berlin in June for Berlin Buzzwords 2023, “Germany’s most exciting conference on storing, processing, streaming, and searching large amounts of digital data, with a focus on open source software projects.”

A follow-up post on MICES will be published soon, so stay tuned!

Alessandro Benedetti

DIRECTOR @ SEASE

It has been a fantastic experience, probably my best Berlin Buzzwords so far:
the conference and organization were flawless, from the support given to us as speakers to the recording team, which was able to upload the videos in record time.

I’ll summarise here my favorite talks:

Introducing Multi-valued Vector Fields in Apache Lucene

Ok, I’m biased toward this one 😀 I honestly didn’t expect many people to be interested in so many of Lucene’s internals, but the story and details of the contribution worked surprisingly well: the room was almost full and I got many questions!
Hopefully, this will help in finalizing the contribution and making it available to a larger audience!

The Debate Returns (with more vectors) Which Search Engine?

I got a relatively last-minute invitation from the Solr PMC to join the debate on stage, and it was a fantastic experience.

I represented Solr along with Jo Kristian Bergum (Vespa), Etienne Dilocker (Weaviate), Charlie Hull (moderator), Philipp Krenn (Elasticsearch), and Kacper Łukawski (Qdrant).

The debate was polite, informative, and up-to-date (2023).
I was able to highlight the benefits of an independent open-source solution such as Apache Solr in comparison to others backed by a single company, and everyone had time to showcase the use cases where each solution shines.
At the same time, each of us also explained the honest weaknesses and areas we need to improve internally. I definitely recommend watching the recording if you missed it.

What defines the “open” in “open AI”?

This was the opening keynote, and it initiated a nice analysis/discussion of what it means to be open in the field of AI (mostly about Large Language Models).
The talk was clear, the speaker was competent, and it definitely opened the conference in the right way (given the number of talks on the topic we were going to see soon).

Vectorize Your Open Source Search Engine

This short talk from our friend Atita gave a nice overview of moving to, or adding, vector-based search in your search engine.
Super fast and informative!

How to train your general purpose document retriever model

I really enjoyed this talk as it explored Elasticsearch’s journey into the Learned Sparse Retrieval model they recently released under their enterprise license (ELSER).
Definitely an area I want to explore more personally; a lot of interesting pointers and papers to read!

Boosting Ranking Performance with Minimal Supervision

When Jo is on the stage, you already know there is going to be a lot of value.
The talk was informative, giving a high-level overview of many of the problems we face today in integrating AI techniques with ranking.
The part I found the most interesting is using Large Language Models to label training data and help humans in the delicate task of gathering judgments.
Have we finally solved one of the most annoying problems in search quality evaluation and learning to rank?

What's coming next with Apache Lucene?

It’s always a pleasure to see a summary of the directions we are taking (we as committers). Uwe did a great job showing the arrival of function queries for vector similarity scoring and the upcoming support for the Java Panama API to speed up vector similarity calculations.
A lot of cool stuff arriving next week with Lucene 9.7!

Connect GPT with your data: Retrieval-augmented Generation

Hallucination is one of the most important problems we are experiencing with large language models, and this talk explained why it happens (in most cases) and how we can leverage Information Retrieval systems (both lexical and neural) to ground the Large Language Model with a properly designed prompt and context.
Informative and with practical examples; I enjoyed the talk and it’s definitely going to inspire many future contributions.
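As a rough sketch of the pattern (my own illustration with an invented corpus and prompt template, not Malte’s implementation), grounding boils down to retrieving relevant passages and injecting them into the prompt:

```python
# Minimal retrieval-augmented prompt construction (illustrative sketch).
# The corpus, query, and prompt wording below are invented for the example.

def retrieve(query: str, corpus: list[str], k: int = 2) -> list[str]:
    """Rank passages by naive term overlap with the query: a stand-in
    for a real lexical (BM25) or neural retriever."""
    q_terms = set(query.lower().split())
    ranked = sorted(corpus, key=lambda p: -len(q_terms & set(p.lower().split())))
    return ranked[:k]

def build_prompt(query: str, passages: list[str]) -> str:
    """Ground the LLM by placing the retrieved evidence in the context."""
    context = "\n".join(f"- {p}" for p in passages)
    return (
        "Answer using ONLY the context below. If the answer is not in "
        f"the context, say you don't know.\n\nContext:\n{context}\n\n"
        f"Question: {query}"
    )

corpus = [
    "Lucene 9.7 adds Panama API acceleration for vector similarity.",
    "Berlin Buzzwords takes place at the Kulturbrauerei in Berlin.",
    "BM25 is a classic lexical ranking function.",
]
query = "Where does Berlin Buzzwords take place?"
prompt = build_prompt(query, retrieve(query, corpus))
print(prompt)
```

The grounded prompt, rather than the model’s parametric memory alone, is what keeps the answer tied to retrievable facts.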

How to Implement Online Search Quality Evaluation with Kibana

I’m also biased toward this one 🙂
My colleagues Anna and Ilaria did an excellent job of expanding on an experience we started with a client of ours. Search quality evaluation is often underestimated, and I hope that showing practical examples can help companies apply it and better understand how well their search engine satisfies their users.

Rethinking Autoscaling for Apache Solr using Kubernetes

Shifting from AI integrations (the main topic so far) to scalability, this talk from my fellow Solr committer and PMC member Houston explored new approaches to Apache Solr autoscaling with Kubernetes.
Can’t wait for the release! (9.3 should be out soon).

ChatGPT is lying, how can we fix it?

This talk was plagued by technical issues with the projector, but the presenter managed to keep calm and deliver an interesting overview of hallucinations and grounding through retrieval-augmented strategies.
Hopefully, the editing will do the trick and the talk will be more accessible online!

Fact-Checking Rocks: how to build a fact-checking system

I loved this talk! Fact-checking is becoming more and more important nowadays, especially now that it’s so easy to generate plausible and well-written content (thanks to Large Language Models!).
Stefano walked us through the architecture and internals of his project, entirely based on open-source tech such as Haystack, FAISS, Hugging Face Transformers, and Sentence Transformers.

Learning to hybrid search

If I had to choose one, this was my favorite talk! Roman and Vsevolod went straight to the point:
pick a ranking problem using a publicly available dataset and show how different ranking approaches perform, from basic BM25 to learned models using metadata to sentence transformers and back.
Sure, it’s just one dataset, and this doesn’t mean the rankers tried will perform the same on your data, but it’s an interesting survey and definitely a milestone for more studies to come!
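To give a flavour of the baseline end of that spectrum, here is a toy scorer implementing the standard Okapi BM25 formula (the corpus and query are invented for the example, not taken from their dataset):

```python
import math

def bm25_score(query_terms, doc, corpus, k1=1.2, b=0.75):
    """Classic Okapi BM25: idf weighting, term-frequency saturation (k1),
    and document-length normalisation (b). Documents are token lists."""
    n = len(corpus)
    avgdl = sum(len(d) for d in corpus) / n
    score = 0.0
    for term in query_terms:
        df = sum(1 for d in corpus if term in d)  # document frequency
        if df == 0:
            continue
        idf = math.log(1 + (n - df + 0.5) / (df + 0.5))
        tf = doc.count(term)
        score += idf * tf * (k1 + 1) / (tf + k1 * (1 - b + b * len(doc) / avgdl))
    return score

docs = [
    "hybrid search combines lexical and neural signals".split(),
    "bm25 is a lexical ranking function".split(),
    "neural embeddings capture semantics".split(),
]
scores = [bm25_score(["lexical", "ranking"], d, docs) for d in docs]
best = max(range(len(docs)), key=scores.__getitem__)
```

With both query terms present, the second document wins; learned and neural rankers then compete against exactly this kind of baseline.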

ANNA RUGGERO

R&D SOFTWARE ENGINEER @ SEASE

This was my first time at Berlin Buzzwords. What an amazing experience!
A lot of great talks, fantastic people, and a beautiful location. It was also my first opportunity as a speaker, and it was an honor to see so many people interested in the talk I gave with my colleague Ilaria.

Let’s have a look at my favorite talks!

Vectorize Your Open Source Search Engine

A nice overview by Atita Arora on the potential and challenges of vector search. A very good talk for those approaching the topic. Thanks, Atita!

Supercharging your transformers with synthetic query generation and lexical search

What an inspiring talk! Here Milind Shyani described how to fine-tune small models to achieve very good performance while taking advantage of their light model size and low computational cost. He gave an in-depth presentation of the process and its challenges. Something I will definitely get my hands on and try!

Boosting Ranking Performance with Minimal Supervision

I couldn’t miss Jo Kristian Bergum’s talk about boosting ranking performance with synthetic data! Here is another great description of how to fine-tune a ranking model when no behavioral data is available, a very common problem in companies nowadays.

Introducing Multi-valued Vector Fields in Apache Lucene

How could I not say a few words about Alessandro Benedetti’s talk on his contribution to multi-valued vector fields in Lucene 😀

This was a very good explanation of the steps made so far in Lucene and why this feature can really give a boost to vector search!

The Debate Returns (with more vectors) Which Search Engine?

What an interesting panel led by Charlie Hull! As we know there is no “easy answer” to this question and it was great to see all those experts describing how their technologies behave with respect to many different applications and challenges!

Connect GPT with your data: Retrieval-augmented Generation

Who among us has not had to deal with the problem of hallucinations when working with a language model? Thanks, Malte Pietsch, for this great presentation on possible approaches to reduce the issue and improve the quality of the results.

ChatGPT is lying, how can we fix it?

A great talk from Kacper Łukawski on the problem of factuality for language models. A lot of useful strategies to face the problem from a technical point of view, exploiting knowledge bases, prompt engineering, and a lot more.

Learning to hybrid search

Roman Grebennikov and Vsevolod Goloviznin presented a very interesting talk on combining and evaluating many different ranking models. It was great to explore and compare traditional search, learning to rank, and the newest neural approaches on the same baseline!

How to Implement Online Search Quality Evaluation with Kibana

Last but not least, my and Ilaria’s talk on how to implement an online evaluation tool with Kibana!
It was great to see so many people interested in the topic. You can also have a look at the YouTube video below if you are curious and want to go deeper into model evaluation.
Thanks again Berlin Buzzwords for the opportunity to be both an attendee and a speaker at this amazing event!

ILARIA PETRETI

R&D SOFTWARE ENGINEER @ SEASE

Vectorize Your Open Source Search Engine

Atita Arora gave a great overview for those who are fascinated by vector search but don’t know where to start. She discussed the integration of vector search into Chorus, an open-source framework designed for e-commerce applications, which I have never explored and would be interested to try. Go women of search!

Boosting Ranking Performance with Minimal Supervision

I found Jo Kristian Bergum’s talk highly valuable, as he introduced an innovative approach for generating labeled data with minimal human supervision and effort. Using large language models (LLMs), you can easily generate training data for ranking models.

Introducing Multi-valued Vector Fields in Apache Lucene

Alessandro worked for more than a year on this Lucene contribution in his “spare time”, and it was very interesting to find out the details of the implementation and how multi-valued fields can work in a vector-based search use case. It would be even more interesting to see the work completed and know that it will be useful to the community.

The Debate Returns (with more vectors) Which Search Engine?

I had a fantastic time during the panel discussion led by Charlie Hull, where five representatives from Solr, Elasticsearch, Vespa, Weaviate, and Qdrant provided valuable insights on choosing the most suitable search engine for specific use cases. They helped us understand the advantages of each approach and how they differ from one another.

Model Fine-tuning For Search: From Algorithms to Infra

Maximilian Werk and Bo Wang gave a good talk, discussing the importance of model fine-tuning, the algorithmic frameworks behind it, and how to scale the training platform up. It was interesting, but many concepts were covered in less than 40 minutes and, in my case, it was not very easy to follow. I would need to listen to it again to appreciate it better.

What's coming next with Apache Lucene?

I have recently started contributing to Lucene (and Solr), so I was pleased to follow Uwe Schindler’s talk on Lucene’s future plans; it might be interesting to work on the areas of improvement he mentioned: Vector Search and Performance.

Connect GPT with your data: Retrieval-augmented Generation

I really appreciated Malte Pietsch’s talk about the Retrieval-augmented Generation paradigm.
Putting LLMs into production presents several challenges, so he discussed how to use Retrieval-augmented Generation (enriching the prompts by adding the relevant retrieved data to the context) to tackle them, and he shared useful suggestions and techniques (integrated into the Haystack open-source framework).

Fact-Checking Rocks: how to build a fact-checking system

This was my first talk about fact-checking, the process of verifying the accuracy and veracity of information presented as factual, and I am happy that Stefano Fiorucci covered this topic, especially with a fun use case around rock music. He showed us his project in detail, which combines Information Retrieval tools with modern Language Models, using several Python open-source libraries.
I look forward to the opportunity to explore it in more depth and play with the demo.

Learning to hybrid search

Roman Grebennikov and Vsevolod Goloviznin gave a great and entertaining talk and the fact that the Palais Atelier stage was packed is a testament to that.
During their presentation, they demonstrated the effectiveness of employing a hybrid approach that combines BM25, neural embeddings, and client behavior with Learning-to-Rank. This combination resulted in a superior outcome compared to using each of these methods individually.
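As a rough sketch of how such heterogeneous result lists can be combined (reciprocal rank fusion is one common generic technique, not necessarily the exact method they used, and the document IDs below are invented):

```python
def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: each ranker contributes 1/(k + rank)
    for every document it returns, and documents are re-sorted by the
    summed score. k=60 is the value from the original RRF paper."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Two hypothetical rankers over the same query: lexical and neural.
bm25_ranking = ["d1", "d2", "d3"]
neural_ranking = ["d3", "d1", "d4"]
fused = rrf([bm25_ranking, neural_ranking])
```

Documents ranked well by both lists (here d1 and d3) float to the top, which is the intuition behind hybrid retrieval.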

How to Implement Online Search Quality Evaluation with Kibana

Last but not least, our talk!
Since I joined Sease, I have almost always worked on projects alongside my brilliant colleague Anna. We have developed a perfect working relationship, and sharing this important stage with her was a wonderful experience!
We provided an in-depth exploration of a custom approach we implemented for one of our clients to evaluate online ranking models using Kibana.
Even though we were in the Frannz Salon, which was smaller and farther away than the other stages, the room was almost full, and we are glad that our presentation generated interest.
A big thank you to those who were present!
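To give a flavour of the kind of metric such an online-evaluation dashboard can surface (a generic sketch with invented click data, not our client’s actual setup), here is the mean reciprocal rank of the first clicked result per search session:

```python
def mean_reciprocal_rank(sessions: list[list[int]]) -> float:
    """Each session is the list of clicked result positions (1-based).
    MRR averages 1/position of the first click; sessions with no click
    contribute 0, penalising searches that satisfied nobody."""
    total = 0.0
    for clicks in sessions:
        total += 1.0 / min(clicks) if clicks else 0.0
    return total / len(sessions)

# First clicks at rank 1, rank 3, and one session with no click at all.
sessions = [[1], [3, 5], []]
mrr = mean_reciprocal_rank(sessions)
```

Tracking a metric like this over time, sliced per query or per ranking model, is exactly the kind of view a Kibana dashboard makes easy to share.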

About the Berlin Buzzwords organization

Finally, I would like to say a few words about the conference in general and thank the organizers for their efforts in making Berlin Buzzwords an extraordinary event.
Despite some stages/rooms being quite warm, the content presented was engaging and covered a wide range of topics. It was a great opportunity to learn from experts in the field and expand my knowledge.
I also want to highlight the various networking opportunities at the conference, which were invaluable.
And I really appreciated their dedication to offering vegan choices, which aligned perfectly with my dietary preferences.


Did you attend the Berlin Buzzwords conference?

We would love to hear your thoughts on the conference! Leave a comment below and tell us about your experience and your favourite talks.


Subscribe to our newsletter

Did you like this post about our experience at Berlin Buzzwords? Don’t forget to subscribe to our newsletter to stay up to date with the Information Retrieval world!

Author

Alessandro Benedetti

Alessandro Benedetti is the founder of Sease Ltd. A Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.
