The main objective of this survey is to explore the state of the art of Artificial Intelligence applied to Search in the open-source world.
We start with an introduction, explaining what AI means in Information Retrieval and how it can improve search systems.
The second episode explores all the tools Apache Lucene offers to implement such wonders in your search application.
For the third episode, we stay within the Apache Software Foundation and move to Apache Solr, exploring what we can achieve and how.
In the final episode of this series, we deal with Elasticsearch, assessing the current state of development, officially available functionalities, and nice-to-have additions offered by third-party plugins.
So, without further ado, let’s start!
How does Artificial Intelligence impact search?
Since computing power has advanced strongly and steadily in recent years, AI has seen a resurgence and is now used in many domains, including software engineering and Information Retrieval (the science behind search engines and similar systems).
Artificial Intelligence is a complex and broad topic, so many sub-fields exist, each dealing with different technical considerations, goals, and tools.
We’ll focus particularly on Machine Learning:
Machine learning (ML) is the study of computer algorithms that improve automatically through experience and by the use of data. It is seen as a part of artificial intelligence. Machine learning algorithms build a model based on sample data, known as “training data”, in order to make predictions or decisions without being explicitly programmed to do so.
So what are the problems AI and Machine Learning may help to solve in Search?
Quite a few, actually. Let’s look at some examples:
– natural language processing -> to better understand and model the user information need and the corpus of information; text segmentation to target specific passages of information
– image/video recognition -> to extract features and search a multimedia corpus of information
– knowledge representation -> to build better data structures and search algorithms (e.g. vector-based), to identify meaning, synonyms, and relations between terms and concepts, and to improve spellchecking
– learning -> to learn relevance ranking functions, to classify query intent and documents, and to offer personalized results
To solve these problems and bring interesting new capabilities to your search engine, deep learning comes to the rescue:
Deep learning (also known as deep structured learning) is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.
Deep-learning architectures such as deep neural networks, deep belief networks, graph neural networks, recurrent neural networks and convolutional neural networks have been applied to fields including computer vision, speech recognition, natural language processing, machine translation, bioinformatics, drug design, medical image analysis, material inspection and board game programs, where they have produced results comparable to and in some cases surpassing human expert performance.
Applying deep learning techniques to solve search problems is often called Neural Search (an industry derivation from the academic field of Neural Information Retrieval).
We won’t explore the details of how Neural Networks work nor all the possibilities we can achieve with such a multi-faceted technology, so let’s keep our focus on what deep learning can contribute to search:
- a better text representation: moving away from the bag-of-words model (where terms are just sequences of characters) to a multi-dimensional numerical (vectorized) approach, able to model terms as semantic units of information, linked to each other and carrying meaning
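To make the vectorized approach concrete, here is a minimal Python sketch. The four-dimensional vectors below are made up purely for illustration (in practice they would come from a trained embedding model such as word2vec or a transformer); the point is that cosine similarity can capture relatedness between terms that share no characters at all, something a bag-of-words model cannot do:

```python
import math

# Toy 4-dimensional "embeddings": the values are invented for illustration.
# In a real system these vectors are produced by a trained model.
embeddings = {
    "dog":   [0.9, 0.8, 0.1, 0.0],
    "puppy": [0.85, 0.75, 0.15, 0.05],
    "car":   [0.0, 0.1, 0.9, 0.8],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# "dog" and "puppy" share no characters, yet their vectors point in nearly
# the same direction; "car" points elsewhere.
print(cosine_similarity(embeddings["dog"], embeddings["puppy"]))  # close to 1.0
print(cosine_similarity(embeddings["dog"], embeddings["car"]))    # much lower
```

A term-matching engine would treat "dog" and "puppy" as unrelated tokens; in the vector space they are close neighbors.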
- text generation: language modeling techniques have flourished and reached mainstream news thanks to outstanding results in generating text that is almost indistinguishable from human-written text
Generating text can be useful in many Information Retrieval areas: query auto-completion, query spellchecking, document summarization, search results explainability (summarizing the information that the document contributes to the user information need)…
Improvements in this field could lead to completely new types of information retrieval systems that behave like human experts: the system won’t just return a list of documents to satisfy your information need, but will synthesize a comprehensive natural language response backed by supporting evidence (documents).
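To make the language-modeling idea tangible at toy scale, here is a deliberately tiny sketch: a word-level bigram model (a simple Markov chain, not a neural network, and the corpus is invented for illustration). Real systems learn far richer context representations, but the underlying task, predicting the next token given what came before, is the same one that powers query auto-completion and text generation:

```python
import random
from collections import defaultdict

# Tiny invented corpus; a bigram model just counts which word follows which.
corpus = "the cat sat on the mat the cat ate the fish".split()

transitions = defaultdict(list)
for current_word, next_word in zip(corpus, corpus[1:]):
    transitions[current_word].append(next_word)

def generate(start, max_words, seed=42):
    """Sample up to max_words words by repeatedly picking a likely follower."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    words = [start]
    for _ in range(max_words - 1):
        followers = transitions.get(words[-1])
        if not followers:  # dead end: no observed continuation
            break
        words.append(rng.choice(followers))
    return " ".join(words)

print(generate("the", 6))  # a short "generated" phrase in the corpus style
```

The same predict-the-next-token loop, backed by a neural model instead of raw counts, is what produces fluent completions and summaries.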
- a better image/video representation: extracting semantic features from images and videos (such as the objects and entities involved rather than just pixel and color-related information). 
Using large pre-trained models, fine-tuned for your use case (potentially using transfer learning techniques), helps build the foundation for advanced multimedia retrieval, reducing the effort of continuous supervised metadata tagging.
- learning to rank: currently, the vast majority of search engines identify a set of candidate documents from the corpus of information (matching) and order them by relevance to satisfy the user information need (ranking).
Providing the most useful results first in the ranked list is fundamental: with deep learning it is possible to train advanced relevance ranking models from past interactions/judgments to rank documents for a given query (both represented as numerical vectors).
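As a sketch of the training idea (not any engine’s actual implementation), the following pointwise learning-to-rank example fits a linear model to graded relevance judgments with plain gradient descent. The features and judgments are invented for illustration; production systems use far richer models (gradient-boosted trees, neural rankers), but the shape of the training loop is the same:

```python
# Each (query, document) pair becomes a small feature vector -- the three
# features here (think: text match score, title match, freshness) are
# hypothetical -- paired with a relevance judgment.
training_data = [
    ([0.9, 1.0, 0.3], 1.0),  # strong match -> judged relevant
    ([0.8, 0.9, 0.5], 1.0),
    ([0.2, 0.0, 0.9], 0.0),  # weak match -> judged not relevant
    ([0.1, 0.1, 0.4], 0.0),
]

weights = [0.0, 0.0, 0.0]
learning_rate = 0.1

def score(features):
    """Predicted relevance: a simple weighted sum of the features."""
    return sum(w * f for w, f in zip(weights, features))

# Stochastic gradient descent on squared error.
for _ in range(500):
    for features, label in training_data:
        error = score(features) - label
        for i, f in enumerate(features):
            weights[i] -= learning_rate * error * f

# Rank unseen candidate documents for a query by predicted relevance.
candidates = {"doc_a": [0.85, 0.95, 0.4], "doc_b": [0.15, 0.05, 0.8]}
ranked = sorted(candidates, key=lambda d: score(candidates[d]), reverse=True)
print(ranked)  # the strongly matching document comes first
```

The learned weights encode which features actually predict relevance, which is exactly what hand-tuned boosts try to approximate manually.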
- a better machine translation: having a computer translate languages with the quality of a human expert has always been a challenge. Deep learning has managed to replace approaches such as rule-based systems and statistical phrase-based methods.
This brings huge benefits for multi-lingual search: you can query in one language and find documents in many different languages far more effectively.
From this introduction, we can see that many of the deep learning contributions to Search require our search engine to support multi-dimensional numerical vectors.
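In its simplest form, vector support means scoring every document vector against a query vector: a brute-force k-nearest-neighbor search. The sketch below uses invented vectors for illustration; production engines rely on approximate nearest-neighbor indexes (such as HNSW graphs) precisely to avoid this exhaustive scan over the whole corpus:

```python
import math

# Hypothetical document vectors; in practice an embedding model produces them.
documents = {
    "doc1": [0.1, 0.9, 0.2],
    "doc2": [0.9, 0.1, 0.3],
    "doc3": [0.3, 0.7, 0.4],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def knn_search(query_vector, k=2):
    """Brute force: score every document against the query, keep the top k."""
    scored = sorted(documents.items(),
                    key=lambda item: cosine(query_vector, item[1]),
                    reverse=True)
    return [doc_id for doc_id, _ in scored[:k]]

print(knn_search([0.12, 0.88, 0.22]))  # the two documents nearest the query
```

This linear scan is fine for a toy corpus, but its cost grows with every document, which is why vector-capable engines build dedicated index structures.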
So how can you implement such wonders with currently available open-source technologies?
What is officially supported? Where do we need third-party plugins?
The next episode of this series explores Apache Lucene from this perspective! Stay tuned!