Suppose you have a set of images and you want to retrieve the most relevant ones through a textual query. How can you do it?
A popular option is to store a textual description of each image in the search engine and use it for lexical matching with the textual query. Thanks to recent improvements in Computer Vision and Artificial Intelligence, this description can be generated automatically from the image through what is called image captioning.
In this blog post, we would like to show you an example of this approach. We will start by generating an image description using a Vision Transformer (ViT) and a Generative Pre-trained Transformer (GPT), and finish with the creation of a Lucene searcher, all accompanied by code examples.
Feature Extraction With ViT For Image Retrieval
Vision Transformer (ViT) is a pioneering model introduced in 2020 that has significantly transformed the field of Computer Vision. Unlike traditional Convolutional Neural Networks (CNNs), ViT adopts a distinct approach to extracting visual features from input images, leading to remarkable advancements in image understanding.
[Figure: ViT phases for image retrieval]
ViT’s feature extraction process encompasses several key steps that enable it to capture meaningful information from images:
- Input Encoding: the input image is segmented into a fixed number of patches. Each patch is linearly projected into a low-dimensional representation called an embedding, a compressed version of the data that captures the essential information while reducing the overall dimensionality. This segmentation allows ViT to process the image as a sequence of patches, which facilitates the subsequent analysis.
- Positional Encoding: to preserve the spatial information of each patch, a positional encoding is added to the patch embedding, enabling the model to comprehend the relative positions of the patches within the image and gain an understanding of its structure. The positional encoding can be a fixed set of sinusoidal functions with different frequencies and phases or, as in the original ViT, a learned embedding; in either case, each position receives a unique encoding.
- Transformer Encoder: the patch embeddings and positional encodings are fed into a transformer encoder composed of multiple layers, each containing a self-attention mechanism and a feed-forward neural network. For every patch, the self-attention mechanism derives a query, a key, and a value vector; attention weights are computed from the scaled dot product between queries and keys, which measures the similarity or relevance between patches, and these weights are then used to combine the value vectors. This enables the model to attend to various patches, learn their dependencies, and discern meaningful visual patterns (see the sketch after this list).
- Classification Head: a classification head on top of the transformer encoder output, typically a small feed-forward layer applied to a dedicated classification token, produces the class prediction for the image. By leveraging the features and contextual information learned through the transformer encoder, ViT enables accurate image classification and semantic understanding.
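To make the first three phases concrete, here is a minimal PyTorch sketch of patch embedding, positional encoding, and a single self-attention step. The patch size, embedding dimension, random input image, and single attention head are illustrative assumptions only; the real ViT uses multi-head attention, several encoder layers, and an extra classification token.
import torch
import torch.nn.functional as F

# illustrative sizes: a 224x224 RGB image split into 16x16 patches
batch, channels, height, width = 1, 3, 224, 224
patch_size, embed_dim = 16, 768
num_patches = (height // patch_size) * (width // patch_size)  # 196
image = torch.randn(batch, channels, height, width)

# input encoding: cut the image into patches and linearly project each one
patches = image.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.reshape(batch, channels, num_patches, patch_size * patch_size)
patches = patches.permute(0, 2, 1, 3).reshape(batch, num_patches, -1)
projection = torch.nn.Linear(channels * patch_size * patch_size, embed_dim)
tokens = projection(patches)  # (1, 196, 768): one embedding per patch

# positional encoding: one (here learnable) vector added to each patch embedding
pos_embedding = torch.nn.Parameter(torch.zeros(1, num_patches, embed_dim))
tokens = tokens + pos_embedding

# single-head self-attention: queries, keys and values come from the same tokens
w_q = torch.nn.Linear(embed_dim, embed_dim)
w_k = torch.nn.Linear(embed_dim, embed_dim)
w_v = torch.nn.Linear(embed_dim, embed_dim)
q, k, v = w_q(tokens), w_k(tokens), w_v(tokens)
weights = F.softmax(q @ k.transpose(-2, -1) / embed_dim ** 0.5, dim=-1)
attended = weights @ v  # each patch becomes a weighted mix of all patches
print(attended.shape)   # torch.Size([1, 196, 768])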
Implementation Of Image Captioning
Let’s dive into the implementation of image captioning. We use the vit-gpt2-image-captioning pre-trained model available on Hugging Face, which is composed of a Vision Transformer that extracts features from the images (the encoder) and a Generative Pre-trained Transformer that converts those features into a textual description of the input image (the decoder).
In particular, GPT is responsible for generating the description conditioned on the visual features returned by the ViT encoder.
To do that, we use Python.
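As a side note, the same checkpoint can also be driven through the higher-level transformers pipeline helper; the snippet below is only a quick sketch (the file name is a hypothetical placeholder), while the rest of the post loads the individual components explicitly.
from transformers import pipeline

# image-to-text pipeline wrapping the same ViT+GPT2 checkpoint
captioner = pipeline("image-to-text", model="nlpconnect/vit-gpt2-image-captioning")
# hypothetical example file: replace it with one of your own images
print(captioner("images/example.jpg"))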
- Put the images (e.g. Flickr Image Dataset) you want to transform in an accessible folder (e.g. images/)
- Import the needed dependencies, including Pillow (PIL), PyTorch, and transformers:
import glob
import torch
import os
from PIL import Image
from transformers import VisionEncoderDecoderModel, GPT2TokenizerFast, ViTImageProcessor
- Check for the presence of a CUDA device (otherwise, fall back to the CPU) and initialize the ViT+GPT model, the corresponding tokenizer, and the image_processor:
# check if the CUDA device is available
device = "cuda" if torch.cuda.is_available() else "cpu"
# load the pre-trained model
model = VisionEncoderDecoderModel.from_pretrained(
    "nlpconnect/vit-gpt2-image-captioning"
).to(device)
tokenizer = GPT2TokenizerFast.from_pretrained(
    "nlpconnect/vit-gpt2-image-captioning"
)
image_processor = ViTImageProcessor.from_pretrained(
    "nlpconnect/vit-gpt2-image-captioning"
)
Here:
- model is the ViT + GPT model used for image captioning
- tokenizer is the component needed by GPT to generate the textual representation of the image
- image_processor is the component in charge of preparing input features for the vision model. This includes transformations such as resizing, normalization, and conversion to PyTorch, TensorFlow, Flax, or NumPy tensors (see the quick check below).
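As a quick sanity check of what image_processor returns, you can run it on a single picture (the file name below is a hypothetical placeholder) and inspect the resulting pixel_values tensor, which for this checkpoint should be a normalized 224x224 RGB image:
# hypothetical example file: replace it with one of your own images
sample = Image.open("images/example.jpg").convert("RGB")
features = image_processor(sample, return_tensors="pt")
print(features["pixel_values"].shape)  # expected: torch.Size([1, 3, 224, 224])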
- Define a function that calls the model methods to get the caption from an image:
# get description from image
def get_caption(model, image_processor, tokenizer, image_path, device):
    # load the image (converted to RGB in case of grayscale or RGBA files)
    image = Image.open(image_path).convert("RGB")
    # preprocess the image into a tensor of pixel values
    img = image_processor(image, return_tensors="pt").to(device)
    # generate the caption token ids from the image features
    output = model.generate(**img)
    # decode the output of the model into a string
    caption = tokenizer.batch_decode(output, skip_special_tokens=True)[0]
    return caption
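Before processing the whole folder, it is worth trying the function on a single image (again, the file name is a hypothetical placeholder):
# hypothetical example file: replace it with one of your own images
print(get_caption(model, image_processor, tokenizer, "images/example.jpg", device))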
- Process all the images inside a given folder and create the image_collection.txt file containing all the corresponding captions that will be indexed in the Lucene searcher. The document format we use is the TREC format:
images_path = "images/*"
images_list = [f for f in glob.glob(images_path)]
total = len(images_list)
i = 1
for img in images_list:
    text = get_caption(model, image_processor, tokenizer, img, device)
    print("%i/%i - %s" % (i, total, text))
    img_id = os.path.basename(img)
    # build the TREC document with <DOC>, <DOCNO>, <DOCID> and <TEXT> tags
    doc_str = ("<DOC>\n<DOCNO> " + str(i) + " </DOCNO>\n<DOCID> " + img_id
               + " </DOCID>\n<TEXT>\n" + text + "\n</TEXT>\n</DOC>\n")
    with open("docs/image_collection.txt", "a") as text_file:
        text_file.write("%s" % doc_str)
    i = i + 1
For each image, the output will be a <DOC> element with this format:
<DOC>
<DOCNO> 1 </DOCNO>
<DOCID> image_name.jpg </DOCID>
<TEXT>
image caption text
</TEXT>
</DOC>
[Figure: examples of images with their respective predicted descriptions]
Implementation Of An Image Searcher On Lucene
Given the image_collection.txt file, with all the documents containing the descriptions of the images and the images’ file names, we need to implement the actual retrieval system. To accomplish this, we leverage the Lucene Java library. The main components of the system are the Parser, Analyzer, Indexer, and Searcher:
- Parser: removes tags from the document;
- Analyzer: processes the documents’ text into a stream of tokens. Used by Indexer and Searcher;
- Indexer: builds the inverted index given the streams of tokens generated by the Analyzer on each parsed document;
- Searcher: searches in the index for the documents related to a given analyzed query.
// folder containing image_collection.txt
final String docPath = "experiment/docs";
// index folder
final String indexPath = "experiment/index_images";
// path to the queries
final String queries = "experiment/queries/queries.txt";
// results folder
final String resultsPath = "experiment/results";
// result file name
String resultID = "result_image_retrieval";
// setup the textual analyzer that will process our description field
final Analyzer textAnalyzer = new EnglishAnalyzer();
// similarity function to use during search
final Similarity sim = new BM25Similarity();
// max number of documents to retrieve
final int maxDocsRetrieved = 1000;
// setup the indexer
final Indexer i = new Indexer(textAnalyzer, sim, indexPath, docPath, Parser.class);
// create the index
i.index();
// setup the searcher
final Searcher s = new Searcher(textAnalyzer, sim, indexPath, queries, resultID, resultsPath, maxDocsRetrieved);
// perform the search
s.search();
In the code above, Indexer and Searcher are only placeholder classes for this explanation: you can therefore implement whatever Indexer and Searcher you like.
In this case, the program saves the ranked results in the result_image_retrieval.txt file.
[Figure: the first result for the query “Dog running” using our implementation of Indexer and Searcher]
Conclusions
In this blog post we have seen how to retrieve images through a textual query. We have described how the Vision Transformer model works and how it can be used, in combination with a Generative Pre-trained Transformer, for our task. We have also shown, through code examples, how to do image captioning with a ViT+GPT model and how to implement a Lucene search engine for image retrieval.
Do You Want To Be Published?
This blog post is part of our collaboration with the University of Padua. If you are a University student or professor and want to collaborate, contact us through e-mail.