How to Choose the Right Large Language Model for Your Domain – Open Source Edition
Large Language Models (LLMs) are ubiquitous nowadays: they are the new big thing, and everyone is talking about the latest and greatest!
But one of the main challenges of starting a project that involves one or more language models is choosing the best one for your domain of information.
You shouldn’t neglect this step, as it is fundamental to the success of the project: choose superficially and you may end up with sub-optimal quality for your entire solution, you may require additional fine-tuning, or you may even end up with such poor performance that you need to go back to the selection phase (with the consequent waste of time and energy).
The process of choosing the right Large Language Model for your task would deserve a blog (or even a course) of its own, but we can summarise it as:
- identify clearly the problem you want to solve [Dense Retrieval? Question Answering? Summarisation? etc.]
- study your content and language
- identify a pre-trained model compatible with your data
- check if there’s already a fine-tuned version of the selected model for the task you are trying to solve
- fine-tune the model with your data if necessary
In this blog post, we want to give you an overview, by domain, of the best Large Language Models available at the time of writing (refer to the latest update of this blog post for the date).
We’ll focus on open-source-licensed models and follow up with another blog post covering proprietary, closed-source solutions.
Large language modeling is a hot, fast-paced field, so we would also recommend keeping an eye on the official Hugging Face leaderboard.
Hugging Face is the ‘de facto’ community for Machine Learning and includes a huge variety of open datasets and models.
You can use it as a reference for many of the models we present in this blog post, to explore their details and download them.
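For example, a quick way to explore what’s available is the huggingface_hub client library; the sketch below lists the most-downloaded models matching a search term. It assumes you have the library installed (`pip install huggingface_hub`), and the search term is just an illustration:

```python
# Sketch: programmatically browsing the Hugging Face Hub for candidate models.
# Assumes `pip install huggingface_hub`; the search term is only an example.
from huggingface_hub import HfApi

api = HfApi()

# List the 5 most-downloaded models whose name matches "falcon".
models = api.list_models(search="falcon", sort="downloads", direction=-1, limit=5)
for m in models:
    print(m.modelId)
```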
We would also like to thank this interesting pre-print as a strong source of inspiration for the blog post: https://arxiv.org/pdf/2305.18703.pdf
Without further ado, let’s get started!
1. Generalists
These models may be a good starting point if you don’t find a model that is specialized for your domain and task.
They have been trained mostly on web data and highly curated datasets (you can find the details on each model card).
Generalist Large Language Models are advancing fast and deliver high-quality performance on many tasks, even surpassing in-domain models on some of them.
Falcon
Falcon is a group of state-of-the-art language models created by the Technology Innovation Institute in Abu Dhabi, and released under the Apache 2.0 license.
According to their press release and upcoming paper, Falcon-40B is the first “truly open” model with capabilities rivaling many current closed-source models.
Pre-trained
https://huggingface.co/tiiuae/falcon-40b
This is the biggest model of the family: computationally expensive to run, but rivaling many high-quality closed models.
https://huggingface.co/tiiuae/falcon-7b
A smaller version, with excellent quality for the size.
Fine-tuned to follow instructions
https://huggingface.co/tiiuae/falcon-40b-instruct
This is the biggest Falcon model, fine-tuned for chat interaction (ChatGPT style).
https://huggingface.co/tiiuae/falcon-7b-instruct
This is the smaller version of Falcon, fine-tuned for chat interaction (ChatGPT style).
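As a minimal sketch, here is how the smaller instruct model can be loaded for text generation with the Hugging Face transformers library, assuming you have a GPU with enough memory and the accelerate package installed (older transformers releases required trust_remote_code=True to load Falcon’s custom code):

```python
# Sketch: generating text with Falcon-7B-Instruct via the transformers pipeline.
# Assumes a GPU with sufficient memory and `pip install transformers accelerate`.
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

output = generator(
    "Write a one-line product description for a mechanical keyboard.",
    max_new_tokens=60,
    do_sample=True,
    top_k=10,
)
print(output[0]["generated_text"])
```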
LLaMA
LLaMA (Large Language Model Meta AI) is a group of state-of-the-art language models created by META, and released under the GNU General Public License v3.0.
For more information:
https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
You can find many different adaptations of LLaMA models on Hugging Face, with different sizes and different fine-tuning; you can refer to this list (ordered by downloads):
https://huggingface.co/models?other=llama
Let’s see some of them:
https://huggingface.co/decapoda-research/llama-7b-hf
The 7b (7 billion parameters, the smallest) version of LLaMA in the Hugging Face format.
Fine-tuned to follow instructions
There are many LLaMA fine-tuned models around for different tasks, some examples here:
Stanford-Alpaca
Starting from Meta’s LLaMA 7B model, Alpaca has been fine-tuned on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003.
On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI’s text-davinci-003, but is also surprisingly small and easy/cheap to reproduce.
For more information:
https://crfm.stanford.edu/2023/03/13/alpaca.html
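As a hedged illustration, Alpaca-style models are usually queried with the instruction template used during fine-tuning; the snippet below sketches that prompt format in Python. The wording follows the template published in the Stanford Alpaca repository, but double-check the model card of the specific checkpoint you use:

```python
# Sketch: building an Alpaca-style instruction prompt.
# The template mirrors the one published in the Stanford Alpaca repository;
# verify it against the model card of the checkpoint you actually use.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(
    instruction="Summarise the plot of Hamlet in two sentences."
)
print(prompt)  # feed this string to the fine-tuned model's generate() call
```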
Vicuna
Vicuna was fine-tuned between March 2023 and April 2023 from LLaMA-7b using 70K conversations collected from ShareGPT.com.
Nous Hermes
Starting from LLaMA-13b, the model was fine-tuned almost entirely on synthetic GPT-4 outputs. This includes data from diverse sources such as GPTeacher, the general, roleplay v1&2, code instruct datasets, Nous Instruct & PDACTL (unpublished), CodeAlpaca, Evol_Instruct Uncensored, GPT4-LLM, and Unnatural Instructions.
2. E-commerce
When dealing with short e-commerce texts (mostly titles, descriptions, and some metadata), using sentence-transformers can be a good idea: they are fine-tuned large language models specialized in encoding sentences into vectors (see the short example below).
Among them:
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 is probably the smallest and most often used.
Fine-tuned from a mini version of a Microsoft pre-trained model: https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
But many variants are available on Hugging Face under the sentence-transformers organisation.
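Here is a minimal sketch of how such a model can encode product titles and compare them by cosine similarity, assuming sentence-transformers is installed (`pip install sentence-transformers`); the product titles are made-up examples:

```python
# Sketch: encoding e-commerce product titles with a sentence-transformers model
# and comparing them by cosine similarity.
# Assumes `pip install sentence-transformers`; the titles are made-up examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

titles = [
    "Wireless noise-cancelling over-ear headphones",
    "Bluetooth over-ear headphones with ANC",
    "Stainless steel kitchen knife set",
]

embeddings = model.encode(titles, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # the first two titles should score much closer than the third
```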
Also remember that e-commerce is not a domain in itself, so depending on the type of content you sell, it may be good to keep an eye on both the generalist models above and the specialist large language models below.
3. Specialist
• Fundamental Biomedicine Science
Each of these large language models is specialized in a different language, from general chemical molecules to DNA and proteins:
MoLFormer - Chemical molecules
website: https://github.com/IBM/molformer
The model is pre-trained on a large collection of chemical molecules represented by SMILES sequences from two public chemical datasets PubChem and Zinc in a self-supervised fashion.
This model can be used for different downstream tasks such as molecule similarity and molecule property predictions.
Nucleotide Transformer - DNA sequences
This family of language models was pre-trained on DNA sequences from whole genomes.
It leverages DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide range of species.
It offers extremely accurate molecular phenotype prediction compared to existing methods.
Evolutionary Scale Modeling - Proteins
This repository contains code and pre-trained weights for Transformer protein language models from the Meta Fundamental AI Research Protein Team (FAIR), including their state-of-the-art ESM-2, ESMFold and others.
If you want to know more, we recommend the 2019 preprint of the paper “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences”.
ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks. ESMFold harnesses the ESM-2 language model to generate accurate structure predictions end to end directly from the sequence of a protein.
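As a rough sketch (assuming the fair-esm package is installed with `pip install fair-esm`), here is how an ESM-2 checkpoint can be loaded to produce per-residue embeddings for a protein sequence; the sequence below is an arbitrary example:

```python
# Sketch: extracting per-residue embeddings from ESM-2 with the fair-esm package.
# Assumes `pip install fair-esm`; the protein sequence is an arbitrary example.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33])
embeddings = results["representations"][33]  # shape: (batch, seq_len, hidden_dim)
print(embeddings.shape)
```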
• Biomedical - Clinical Healthcare Support
The models in this list were mostly pre-trained on large-scale biomedical literature, and some of them also integrate clinical notes and additional sources of information:
BioGPT
website: https://github.com/microsoft/BioGPT
This model has been pre-trained on large-scale biomedical literature using a GPT-2 architecture and achieves impressive results on most biomedical NLP tasks.
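As a hedged sketch, the checkpoint published as microsoft/biogpt on Hugging Face can be used for generation through the standard transformers pipeline (assuming a transformers release recent enough to ship the BioGPT architecture):

```python
# Sketch: biomedical text generation with BioGPT via the transformers pipeline.
# Assumes a transformers version that includes the BioGPT architecture.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/biogpt")
output = generator("COVID-19 is", max_new_tokens=40, do_sample=False)
print(output[0]["generated_text"])
```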
BioMedLM (previously known as PubMedGPT)
BioMedLM 2.7B is a new language model trained exclusively on biomedical abstracts and papers from The Pile.
This GPT-style model can achieve strong results on a variety of biomedical NLP tasks, including a new state-of-the-art performance of 50.3% accuracy on the MedQA biomedical question-answering task.
GatorTron
This family of large language models was pre-trained on a large corpus consisting of over 90 billion words from clinical narratives and scientific literature.
It’s based on a BERT architecture.
• Geospatial Semantic Tasks
These tasks involve, for example, detecting named locations within a given text snippet, or identifying more detailed location descriptions, like home addresses, highways, roads, and administrative regions, from text snippets such as tweets.
We couldn’t find any modern open-source large language model specific to this domain; looking around, many findings suggest that task-agnostic LLMs can currently surpass task-specific, fully-supervised models on geo tasks.
But I am sure more will come, so feel free to add a comment if/when a dedicated LLM comes around!
• Finance
FinBERT
website: https://github.com/ProsusAI/finBERT
This model starts from pre-trained BERT, is further trained on a subset of the Reuters TRC2 dataset, and is finally fine-tuned for sentiment analysis using the Financial PhraseBank from Malo et al. (2014).
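A minimal sketch of using the released checkpoint for financial sentiment classification, assuming it is the one published as ProsusAI/finbert on the Hugging Face Hub:

```python
# Sketch: financial sentiment analysis with FinBERT via transformers.
# Assumes the checkpoint published as ProsusAI/finbert on the Hugging Face Hub.
from transformers import pipeline

classifier = pipeline("text-classification", model="ProsusAI/finbert")
print(classifier("The company reported a sharp drop in quarterly revenue."))
# expected output: a label such as 'negative' with a confidence score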
• Legal
LEGAL-BERT
LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications.
To pre-train the different variations of LEGAL-BERT, the authors collected 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources.
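As a rough example (assuming the checkpoint published as nlpaueb/legal-bert-base-uncased on Hugging Face), LEGAL-BERT can be used out of the box for masked-token prediction on legal text:

```python
# Sketch: masked-token prediction with LEGAL-BERT via transformers.
# Assumes the checkpoint published as nlpaueb/legal-bert-base-uncased on the Hub.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")
for prediction in fill_mask("The lessee shall pay the [MASK] on the first day of each month."):
    print(prediction["token_str"], round(prediction["score"], 3))
```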
• Software Engineering (Code)
Large language models trained on code (LLMCs) are versions of LLMs specialized in programming languages; let’s see some examples:
Star Coder
The StarCoderBase models are 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded. The model uses Multi Query Attention, a context window of 8192 tokens, and was trained using the Fill-in-the-Middle objective on 1 trillion tokens.
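As a hedged sketch (assuming access to the bigcode/starcoderbase checkpoint on Hugging Face, which is gated behind acceptance of its license and requires an authenticated token), StarCoder-family models can be used for straightforward code completion:

```python
# Sketch: code completion with a StarCoder-family model via transformers.
# Assumes access to the gated bigcode/starcoderbase checkpoint, an authenticated
# Hugging Face token, and `pip install transformers accelerate`; adjust the
# dtype/device settings to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```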
That’s it for now!
Updates and comments will follow as we are in a very dynamic time for Search and language modeling!
Still struggling with the choice of the right Large Language Model for your project?
As we said at the beginning, choosing the right Large Language Model can be difficult, and making the wrong choice can affect the quality of your entire solution.
If you need to find the best Large Language Model for your domain and language, contact us now, and we’ll be delighted to help you!
Subscribe to our newsletter
Did you like this post about How to Choose the Right Large Language Model for Your Domain – Open Source Edition? Don’t forget to subscribe to our newsletter to stay up to date on the Information Retrieval world!
Author
Alessandro Benedetti
Alessandro Benedetti is the founder of Sease Ltd. As a Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.