How to Choose the Right Large Language Model for Your Domain – Open Source Edition
Large Language Models (LLMs) are ubiquitous nowadays: they are the new big thing, and everyone is talking about the latest and greatest!
But one of the main challenges of starting a project that involves one or more language models is choosing the best one for your domain of information.
You shouldn’t neglect this step, as it is fundamental to the success of the project: choose superficially and you may end up with sub-optimal quality for your entire solution, you may require additional fine-tuning, or you may even end up with such poor performance that you need to go back to the selection phase (with the consequent waste of time and energy).
The process of choosing the right Large Language Model for your task would deserve a blog (or even a course) of its own, but we can summarise it as:
- identify clearly the problem you want to solve [Dense Retrieval? Question Answering? Summarisation? etc.]
- study your content and language
- identify a pre-trained model compatible with your data
- check if there’s already a fine-tuned version of the selected model for the task you are trying to solve
- fine-tune the model with your data if necessary
In this blog post, we want to give you an overview, by domain, of the best Large Language Models available at the time of writing (refer to the latest update of this blog post for the date).
We’ll focus on open-source-licensed models and follow up with another blog post covering proprietary, closed-source solutions.
Large language modeling is a hot, fast-paced field, so we would also recommend keeping an eye on the official Hugging Face leaderboard.
Hugging Face is the ‘de facto’ community for Machine Learning and includes a huge variety of open datasets and models.
You can use it as a reference for many of the models we present in this blog post, to explore their details and download them.
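For example, a quick way to explore what’s available is the huggingface_hub client library; the sketch below lists the most-downloaded models matching a search term. It assumes you have the library installed (`pip install huggingface_hub`), and the search term is just an illustration:

```python
# Sketch: programmatically browsing the Hugging Face Hub for candidate models.
# Assumes `pip install huggingface_hub`; the search term is only an example.
from huggingface_hub import HfApi

api = HfApi()

# List the 5 most-downloaded models whose name matches "falcon".
models = api.list_models(search="falcon", sort="downloads", direction=-1, limit=5)
for m in models:
    print(m.modelId)
```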
We would also like to thank this interesting pre-print as a strong source of inspiration for the blog post: https://arxiv.org/pdf/2305.18703.pdf
Without further ado, let’s get started!
1. Generalists
These models may be a good starting point if you don’t find a model that is specialized for your domain and task.
They have been trained mostly on web data and highly curated datasets (you can find the details on each model card).
Generalist Large Language Models are advancing fast and deliver high-quality performance on many tasks, even surpassing in-domain models on some of them.
Falcon
Falcon is a group of state-of-the-art language models created by the Technology Innovation Institute in Abu Dhabi, and released under the Apache 2.0 license.
According to their press release and upcoming paper, Falcon-40B is the first “truly open” model with capabilities rivaling many current closed-source models.
Pre-trained
https://huggingface.co/tiiuae/falcon-40b
This is the biggest model of the family: computationally expensive to run, but rivaling many high-quality closed models.
https://huggingface.co/tiiuae/falcon-7b
A smaller version, with excellent quality for the size.
Fine-tuned to follow instructions
https://huggingface.co/tiiuae/falcon-40b-instruct
This is the biggest Falcon model, fine-tuned for chat interaction (ChatGPT style).
https://huggingface.co/tiiuae/falcon-7b-instruct
This is the smaller version of Falcon, fine-tuned for chat interaction (ChatGPT style).
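As a minimal sketch, here is how the smaller instruct model can be loaded for text generation with the Hugging Face transformers library, assuming you have a GPU with enough memory and the accelerate package installed (older transformers releases required trust_remote_code=True to load Falcon’s custom code):

```python
# Sketch: generating text with Falcon-7B-Instruct via the transformers pipeline.
# Assumes a GPU with sufficient memory and `pip install transformers accelerate`.
import torch
from transformers import AutoTokenizer, pipeline

model_id = "tiiuae/falcon-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_id)

generator = pipeline(
    "text-generation",
    model=model_id,
    tokenizer=tokenizer,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    device_map="auto",
)

output = generator(
    "Write a one-line product description for a mechanical keyboard.",
    max_new_tokens=60,
    do_sample=True,
    top_k=10,
)
print(output[0]["generated_text"])
```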
LLaMA
LLaMA (Large Language Model Meta AI) is a group of state-of-the-art language models created by META, and released under the GNU General Public License v3.0.
For more information:
https://ai.facebook.com/blog/large-language-model-llama-meta-ai/
You can find many different adaptations of LLaMA models on Hugging Face, with different sizes and different fine-tuning; you can refer to this list (ordered by downloads):
https://huggingface.co/models?other=llama
Let’s see some of them:
https://huggingface.co/decapoda-research/llama-7b-hf
The 7b (7 billion parameters, the smallest) version of LLaMA in the Hugging Face format.
Fine-tuned to follow instructions
There are many LLaMA fine-tuned models around for different tasks, some examples here:
Stanford-Alpaca
Starting from Meta’s LLaMA 7B model, Alpaca has been fine-tuned on 52K instruction-following demonstrations generated in the style of self-instruct using text-davinci-003.
On the self-instruct evaluation set, Alpaca shows many behaviors similar to OpenAI’s text-davinci-003, but is also surprisingly small and easy/cheap to reproduce.
For more information:
https://crfm.stanford.edu/2023/03/13/alpaca.html
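As a hedged illustration, Alpaca-style models are usually queried with the instruction template used during fine-tuning; the snippet below sketches that prompt format in Python. The wording follows the template published in the Stanford Alpaca repository, but double-check the model card of the specific checkpoint you use:

```python
# Sketch: building an Alpaca-style instruction prompt.
# The template mirrors the one published in the Stanford Alpaca repository;
# verify it against the model card of the checkpoint you actually use.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n"
)

prompt = ALPACA_TEMPLATE.format(
    instruction="Summarise the plot of Hamlet in two sentences."
)
print(prompt)  # feed this string to the fine-tuned model's generate() call
```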
Vicuna
Vicuna was fine-tuned between March 2023 and April 2023 from LLaMA-7b using 70K conversations collected from ShareGPT.com.
Nous Hermes
Starting from LLaMA-13b, the model was fine-tuned almost entirely on synthetic GPT-4 outputs. This includes data from diverse sources such as GPTeacher, the general, roleplay v1&2, code instruct datasets, Nous Instruct & PDACTL (unpublished), CodeAlpaca, Evol_Instruct Uncensored, GPT4-LLM, and Unnatural Instructions.
2. E-commerce
When dealing with short e-commerce texts (mostly titles, descriptions, and some metadata), using sentence-transformers can be a good idea: they are fine-tuned large language models specialized in encoding sentences into vectors (see the short example below).
Among them:
https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2 is probably the smallest and most often used.
Fine-tuned from a mini version of a Microsoft pre-trained model: https://huggingface.co/microsoft/MiniLM-L12-H384-uncased
But many variants are available on Hugging Face under the sentence-transformers organisation.
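Here is a minimal sketch of how such a model can encode product titles and compare them by cosine similarity, assuming sentence-transformers is installed (`pip install sentence-transformers`); the product titles are made-up examples:

```python
# Sketch: encoding e-commerce product titles with a sentence-transformers model
# and comparing them by cosine similarity.
# Assumes `pip install sentence-transformers`; the titles are made-up examples.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

titles = [
    "Wireless noise-cancelling over-ear headphones",
    "Bluetooth over-ear headphones with ANC",
    "Stainless steel kitchen knife set",
]

embeddings = model.encode(titles, convert_to_tensor=True)
similarities = util.cos_sim(embeddings, embeddings)
print(similarities)  # the first two titles should score much closer than the third
```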
Also remember that e-commerce is not a domain in itself, so depending on the type of content you sell, it may be good to keep an eye on both the generalist models above and the specialist large language models below.
3. Specialist
• Fundamental Biomedicine Science
Each of these large language models is specialized in a different language, from general chemical molecules to DNA and proteins:
MoLFormer - Chemical molecules
website: https://github.com/IBM/molformer
The model is pre-trained on a large collection of chemical molecules represented by SMILES sequences from two public chemical datasets PubChem and Zinc in a self-supervised fashion.
This model can be used for different downstream tasks such as molecule similarity and molecule property predictions.
Nucleotide Transformer - DNA sequences
This family of language models was pre-trained on DNA sequences from whole genomes.
It leverages DNA sequences from over 3,200 diverse human genomes and 850 genomes from a wide range of species.
It offers extremely accurate molecular phenotype prediction compared to existing methods.
Evolutionary Scale Modeling - Proteins
This repository contains code and pre-trained weights for Transformer protein language models from the Meta Fundamental AI Research Protein Team (FAIR), including their state-of-the-art ESM-2, ESMFold and others.
If you want to know more, we recommend the 2019 preprint of the paper “Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences”.
ESM-2 outperforms all tested single-sequence protein language models across a range of structure prediction tasks. ESMFold harnesses the ESM-2 language model to generate accurate structure predictions end to end directly from the sequence of a protein.
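As a rough sketch (assuming the fair-esm package is installed with `pip install fair-esm`), here is how an ESM-2 checkpoint can be loaded to produce per-residue embeddings for a protein sequence; the sequence below is an arbitrary example:

```python
# Sketch: extracting per-residue embeddings from ESM-2 with the fair-esm package.
# Assumes `pip install fair-esm`; the protein sequence is an arbitrary example.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
batch_converter = alphabet.get_batch_converter()
model.eval()

data = [("protein1", "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")]
labels, strs, tokens = batch_converter(data)

with torch.no_grad():
    results = model(tokens, repr_layers=[33])
embeddings = results["representations"][33]  # shape: (batch, seq_len, hidden_dim)
print(embeddings.shape)
```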
• Biomedical - Clinical Healthcare Support
The models in this list were mostly pre-trained on large-scale biomedical literature, and some of them also integrate clinical notes and additional sources of information:
BioGPT
website: https://github.com/microsoft/BioGPT
This model has been pre-trained on large-scale biomedical literature using a GPT-2 architecture and achieves impressive results on most biomedical NLP tasks.
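As a hedged sketch, the checkpoint published as microsoft/biogpt on Hugging Face can be used for generation through the standard transformers pipeline (assuming a transformers release recent enough to ship the BioGPT architecture):

```python
# Sketch: biomedical text generation with BioGPT via the transformers pipeline.
# Assumes a transformers version that includes the BioGPT architecture.
from transformers import pipeline

generator = pipeline("text-generation", model="microsoft/biogpt")
output = generator("COVID-19 is", max_new_tokens=40, do_sample=False)
print(output[0]["generated_text"])
```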
BioMedLM (previously known as PubMedGPT)
BioMedLM 2.7B is a new language model trained exclusively on biomedical abstracts and papers from The Pile.
This GPT-style model can achieve strong results on a variety of biomedical NLP tasks, including a new state-of-the-art performance of 50.3% accuracy on the MedQA biomedical question-answering task.
GatorTron
This family of large language models was pre-trained on a large corpus consisting of over 90 billion words from clinical narratives and scientific literature.
It’s based on a BERT architecture.
• Geospatial Semantic Tasks
These tasks involve, for example, detecting named locations within a given text snippet, or identifying more detailed location descriptions, like home addresses, highways, roads, and administrative regions, from text snippets such as tweets.
We couldn’t find any modern open-source large language model specific to this domain; looking around, many findings suggest that task-agnostic LLMs can currently surpass task-specific, fully-supervised models on geo tasks.
But I am sure more will come, so feel free to add a comment if/when a dedicated LLM comes around!
• Finance
FinBERT
website: https://github.com/ProsusAI/finBERT
This model starts from pre-trained BERT, is further trained on a subset of the Reuters TRC2 dataset, and is finally fine-tuned for sentiment analysis using the Financial PhraseBank from Malo et al. (2014).
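A minimal sketch of using the released checkpoint for financial sentiment classification, assuming it is the one published as ProsusAI/finbert on the Hugging Face Hub:

```python
# Sketch: financial sentiment analysis with FinBERT via transformers.
# Assumes the checkpoint published as ProsusAI/finbert on the Hugging Face Hub.
from transformers import pipeline

classifier = pipeline("text-classification", model="ProsusAI/finbert")
print(classifier("The company reported a sharp drop in quarterly revenue."))
# expected output: a label such as 'negative' with a confidence score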
• Legal
LEGAL-BERT
LEGAL-BERT is a family of BERT models for the legal domain, intended to assist legal NLP research, computational law, and legal technology applications.
To pre-train the different variations of LEGAL-BERT, the authors collected 12 GB of diverse English legal text from several fields (e.g., legislation, court cases, contracts) scraped from publicly available resources.
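As a rough example (assuming the checkpoint published as nlpaueb/legal-bert-base-uncased on Hugging Face), LEGAL-BERT can be used out of the box for masked-token prediction on legal text:

```python
# Sketch: masked-token prediction with LEGAL-BERT via transformers.
# Assumes the checkpoint published as nlpaueb/legal-bert-base-uncased on the Hub.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="nlpaueb/legal-bert-base-uncased")
for prediction in fill_mask("The lessee shall pay the [MASK] on the first day of each month."):
    print(prediction["token_str"], round(prediction["score"], 3))
```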
• Software Engineering (Code)
Large language models trained on code (LLMCs) are versions of LLMs specialized in programming languages; let’s see some examples:
Star Coder
The StarCoderBase models are 15.5B parameter models trained on 80+ programming languages from The Stack (v1.2), with opt-out requests excluded. The model uses Multi Query Attention, a context window of 8192 tokens, and was trained using the Fill-in-the-Middle objective on 1 trillion tokens.
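As a hedged sketch (assuming access to the bigcode/starcoderbase checkpoint on Hugging Face, which is gated behind acceptance of its license and requires an authenticated token), StarCoder-family models can be used for straightforward code completion:

```python
# Sketch: code completion with a StarCoder-family model via transformers.
# Assumes access to the gated bigcode/starcoderbase checkpoint, an authenticated
# Hugging Face token, and `pip install transformers accelerate`; adjust the
# dtype/device settings to your hardware.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "bigcode/starcoderbase"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

prompt = 'def fibonacci(n):\n    """Return the n-th Fibonacci number."""\n'
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=60)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```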
That’s it for now!
Updates and comments will follow as we are in a very dynamic time for Search and language modeling!
Still struggling with the choice of the right Large Language Model for your project?
As we said at the beginning, choosing the right Large Language Model can be difficult, and making the wrong choice can affect the quality of your entire solution.
If you need to find the best Large Language Model for your domain and language, contact us now, and we’ll be delighted to help you!
Subscribe to our newsletter
Did you like this post about How to Choose the Right Large Language Model for Your Domain – Open Source Edition? Don’t forget to subscribe to our newsletter to stay up to date on the Information Retrieval world!
Author
Alessandro Benedetti
Alessandro Benedetti is the founder of Sease Ltd. As a Senior Search Software Engineer, his focus is on R&D in information retrieval, information extraction, natural language processing, and machine learning.