Are LLMs and In-Context Learning Enough for NLP?

A case for smaller specialised models

Jose Camacho Collados
Aug 24, 2024

With the growth of Large Language Models (LLMs), I’ve been wondering whether smaller models still have a role in Natural Language Processing (NLP). While I believe nothing comes close to LLMs in purely generative tasks, there are other, more focused tasks where smaller models can still play an important role (and even be better!).

Context: LLMs have shown tremendous success in NLP and beyond. Part of their success is their ability to solve tasks without explicit training data. These models can provide reliable answers in many different applications just by explaining the task to them, and/or by showing a handful of examples in a prompt, which is known as In-Context Learning (ICL).
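To make this concrete, here is a minimal sketch of what a few-shot ICL prompt might look like for sentiment classification. The task description, examples and labels below are invented for illustration; any standard completion API could consume this string:

```python
# A toy in-context learning prompt: the task is explained and a handful of
# labelled examples are shown, all inside the prompt itself. No training or
# gradient updates are involved; the model simply continues the pattern.
prompt = """Classify the sentiment of the text as positive or negative.

Text: The film was a delight from start to finish.
Sentiment: positive

Text: I regret buying this phone, it broke within a week.
Sentiment: negative

Text: The staff were friendly and the food was amazing.
Sentiment:"""

# A capable LLM is expected to continue with "positive".
```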

In one of our recent papers, “Language Models for Text Classification: Is In-Context Learning Enough?” [1], led by Aleksandra Edwards, we analysed whether ICL is enough for solving (simple) classification tasks. To do so, we evaluated LLMs such as LLaMA and GPT, and compared them with smaller models (e.g. BERT-like models such as RoBERTa) fine-tuned for the task. After experiments on 16 heterogeneous datasets, the conclusions were more or less clear: LLMs with ICL can provide strong results, but still not as good as those of fine-tuned RoBERTa models. In fact, this result is consistent across binary, multiclass and multilabel classification tasks.
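For readers less familiar with the fine-tuning side of the comparison, below is a minimal sketch of fine-tuning a RoBERTa classifier with the Hugging Face transformers library. This is not the exact setup from the paper: the dataset (IMDB as a stand-in binary task) and the hyperparameters are illustrative placeholders:

```python
# Hedged sketch of fine-tuning roberta-base for binary classification.
# Dataset and hyperparameters are placeholders, not the paper's setup.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

dataset = load_dataset("imdb")  # stand-in classification dataset
tokenizer = AutoTokenizer.from_pretrained("roberta-base")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

dataset = dataset.map(tokenize, batched=True)

model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

args = TrainingArguments(
    output_dir="roberta-clf",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
    tokenizer=tokenizer,  # enables dynamic padding via the default collator
)
trainer.train()
```

A model like this fits comfortably on a single consumer GPU, which is part of the appeal compared to prompting a much larger model.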

Aggregated results across different datasets. On the left, models using prompting/ICL. On the right, models using fine-tuning.

We presented this paper at the LREC-COLING 2024 conference in Turin (Italy). Many people came to our poster presentation, from both industry and academia, almost all of them with clear opinions on the subject! Those coming to the poster fell into roughly two groups, which I’ll refer to as “LLM skeptics” and “LLM believers”:

  • LLM skeptics: “I’ve been trying to use LLMs for task X, but a simple BERT classifier was always better!”; “I’ve been going crazy, as I thought I was the only one this happened to!”
  • LLM believers: “If you do better prompt engineering, LLMs will perform better”; “The experimental setting was not fair”; “You haven’t used the latest LLaMA-3 or GPT-4o in your experiments”
Presenting “Language Models for Text Classification: Is In-Context Learning Enough?” at LREC-COLING 2024.

The discussions were very enriching, and I believe everybody was partly right! Depending on your level of expertise and the problem you are trying to solve, you may opt for a simple solution based on a smaller specialised model, or for an LLM well optimised for the task if you have sufficient resources and expertise.

Initially, I thought that perhaps I was just not up to date with the latest prompting or ICL techniques. However, almost simultaneously, several other papers appeared with similar findings for other tasks and different experimental settings [2–4], all showing how smaller specialised or fine-tuned models can match or outperform large models that rely only on in-context learning. Or, more generally, that fine-tuning may still be required for better generalisation [5]. While not its main focus, we also found similar trends in another paper, led by Dimosthenis Antypas, on the social media domain, where we constructed SuperTweetEval, a unified NLP benchmark for Twitter [6]. In line with the findings above, an initial evaluation showed that smaller fine-tuned models were generally more reliable than LLMs with ICL.

[Shameless plug] If you are interested in using these specialised models for your social media research, check out our TweetNLP models on Hugging Face! In fact, it is largely due to the unexpected success of these models, even in the age of LLMs (they have regularly been among the most downloaded models on the Hugging Face Hub for the past two years), that I decided to study this topic in more detail.
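As a quick example, one of these models can be loaded in a couple of lines with the transformers pipeline. I’m using one of our Cardiff NLP sentiment models here; swap in whichever task-specific model from the Hub fits your use case:

```python
# Load a TweetNLP-style model from the Hugging Face Hub and classify a tweet.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="cardiffnlp/twitter-roberta-base-sentiment-latest",
)

print(classifier("I love the new TweetNLP models!"))
# e.g. [{'label': 'positive', 'score': 0.98}]
```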

These results for the social media domain extend to other areas, such as political science [3] and specialised clinical/biomedical tasks [4]. There are surely other papers in other domains that I have missed, but this seems to be a general trend. Given all these results, I started having conversations with other people about this topic, and seeing discussions on Twitter, all of which made me realise that I was perhaps not the only one with these feelings or experiences.

TLDR: To sum up, small models that can be run on a standard laptop can often provide efficient and simple solutions for NLP problems for which training data is available (and even provide better performance than LLMs!). In general, I believe that for relatively simple non-generative tasks that do not require a high level of reasoning, LLMs are often overkill and may lead to less-than-optimal performance. This is especially relevant in business/research settings where inference calls can be expensive, or where thousands/millions of predictions are required. In particular, I just want to highlight that encoder BERT-like models such as RoBERTa can still be quite powerful for classification tasks! 🔥🔥

In my opinion, as a research community we have perhaps moved too quickly to focus solely on LLMs, when there are still many valid applications of, and much research to be done on, these “smaller” models that revolutionised the NLP landscape not long ago!

Of course, this may very well change in the near future! With the amazing progress in LLMs (and in making them more efficient/smaller), they could quickly catch up and prove us wrong once again. Nonetheless, in practical settings that point may still be far off, and fundamental issues or trade-offs will probably remain. For example, in a recent keynote at ACL 2024, Sunita Sarawagi rigorously laid out the trade-offs of different methods for solving NLP tasks, including fine-tuning and in-context learning. One conclusion I took away from her excellent talk is that no method is ideal in all circumstances (at least among the methods we know of!), and there will always be some trade-offs.

Extra: If you are interested in this topic, I recently gave two talks at the University of Cambridge and at the KAIST NLP Workshop in Seoul. A simplified version of the slides can be found here.


References

[1] Aleksandra Edwards and Jose Camacho-Collados. 2024. Language Models for Text Classification: Is In-Context Learning Enough? In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy.

[2] Martin Juan José Bucher and Marco Martini. 2024. Fine-Tuned ‘Small’ LLMs (Still) Significantly Outperform Zero-Shot Generative AI Models in Text Classification. arXiv preprint arXiv:2406.08660.

[3] Mitchell Bosley et al. 2023. Do We Still Need BERT in the Age of GPT? Comparing the Benefits of Domain-Adaptation and In-Context-Learning Approaches to Using LLMs for Political Science Research.

[4] Yanis Labrak, Mickael Rouvier, and Richard Dufour. 2024. A Zero-shot and Few-shot Study of Instruction-Finetuned Large Language Models Applied to Clinical and Biomedical Tasks. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), Torino, Italy.

[5] Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, and Yanai Elazar. 2023. Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation. In Findings of the Association for Computational Linguistics: ACL 2023, Toronto, Canada.

[6] Dimosthenis Antypas, Asahi Ushio, Francesco Barbieri, Leonardo Neves, Kiamehr Rezaee, Luis Espinosa-Anke, Jiaxin Pei, and Jose Camacho-Collados. 2023. SuperTweetEval: A Challenging, Unified and Heterogeneous Benchmark for Social Media NLP Research. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore.
