Assessing a new language model strategy
In this article, I will introduce what I consider a new set of techniques aimed at helping us extract more from language models when building applications.
Leveraging multiple diverse language models to respond to a user’s query.
When it comes to building and leveraging large language models, we have already seen important contributions built on three core ideas and techniques.
Let us consider a possible emerging category of techniques: one that involves ‘blending’ together multiple diverse language models. The key idea is to present the same question or context to each model and use one or more of their answers to respond to the user. This blending approach treats each model as a ‘black box’, meaning the models can have different architectures and can be trained on different data sets.
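To make the setup concrete, here is a minimal Python sketch of the ‘black box’ view: the same prompt is sent to every model, and the candidate answers are collected for a later selection or fusion step. The model callables below are placeholders of my own invention, not real APIs.

```python
from typing import Callable, Dict

# Each model is treated as a black box: a callable that maps a prompt to an answer.
ModelFn = Callable[[str], str]

def collect_candidates(prompt: str, models: Dict[str, ModelFn]) -> Dict[str, str]:
    """Ask every model the same question and gather the candidate answers."""
    return {name: generate(prompt) for name, generate in models.items()}

# Stand-in model callables; in practice these would wrap real inference APIs
# or locally hosted models (the names below are illustrative only).
models: Dict[str, ModelFn] = {
    "model_a": lambda p: f"[model_a answer to: {p}]",
    "model_b": lambda p: f"[model_b answer to: {p}]",
    "model_c": lambda p: f"[model_c answer to: {p}]",
}

candidates = collect_candidates("What causes ocean tides?", models)
for name, answer in candidates.items():
    print(name, "->", answer)
```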
Now the key question is: why bother to use multiple models at all?
One core reason is cost: although LLMs with 175+ billion parameters, such as GPT-3 (the model family behind ChatGPT), can be applied to many tasks with high accuracy, large models are simply very expensive to use.
Second, LLM leaderboards are dominated by proprietary models, so it is in everyone’s interest to find optimal ways to leverage open-source models.
Interestingly, it has been found that using a combination of multiple models is indeed better than simply picking any single one for open-domain chat. In LLM-Blender [2], the authors evaluated various models against a custom-defined set of 5,000 questions and found that the best answer varied from model to model (see Fig. 1), with no single model emerging as the clear winner. The real complexity, of course, lies in automatically selecting the best answer for each question and potentially fusing the top answers into one superior response. The downside I see here is the additional response time.
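A simplified selection step might look like the sketch below. This is only in the spirit of LLM-Blender’s ranking stage, not the paper’s actual PairRanker/GenFuser models; `score_fn` and `toy_score` are hypothetical stand-ins for a trained ranker or reward model.

```python
from typing import Callable, Dict, Tuple

def select_best(candidates: Dict[str, str],
                score_fn: Callable[[str], float]) -> Tuple[str, str]:
    """Return (model_name, answer) for the highest-scoring candidate."""
    best_name = max(candidates, key=lambda name: score_fn(candidates[name]))
    return best_name, candidates[best_name]

# Toy scorer: prefer longer, more detailed answers. Purely illustrative;
# a real system would use a trained pairwise ranker or reward model here.
def toy_score(answer: str) -> float:
    return float(len(answer.split()))

winner, best_answer = select_best(
    {"model_a": "Tides are caused by the Moon.",
     "model_b": "Tides result from the gravitational pull of the Moon and the Sun."},
    toy_score,
)
print(winner, "->", best_answer)
```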
We can also use blending to generate more diverse responses by randomly selecting which language model answers each incoming question or turn. In one set of experiments, this approach was found to increase user retention and engagement [3].
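For this random-selection flavour of blending, a per-turn sketch could be as simple as the following (again with placeholder model callables; the approach in [3] conditions each model on the full conversation history, which is omitted here for brevity).

```python
import random
from typing import Callable, Dict

ModelFn = Callable[[str], str]

def blended_reply(prompt: str, models: Dict[str, ModelFn]) -> str:
    """Draw one model at random for this turn and return its answer."""
    name = random.choice(list(models))
    return models[name](prompt)
```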
As a further potential benefit, we could use outliers among the models’ answers to help identify and mitigate hallucinations and biases, providing a better overall user experience.
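One way this could work (my own illustration, not something taken from the cited papers) is to flag any answer that agrees poorly with the rest of the ensemble, for example via a crude token-overlap similarity:

```python
from typing import Dict, List

def jaccard(a: str, b: str) -> float:
    """Crude token-overlap similarity between two answers."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if (ta or tb) else 0.0

def flag_outliers(candidates: Dict[str, str], threshold: float = 0.2) -> List[str]:
    """Flag models whose answer agrees poorly with the other answers."""
    flagged = []
    for name, answer in candidates.items():
        others = [ans for other, ans in candidates.items() if other != name]
        if not others:
            continue
        avg_sim = sum(jaccard(answer, o) for o in others) / len(others)
        if avg_sim < threshold:
            flagged.append(name)
    return flagged
```

A low-agreement answer is not necessarily wrong, of course; the flag is only a signal that the response deserves a second look or a re-query.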
I am personally curious whether blending can be used for some degree of personalization, such as shaping responses to match the user’s reading level or background knowledge.
The concept of using multiple models to produce a single output is well established in machine learning, with several ensemble techniques already mainstream, such as the many decision trees inside Random Forest or AdaBoost, and voting schemes that combine diverse model types. A detailed explanation is available in various sources, including Wikipedia [4]. These techniques improve performance on classification and regression tasks, so ensembling multiple language models for natural language processing tasks appears to be a natural extension of these existing methods.
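For comparison, here is what classic ensembling looks like with scikit-learn’s VotingClassifier, which combines heterogeneous models by majority vote; blending applies the same intuition to language models. The dataset and model choices are arbitrary, purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# A small synthetic classification problem.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Combine two different model types by majority vote.
ensemble = VotingClassifier(
    estimators=[
        ("logreg", LogisticRegression(max_iter=1000)),
        ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
    ],
    voting="hard",  # hard voting = majority vote across member predictions
)
ensemble.fit(X_train, y_train)
print("ensemble accuracy:", ensemble.score(X_test, y_test))
```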
How useful can blending be? To judge its utility accurately, we need to see much more work in this area, similar to the extensive research conducted on RAG. Perhaps the way to match the performance of large language models is a combination of techniques applied to smaller models; for example, blending smaller models with RAG and active learning might provide an effective solution.
[1] Phi-2: The surprising power of small language models: https://www.microsoft.com/en-us/research/blog/phi-2-the-surprising-power-of-small-language-models/
[2] LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion: https://arxiv.org/pdf/2306.02561.pdf
[3] Blending Is All You Need: Cheaper, Better Alternative to Trillion-Parameters LLM: https://arxiv.org/abs/2401.02994
[4] Ensemble learning, Wikipedia: https://en.wikipedia.org/wiki/Ensemble_learning