The new LLaVa 1.6 model was released and it's actually good.
We already knew open source multimodal models like LLaVa existed, but if you're like me, they probably weren't your first choice for any task.
That changes with this update, for me at least.
The new LLaVa 1.6 model was released recently as the successor to LLaVa 1.5, which came out in October of last year.
So, what changed since then?
If you've used LLaVA-1.5 before, you'd definitely feel the difference.
But how does it compare to the other models out there?
As the benchmarks show, LLaVA's scores have gone up by leaps and bounds with this new model. It's even better than some commercial models, like Gemini Pro!
Let's see it in action!
Let's imagine we have a very simple product identification task.
Let's see how LLaVa-1.6 performs. First, let's see the 34B model in action. We can check out their demo for that (mostly coz my laptop can't handle it).
With a very simple prompt, I'm asking it to tell me the name of the product. It's able to get "Australian Pork & Beef Bolognese Mince". That's already pretty good! I would like it to add the variations as well, but a little prompting should be enough to get it to perform how I want it.
This time, I'll try it locally on my laptop using Ollama. But since my laptop is weak, I can only use the 13B model. I'll be using the same image to test as well. Let's see how this goes!
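(For reference, here's roughly what that test looks like with the official ollama Python library; the prompt and the image filename below are placeholders, so swap in your own.)

import ollama

# Ask the local LLaVa-1.6 13B model to read the product name off an image.
response = ollama.chat(
    model="llava:13b-v1.6",
    messages=[
        {
            "role": "user",
            "content": "What is the name of the product in this image?",
            # Path to the test image; replace with your own file.
            "images": ["product.jpg"],
        }
    ],
)

print(response["message"]["content"])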
Using the 13B model, we get "Australia's Finest Pork Mince", which is a bit far from what I expected. Still a bit better compared to when I used LLaVa-1.5 7B.
Now, let's see how GPT-4V (via ChatGPT) performs in this case!
In this particular case, we get:
The name of the product in the image is "Australian Pork & Beef Bolognese Mince".
That is actually NOT what I expected: I expected it to respond with just the specific product name, especially because it's a premium commercial model.
Maybe GPT-4V is better for a lot of cases, but for this specific case and prompt, LLaVa-1.6 34B actually performed better relative to my expectations.
How about Gemini Pro? In their benchmarks, it's already been established that LLaVa-1.6 performs better than Gemini Pro, but how does it do on this specific task? Let's find out!
The free Google Bard uses Gemini Pro, so it seems like the best way to try it out. Its response is:
The product in the image is Coles Australian Pork & Beef Bolognaise Mince..
I guess there's a couple of issues here:
- The name doesn't quite match: it says "Bolognaise" instead of the "Bolognese" the other models read off the label.
- Why is there a ".." at the end?
For this specific task, Gemini Pro via Google Bard does not perform as well as LLaVa-1.6 34B and GPT-4V.
You can try the LLaVa demo yourself at llava.hliu.cc.
If you want to use this in your projects, you can use an API like Replicate, but if you're going to use an API, honestly you might as well use GPT-4V or Gemini Ultra instead.
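(If you do go the Replicate route, it looks roughly like this with their Python client. The model slug below is just an example of a community-hosted LLaVa-1.6 model; check Replicate for the exact model and version you want to use.)

import replicate

# Example community-hosted LLaVa-1.6 model; swap in the exact
# model/version reference you find on Replicate.
output = replicate.run(
    "yorickvp/llava-v1.6-34b",
    input={
        "image": open("product.jpg", "rb"),  # placeholder test image
        "prompt": "What is the name of the product in this image?",
    },
)

# The output comes back as a stream of text chunks, so join them.
print("".join(output))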
As with most open source AI models, I tend to use Ollama, as I've shown earlier.
You can try it out on the console with this:
# 13B model
ollama run llava:13b-v1.6
# 34B model
ollama run llava:34b-v1.6
or using the Ollama API or the Python/JS libraries.
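For example, calling the local Ollama REST API from Python looks something like this (assuming Ollama is running on its default port; the image path is a placeholder):

import base64
import requests

# The Ollama API expects images as base64-encoded strings.
with open("product.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = requests.post(
    "http://localhost:11434/api/generate",  # Ollama's default local endpoint
    json={
        "model": "llava:13b-v1.6",
        "prompt": "What is the name of the product in this image?",
        "images": [image_b64],
        "stream": False,  # return one complete response instead of streaming
    },
)

print(resp.json()["response"])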
I also like trying out LLMs in a single file using llamafile, like I did with LLaVa-1.5, but unfortunately, there's no LLaVa-1.6 llamafile yet. Hopefully somebody makes one soon!
Based on the stats and the actual performance:
LLaVa-1.6 34B is a big step for open source multimodal models.
OCR-wise and reasoning-wise (check out their demo!), LLaVa-1.6 34B performs better than I expected, close to the current commercial models (and even better in some cases)! Currently, only the 34B model can claim this, as the 13B model isn't up to par with expectations just yet.
So would I actually use this model? That's a BIG YES.
But would I actually use this in production? No, at least not yet.
For the current use cases I have in mind, its current capabilities, which are LEAGUES beyond the previous version, LLaVA-1.5, are good enough for me.
The problem is the time it takes to respond.
Despite the speed-up in inference time, which I like a lot, it's still not fast enough for production.
Even though the inference time is on par with the likes of GPT-4V, I wouldn't use any model that takes this long to process, even commercial ones. If you need to process images in real time, taking around 5 seconds to respond will be a significant bottleneck.
But for my personal tools and side projects, I would definitely use LLaVA-1.6.