OpenAI text embeddings are amazing, but not the best in every category
OpenAI just released their new embedding models, and it was a long time coming. Their previous embedding model was clearly lagging behind the times.
If you’ve read the announcement post, you’ll know they released two new embedding models: text-embedding-3-small and text-embedding-3-large.
They come with better pricing, and even better performance.
But we all know they’re not the only text embedding models out there. It’s a wild world out there. As someone who prefers open source if I can help it (mostly coz I’m broke as hell), I wanted to compare these new OpenAI models with the crème de la crème of the open source text embeddings.
For this comparison, we’ll look at the following:
and we will be comparing them on the following criteria/stats:
With the criteria and contenders out of the way, here is the comparison:
One thing we can notice immediately is that OpenAI’s new text-embedding-3-large model is only the second best performing model in this list with a score of 65.59.
The best performing model here is actually the E5 model with a score of 66.63.
Meanwhile, Jina has the worst performance of 60.38.
Winner: E5
We can see the same type of result with embedding dimensions as well, with text-embedding-3-large being the second best with 3072 dimensions, the E5 model being the top again with 4096 dimensions, and the Jina model being the worst of the bunch with 768 dimensions, which makes you wonder why I even put it in this list?
That’s because …
The Jina model is actually the first open source text embedding model to have an 8k max input token length. That’s why I felt it deserved a spot on this list. On this criterion, it’s on par with OpenAI’s new text embedding models.
But then again, they’re tied for second place, as the E5 model once again reigns supreme in this category with a 32k max input token length.
Winner: E5
Less explanation needed here. The open source models cost nothing to use, but the pricing for the new OpenAI embedding models is actually pretty good! Jina also has an API, but unless you bulk buy tokens, the OpenAI text embeddings are actually cheaper.
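To make "pretty good" a bit more concrete, here is a rough sketch of how you might estimate what it costs to embed a corpus through the API. The prices in the dictionary are an assumption (the per-million-token list prices as I understand them at launch), and the 50M-token corpus is purely hypothetical, so check the current pricing page before trusting any number that comes out of it.

```python
def embedding_cost_usd(num_tokens: int, price_per_million_tokens: float) -> float:
    """Estimate the API cost in USD for embedding `num_tokens` tokens."""
    return num_tokens / 1_000_000 * price_per_million_tokens


# ASSUMPTION: launch list prices in USD per 1M tokens. Verify before relying on them.
ASSUMED_PRICES = {
    "text-embedding-3-small": 0.02,
    "text-embedding-3-large": 0.13,
}

# Hypothetical corpus size, for illustration only.
corpus_tokens = 50_000_000

for model, price in ASSUMED_PRICES.items():
    cost = embedding_cost_usd(corpus_tokens, price)
    print(f"{model}: ${cost:.2f}")
```

Keep in mind the open source models are only "free" in the sense that there is no per-token bill. You still pay for whatever hardware you run them on.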
At a glance, based on the comparisons, E5 is the best model we’ve looked at today. But there’s another consideration:
The e5-mistral-7b-instruct is a heckin chonk of a model.
It doesn’t matter if it’s a really good model if you can’t use it. I’m unable to run this model at all on my smol M1 MacBook Air.
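To put a rough number on the chonk, here is a back-of-the-envelope sketch of how much memory the weights alone would need. It assumes about 7 billion parameters (inferred from the "7b" in the model name, not an exact count) and ignores activations and framework overhead, so treat the results as a lower bound rather than a measurement.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params: float, dtype: str) -> float:
    """Memory needed just to hold the weights, in GB (1 GB = 1e9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


# ASSUMPTION: ~7 billion parameters, inferred from the "7b" in the model name.
E5_MISTRAL_PARAMS = 7e9

for dtype in BYTES_PER_PARAM:
    gb = weight_memory_gb(E5_MISTRAL_PARAMS, dtype)
    print(f"{dtype}: ~{gb:.0f} GB")
```

Under those assumptions, half precision already needs around 14 GB for the weights alone. An M1 MacBook Air comes with 8 GB or 16 GB of unified memory shared with everything else on the machine, so it is not surprising that the model won’t run there.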
I have never used the OpenAI embeddings because of the great open source options out there, but I’m actually interested to try the new models because of their improved cost and performance, especially for commercial work.
But also, there’s…
but that’s a topic for another time 😁
But what do you think? I would love to hear your thoughts on this!