OpenAI has accused DeepSeek of improperly using its data to train a competing AI model.
In the last couple of weeks, there's been controversy surrounding DeepSeek, a Chinese AI startup, and its alleged use of OpenAI's proprietary models.
The issue arose after DeepSeek released two models, DeepSeek-V3 and DeepSeek-R1, which achieve performance comparable to OpenAI's counterparts at significantly lower costs.
The accusation has sparked a heated debate about intellectual property rights in the AI sector and the ethics of model distillation.
Model distillation, also known as knowledge distillation, is a machine learning technique used to transfer knowledge from a large, complex model (the "teacher") to a smaller, more efficient model (the "student").
A distilled model is basically a smaller model that performs similarly to the larger one but requires fewer computational resources.
If you're interested in how an OpenAI model is distilled, check out this documentation.
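To make the teacher/student idea concrete, here's a minimal sketch of the classic softened-softmax distillation objective (the Hinton-style formulation) in plain Python. This illustrates the technique in general, not OpenAI's or DeepSeek's actual pipeline; the logits and temperature below are made up for demonstration.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature > 1 softens the distribution, exposing the teacher's
    # relative preferences ("dark knowledge") to the student.
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    # KL(p || q): how far the student's distribution q is from the teacher's p.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # Match the teacher's softened probabilities; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return (temperature ** 2) * kl_divergence(p, q)

teacher = [2.0, 1.0, 0.1]
aligned_student = [2.0, 1.0, 0.1]      # already mimics the teacher
misaligned_student = [0.1, 1.0, 2.0]   # disagrees with the teacher
print(distillation_loss(teacher, aligned_student))     # 0.0
print(distillation_loss(teacher, misaligned_student))  # positive
```

In a real distillation run this loss (often mixed with the ordinary cross-entropy on ground-truth labels) is what the student is trained to minimize.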
In the fall of 2024, Microsoft's security researchers observed a group believed to be connected to DeepSeek extracting large amounts of data from OpenAI's API.
This activity raised concerns that DeepSeek was using distillation to replicate OpenAI's models without authorization. The excessive data retrieval was seen as a violation of OpenAI's terms and conditions, which restrict the use of its API for developing competing models.
According to Mark Chen, Chief Research Officer at OpenAI, DeepSeek managed to independently find some of the core ideas OpenAI had used to build its o1 reasoning model.
Chen noted that the reaction to DeepSeek, which caused NVIDIA to lose $650 billion in market value in a single day, might have been overblown.
"However, I think the external response has been somewhat overblown, especially in narratives around cost. One implication of having two paradigms (pre-training and reasoning) is that we can optimize for a capability over two axes instead of one, which leads to lower costs." - Mark Chen
While OpenAI has not disclosed full details, it says it has substantial evidence that DeepSeek used distillation techniques to train its models.
In response to these findings, OpenAI and Microsoft blocked access to OpenAI's API for accounts suspected of being linked to DeepSeek. This move is part of a broader effort by U.S. AI companies to safeguard their intellectual property and prevent unauthorized use of their models.
The controversy has also led to national security concerns, with the White House reviewing the implications of such practices on the U.S. AI industry.
Model distillation itself is not inherently illegal. It is a widely used technique in the AI industry to create more efficient models by transferring knowledge from a larger model to a smaller one.
Take the Stanford Alpaca model as an example. Alpaca is a language model fine-tuned via supervised learning from the LLaMA 7B model on 52K instruction-following demonstrations generated with OpenAI's text-davinci-003.
The data generation process produced 52K unique instructions and corresponding outputs, at a cost of less than $500 in OpenAI API usage.
This demonstrates how distillation can be used to create smaller, more affordable models that still perform well.
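For context, each of those 52K demonstrations is a small instruction/input/output record. The field names below follow the Stanford Alpaca repo's data format; the text values are invented for illustration:

```python
import json

# One Alpaca-style training record (field names as in the Stanford Alpaca
# repo); the text values here are made-up examples.
record = {
    "instruction": "Classify the sentiment of the sentence.",
    "input": "I loved this movie!",
    "output": "Positive",
}

def is_valid_alpaca_record(rec):
    # Every record needs an instruction and an output; "input" may be
    # empty for instructions that need no extra context.
    return isinstance(rec.get("instruction"), str) and isinstance(rec.get("output"), str)

print(is_valid_alpaca_record(record))  # True
print(json.dumps(record, indent=2))
```

The teacher model fills in the "output" field for each generated instruction, and the student is then fine-tuned on the resulting pairs.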
In fact, if you read DeepSeek's paper, the R1 release includes six dense models distilled from DeepSeek-R1 onto Qwen (Qwen, 2024b) and Llama (AI@Meta, 2024) base models:
"To support the research community, we open-source DeepSeek-R1-Zero, DeepSeek-R1, and six dense models (1.5B, 7B, 8B, 14B, 32B, 70B) distilled from DeepSeek-R1 based on Qwen and Llama."
Based on DeepSeek's findings, this straightforward distillation method significantly enhances the reasoning abilities of smaller models.
The controversy stems from allegations that DeepSeek used OpenAI's model outputs to fine-tune its own models, which may violate OpenAI's terms of service. This raises questions about fair use, data ownership, and the competitive landscape in the AI industry.
To use DeepSeek's API, you need to run "npm install openai."
Yep, you read that right. DeepSeek works with OpenAI's client libraries! This is possible because DeepSeek's REST API is fully compatible with OpenAI's API.
This is hilariously genius.
As a developer, this is actually a good thing, and I don't see it as a huge problem because this is a common practice for LLM providers and aggregators. OpenRouter, Ollama, DeepInfra, and a bunch of others do this too.
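"Compatible" here just means the endpoint path, JSON body, and bearer-token auth match OpenAI's chat completions format, which is why OpenAI's client libraries work once you swap the base URL and key. Here's a sketch of that request shape with nothing but Python's standard library; the base URL and model name follow DeepSeek's docs, the API key is a placeholder, and no network call is made.

```python
import json
import urllib.request

# DeepSeek's REST API mirrors OpenAI's chat completions endpoint; only the
# base URL and API key differ. (Base URL per DeepSeek's docs.)
DEEPSEEK_BASE_URL = "https://api.deepseek.com"

def build_chat_request(api_key, model="deepseek-chat", messages=None):
    # Same JSON body an OpenAI client would produce.
    body = json.dumps({
        "model": model,
        "messages": messages or [{"role": "user", "content": "Hello!"}],
    }).encode()
    return urllib.request.Request(
        url=f"{DEEPSEEK_BASE_URL}/chat/completions",
        data=body,
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # same auth scheme as OpenAI
        },
        method="POST",
    )

# Placeholder key; sending this request requires a real DeepSeek API key.
req = build_chat_request(api_key="sk-...")
print(req.full_url)  # https://api.deepseek.com/chat/completions
```

With the official SDKs (Python or the npm `openai` package), the equivalent change is passing DeepSeek's base URL and key to the client constructor.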
Speaking of API access, according to DeepSeek, you can use R1's API at a fraction of what you'd pay OpenAI.
Output tokens are almost 30 times cheaper than the $60 per million output tokens OpenAI charges for o1. That's a huge cost cut for companies running large-scale AI operations.
Check out this visual comparison of DeepSeek's R1 to OpenAI's models.
Switching to the R1 API would mean huge savings. You can learn more about DeepSeek's API access here.
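To put that pricing gap in numbers: the $60 per million output tokens for o1 is the figure cited above, and the roughly $2.19 per million output tokens for DeepSeek-R1 is taken from DeepSeek's pricing page at the time of writing. Treat both as snapshots, since providers change prices often.

```python
# Back-of-the-envelope comparison of output-token costs.
O1_OUTPUT_PER_M = 60.00  # USD per 1M output tokens (o1, cited above)
R1_OUTPUT_PER_M = 2.19   # USD per 1M output tokens (DeepSeek-R1, assumed snapshot)

def monthly_output_cost(tokens_per_month, price_per_million):
    """Cost in USD for a month's worth of output tokens."""
    return tokens_per_month / 1_000_000 * price_per_million

# Example: a service emitting 500M output tokens per month.
tokens = 500_000_000
o1_cost = monthly_output_cost(tokens, O1_OUTPUT_PER_M)  # $30,000
r1_cost = monthly_output_cost(tokens, R1_OUTPUT_PER_M)  # $1,095
print(f"o1: ${o1_cost:,.2f}, R1: ${r1_cost:,.2f}, ratio: {o1_cost / r1_cost:.1f}x")
```

At these prices the ratio works out to about 27x, which is where the "almost 30 times cheaper" claim comes from. Input-token prices and caching discounts shift the exact multiple for any real workload.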
DeepSeek was barely known outside research circles until last month, when it launched its V3 model. Since then, it has caused AI stocks to drop and has even been called a "competitor" by OpenAI's CEO. It's unclear how things will play out for DeepSeek in the coming months, but it has definitely caught the attention of both the public and major AI labs.
Ironically, it feels weird that OpenAI is accusing DeepSeek of IP theft, given its own history of copyright disputes. OpenAI gathered massive amounts of data from the internet to train its models, including copyrighted material, without seeking permission. This has led to lawsuits from authors such as George R.R. Martin, as well as a separate suit from Elon Musk.
OpenAI may become even more closed off as a result. Remember when Musk shut down free API access to X (formerly Twitter) over data scraping? OpenAI tightening access in a similar way isn't out of the question.
Software engineer, writer, solopreneur