OpenAI's new models can reason through complex tasks and solve harder problems than previous models in science, coding, and math.
After months of teasing on social media and hiding behind the codename “Project Strawberry,” the highly anticipated new language model from OpenAI is finally here — it’s called ‘o1’.
It’s a bit unconventional that they didn’t name it GPT-5 or GPT-4.1. So, why did they go with o1?
According to OpenAI, the advancements in these new models are so significant that they felt the need to reset the counter back to 1:
But for complex reasoning tasks this is a significant advancement and represents a new level of AI capability. Given this, we are resetting the counter back to 1 and naming this series OpenAI o1.
The main focus of these models is to think and reason through complex tasks and solve harder problems. So, don’t expect it to be lightning-fast; instead, it delivers better and more logical answers than previous models.
The o1 family of models comes in two variants: o1-mini and o1-preview.
OpenAI emphasizes that these new models are trained with reinforcement learning to perform complex reasoning. But what exactly does reasoning mean in the context of LLMs?
Much like how humans ponder for a while before answering a difficult question, o1 uses a chain of thought when attempting to solve a problem.
It learns to recognize and correct its mistakes. It learns to break down tricky steps into simpler ones. It learns to try a different approach when the current one isn’t working.
The key point is that reasoning allows the model to consider multiple approaches before generating a final response.
Here’s the process: for each turn, the model first produces internal reasoning tokens, uses them to work out a visible answer, and then discards the reasoning tokens from the conversation. Discarding reasoning tokens keeps the context focused on essential information.
Note: While reasoning tokens are not visible via the API, they still occupy space in the model’s context window and are billed as output tokens.
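This billing detail matters in practice: the hidden reasoning tokens can dwarf the visible answer. Here is a minimal sketch of the arithmetic, using made-up token counts and a placeholder price, not actual OpenAI rates:

```python
def billed_output_tokens(visible_tokens: int, reasoning_tokens: int) -> int:
    """Reasoning tokens never appear in the API response text,
    but they are billed as output tokens alongside the visible answer."""
    return visible_tokens + reasoning_tokens


def output_cost(visible_tokens: int, reasoning_tokens: int,
                price_per_million: float) -> float:
    """Cost of one completion's output. price_per_million is a
    placeholder rate for illustration, not a real OpenAI price."""
    total = billed_output_tokens(visible_tokens, reasoning_tokens)
    return total / 1_000_000 * price_per_million


# A response with a 200-token visible answer and 1,500 hidden
# reasoning tokens is billed for 1,700 output tokens.
print(billed_output_tokens(200, 1500))  # 1700
```

So a short visible answer can still be a comparatively expensive request if the model "thought" for a long time first.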
This approach may be slow, but according to NVIDIA senior researcher Jim Fan, we are finally seeing the paradigm of inference-time scaling popularized and deployed in production.
Jim makes some excellent points:
To test how the o1 models stack up against GPT-4o, OpenAI evaluated them on a diverse set of human exams and ML benchmarks.
The graph above demonstrates that o1 greatly improves over GPT-4o on challenging reasoning benchmarks involving math, coding, and science questions.
In evaluating the newly released o1 models, OpenAI found that they excel on the GPQA-diamond benchmark — a challenging intelligence test that assesses expertise in chemistry, physics, and biology.
To compare the model’s performance to that of humans, OpenAI collaborated with experts holding PhDs who answered the same GPQA-diamond questions.
Remarkably, o1 surpassed these human experts, becoming the first model to do so on this benchmark. While this doesn’t imply that o1 is superior to a PhD in all respects, it does indicate that the model is more proficient in solving certain problems that a PhD would be expected to solve.
You can read more about the technical report of o1 models here.
Now, to see how well o1 performs compared to the previous model, GPT-4o, let’s look at a classic problem: counting the number of ‘r’s in the word “strawberry.”
Prompt: How many ‘r’ letter are in the word strawberry?
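The correct answer here is easy to verify programmatically, which is exactly why this question became a popular trap for LLMs:

```python
# Count the occurrences of 'r' in "strawberry": the answer the model should reach.
word = "strawberry"
r_count = word.count("r")
print(r_count)  # 3
```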
Let’s try another one. This time, we’ll ask both models to come up with a list of countries with the letter ‘A’ in the third position in their names.

Prompt: Give me 5 countries with letter A in the third position in the name
Again, o1 answered correctly, despite taking longer to ‘think’ than GPT-4o.
Even Sam Altman acknowledged that o1 is still flawed and limited. It might seem more impressive on first use than it does after you spend more time with it.
Sometimes, it can still make mistakes — even on simple questions like asking how many ‘r’s are in its response.
Another thing to note is that o1 models offer significant advancements in reasoning but are not intended to replace GPT-4o in all use cases.
For applications that need image inputs, function calling, or consistently fast response times, the GPT-4o and GPT-4o mini models will continue to be the right choice.
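As a rough illustration of that guidance, model selection might look like the toy heuristic below. This is my own sketch, not an official OpenAI decision table:

```python
def pick_model(needs_vision: bool = False,
               needs_function_calling: bool = False,
               needs_low_latency: bool = False,
               hard_reasoning: bool = False) -> str:
    """Toy routing heuristic: o1 currently lacks image inputs and
    function calling and responds more slowly, so those requirements
    route to GPT-4o; hard reasoning tasks go to o1-preview."""
    if needs_vision or needs_function_calling or needs_low_latency:
        return "gpt-4o"
    if hard_reasoning:
        return "o1-preview"
    return "gpt-4o-mini"


print(pick_model(hard_reasoning=True))                      # o1-preview
print(pick_model(needs_vision=True, hard_reasoning=True))   # gpt-4o
```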
For developers, several of o1's chat completion API parameters are not yet available: temperature, top_p, and n are fixed at 1, while presence_penalty and frequency_penalty are fixed at 0.

o1 is rolling out today in ChatGPT to all Plus and Team users, and in the API for developers on tier 5.
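Since those parameters are fixed, one way to reuse existing request code with o1 is to strip them out before sending the request. The helper below is my own sketch, not part of the openai SDK:

```python
# Chat completion parameters currently pinned to fixed values for o1,
# per the restrictions listed above.
UNSUPPORTED_O1_PARAMS = {
    "temperature", "top_p", "n", "presence_penalty", "frequency_penalty",
}


def sanitize_for_o1(params: dict) -> dict:
    """Return a copy of the request kwargs with the parameters o1
    does not accept removed, leaving everything else untouched."""
    return {k: v for k, v in params.items() if k not in UNSUPPORTED_O1_PARAMS}


request = {
    "model": "o1-preview",
    "messages": [{"role": "user", "content": "Prove sqrt(2) is irrational."}],
    "temperature": 0.2,  # fixed at 1 for o1, so drop it
    "top_p": 0.9,        # likewise fixed at 1
}
clean = sanitize_for_o1(request)
print(sorted(clean))  # ['messages', 'model']
```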
If you’re a free ChatGPT user, OpenAI mentioned that they’re planning to bring o1-mini access to all ChatGPT Free users, but no specific schedule was provided.
o1 is also available in the OpenAI Playground. Just log in to https://platform.openai.com/ and, under the Playground tab, set the model to either “o1-mini” or “o1-preview”.
There are also the dated API model snapshots “o1-mini-2024-09-12” and “o1-preview-2024-09-12”, which are already accessible to developers.
If you’re used to your usual prompting with models like Claude 3.5 Sonnet, Gemini Pro, or GPT-4o, prompting o1 models is different.
o1 models perform best with straightforward prompts. Some prompt engineering techniques, like few-shot prompting or instructing the model to “think step by step,” may not enhance performance and can sometimes hinder it.
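For example (these are my own illustrative prompts, not taken from OpenAI's guide), the plain version below is the style o1 favors over the heavily engineered one:

```python
# An engineered prompt of the kind that can actually hurt o1's performance:
overengineered = (
    "You are a world-class mathematician. Think step by step. "
    "Here are some worked examples: ... "
    "Now solve: what is the 10th Fibonacci number?"
)

# The straightforward prompt o1 models prefer:
simple = "What is the 10th Fibonacci number?"

messages = [{"role": "user", "content": simple}]
print(messages[0]["content"])
```

The model already performs its own chain-of-thought internally, so spelling it out in the prompt is redundant at best.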
Check out some best practices:
Okay, so o1 is impressive when it comes to chat-based problem-solving and content generation. But do you know what I’m most excited about? Its integration into coding assistants like Cursor AI.
I’ve already seen folks plugging in their API keys into Cursor and using o1 to write code for them. I haven’t tried it yet, but I’m super excited to give it a go.
From my initial tests, o1’s ability to think, plan, and execute is off the charts. We’re basically witnessing a ChatGPT moment for agentic coding systems. The implications of its new capabilities are immense.
I genuinely believe that the wave of brand-new products that will be built with this will be unlike anything we’ve ever seen. The new possibilities in the world of software development are thrilling, and I can’t wait to see how o1 will revolutionize the way we code and build applications in the coming weeks.