Just a few days ago, OpenAI made an announcement that shocked the world: a new reasoning model they had just trained achieved a gold medal at the IMO 2025 contest. While that alone is impressive, the magical part is that it did so without relying on automated theorem provers like Lean, tool use, internet access, or any other form of augmentation. The model simply thought like a human would and wrote valid proofs to new, unseen math problems in plain English, doing so within the same time constraints as a human contestant and scoring gold with just one submission. This was a feat that even the most enthusiastic AI fans and researchers considered years away. In this post, I will review the fundamental ideas behind advancements in reasoning models, the challenges, and some thoughts on the future of artificial intelligence.
Some Time Ago
To truly understand what is going on in the AI world, we need to step back to May 13, 2024, a little more than a year ago. That was the day OpenAI announced the first omni model, named GPT-4o. It was the successor to GPT-4 and brought many improvements, in particular the ability to listen and speak like a human. GPT-4o was made possible through advancements in pre-training multimodal LLMs (Tom Brown et al., 2020). For context, the great age of LLMs we have today rests on the discovery that training a large transformer model to predict the next token on datasets of trillions of tokens (a token could be text, discretised audio (Aaron van den Oord et al., 2017), an image patch, etc.) produces broadly capable models. The scaling laws (Jared Kaplan et al., 2020) showed that we can continually increase the overall intelligence and problem-solving ability of LLMs by jointly scaling up the model size, the data size, and the amount of compute used during training. While this led to amazing systems like GPT-4, Gemini, and Claude, the relationship between model performance and scale follows a power law, which means each jump in capability requires orders of magnitude more compute than the previous one. For example, while the jump from GPT-2 to GPT-3 was comparatively easy, the jump from GPT-3 to GPT-4 required significantly more compute, and a jump from GPT-4 to a GPT-5-class model would require orders of magnitude more still. As you keep scaling, at some point the amount of compute needed to go from one generation to the next becomes extreme. GPT-4.5 exemplified the challenges of scaling from one generation of LLMs to another.
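To make the power-law relationship concrete, here is a minimal sketch. The exponent and reference constant are purely illustrative placeholders, not the fitted values from Kaplan et al. (2020); the point is only that each fixed improvement in loss demands multiplying the compute.

```python
# Illustrative power-law scaling of loss with training compute.
# ALPHA and C0 are made-up values for demonstration only.
ALPHA = 0.05   # hypothetical scaling exponent
C0 = 1.0       # hypothetical reference compute (arbitrary units)

def loss(compute: float) -> float:
    """Loss under an illustrative power law L(C) = (C0 / C) ** ALPHA."""
    return (C0 / compute) ** ALPHA

for c in [1e3, 1e6, 1e9, 1e12]:
    # Each row needs 1000x more compute than the previous one,
    # yet the loss only improves by a modest constant factor.
    print(f"compute={c:.0e}  loss={loss(c):.3f}")
```

Each thousand-fold increase in compute buys roughly the same multiplicative improvement in loss, which is why successive model generations get so much more expensive to train.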
With that in mind, it is unsurprising that new models like GPT-4o brought new modalities but did not significantly advance the reasoning capabilities of LLMs. At the time, the following core issues plagued even the best LLMs:
They sucked at math
They were not great for agents
Their coding abilities were very limited
Back then, you could not get LLMs to reliably multiply 3-digit numbers correctly.
This was primarily due to the limited reasoning capabilities of LLMs: while they had some ability to reason implicitly, the lack of an explicit reasoning system made them struggle on many tasks.
The o1 Moment
In September 2024, less than a year ago, OpenAI unveiled the o1 series of models. Unlike the prior generation of LLMs, o1 was trained via reinforcement learning to think in plain English before committing to a final answer. This allowed the model to:
Plan its response
Self-correct and recover from mistakes
Explore multiple solutions, self-evaluate, and choose the best response.
This led to a massive leap in the capability of these models in domains such as maths, science, and coding.
The benchmark numbers tell the story: while GPT-4o scored an abysmal 13.4% on the AIME 2024 contest, o1 scored 83.8%. The same can be observed on Codeforces, with a jump from 11% to 89%. Similarly, GPQA Diamond, a benchmark of PhD-level science questions in physics, biology, and chemistry, saw an improvement from 56.1% to 78%. These are dramatic improvements that would likely have required scaling to something like GPT-6 under the next-token-prediction pre-training paradigm.
Fundamentally, models like o1 were trained by taking an existing foundation model like GPT-4o and training it via reinforcement learning to solve problems whose solutions can be verified (OpenAI 2024, DeepSeek 2025). During the RL training, there are two primary rewards.
The first is a format reward that forces the model to think before responding. An example is having the model generate its answers as:
<think> some thoughts </think> final answer
If the model fails to follow this format, it receives a reward of 0; if it does, it receives a reward of 1. The second reward is based on the correctness of the model's final answer: for math questions, this would be a reward of 1 if the model answered correctly and 0 if it failed. Notably, no reward is applied to the contents of the <think> </think> block, which means the model's thought process is unrestricted and it has maximum freedom to generate whatever chain of thought enables it to answer correctly.
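As a rough illustration of these two rewards (not OpenAI's actual implementation; the helper names and the exact-match check are simplifications I am assuming for the sketch):

```python
import re

def format_reward(response: str) -> float:
    """1.0 if the response wraps its reasoning in <think>...</think>
    followed by a final answer, else 0.0."""
    return 1.0 if re.match(r"(?s)\s*<think>.*</think>\s*\S", response) else 0.0

def correctness_reward(response: str, reference_answer: str) -> float:
    """1.0 if the final answer (the text after </think>) matches the
    verified reference answer, else 0.0. Nothing inside the <think>
    block is scored, so the chain of thought stays unconstrained."""
    final = response.split("</think>")[-1].strip()
    return 1.0 if final == reference_answer.strip() else 0.0

response = "<think>7 * 6 = 42</think> 42"
print(format_reward(response), correctness_reward(response, "42"))  # 1.0 1.0
```

In practice the correctness check is usually a more robust verifier (numeric comparison, symbolic equivalence, unit tests), but the structure of the signal is the same: a sparse reward on the final answer, none on the reasoning.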
One important thing to note from the above is that because the chain of thought is completely unrestricted, any specific behaviours the model exhibits during its thinking process, including backtracking, correcting mistakes, and planning, are all emergent rather than human-coded.
Additionally, this process is remarkably data efficient, requiring far less training data than pre-training; most of the focus is on data quality. Much of the training data for o1 came from domains where final answers can be easily verified, such as algebra, competitive programming, and other programming problems where you can use unit tests to check correctness.
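For programming problems, for instance, correctness can be checked automatically by running the model's code against unit tests. Below is a simplified sketch of such a verifier; the function name and the assumed `solve` entry point are hypothetical, not taken from any particular training pipeline.

```python
def verify_solution(candidate_code: str, test_cases: list[tuple]) -> float:
    """Execute the model's candidate solution and reward 1.0 only if
    every test case passes. Automatic checks like this are what make
    competitive-programming tasks 'easy to verify'."""
    namespace = {}
    try:
        exec(candidate_code, namespace)   # define the candidate function
        solve = namespace["solve"]        # assumed entry point name
        return 1.0 if all(solve(*args) == expected
                          for args, expected in test_cases) else 0.0
    except Exception:
        return 0.0                        # crashes or wrong output earn no reward

# Example: reward a correct addition implementation
code = "def solve(a, b):\n    return a + b"
print(verify_solution(code, [((2, 3), 5), ((10, -4), 6)]))  # 1.0
```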
To understand intuitively why this approach yielded such dramatic improvements, we can look to the work of the Nobel Prize-winning psychologist Daniel Kahneman. In his 2011 book, “Thinking, Fast and Slow”, he explained that human thinking can be categorized into two classes: system 1 thinking and system 2 thinking.
System 1 thinking is fast and instinctive, like answering 2 + 2, replying to “How are you?”, or instinctively reaching to catch a pen as it drops. These are decisions made in a flash without much deliberation. Prior large language models such as GPT-4 basically operate in system 1 mode all the time: you ask them a question and they jump straight to an answer.
System 2 thinking is slow and deliberative, often involving planning and considering multiple possible courses of action. Examples include deriving a mathematical proof, writing a research paper, multiplying 3-digit numbers, and designing a new piece of software. Without proper thinking, it is almost impossible to do these tasks accurately, and often the longer the thinking time devoted to them, the better the outcome. Harder tasks require longer and smarter thinking, while easier tasks require less. Reasoning models like o1 are designed to operate in system 2 mode by training them to reason explicitly before taking an action.
One question that comes to mind is, “Can't we just tell the model to reason before answering?” The answer is a small yes and a BIG NO! In 2022, Jason Wei et al. published the famous chain-of-thought paper, which showed that asking an LLM to think before responding improves its performance on many STEM tasks. However, the improvements were nowhere near as dramatic as those achieved by o1. Jason Wei later worked on o1, and the key difference is that o1, via RL, learned to generate optimal chains of thought that maximize its performance across tasks.
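For reference, chain-of-thought prompting amounts to nothing more than changing the prompt. The example below uses the zero-shot “think step by step” variant rather than the few-shot exemplars of the original paper, purely to illustrate the contrast with RL-trained reasoning.

```python
# Plain prompting: the model tends to jump straight to an answer (system 1).
plain_prompt = (
    "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?\n"
    "A:"
)

# Chain-of-thought prompting: the same question, but the model is nudged
# to write out intermediate steps before the final answer (closer to system 2).
cot_prompt = (
    "Q: A train travels 60 km in 45 minutes. What is its speed in km/h?\n"
    "A: Let's think step by step."
)
```

The prompt only invites reasoning; it does not teach the model which reasoning strategies actually lead to correct answers. That is what the RL training adds.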
New Scaling Laws
With o1 came two new scaling laws that pointed to an imminent acceleration in model capabilities. The first is rather obvious: accuracy continues to improve as the RL training is scaled up. Unlike pre-training, scaling up the RL training takes less compute and can rapidly yield improvements that would otherwise have required far more pre-training compute.
The second scaling law is the more remarkable discovery. It states that accuracy increases as the thinking time of the model increases. This led to the so-called test-time compute paradigm, where we can simply increase the inference compute assigned to a problem in order to solve a harder task, rather than needing to train a new, stronger model.
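To see intuitively how spending more inference compute helps, here is a toy sketch in the spirit of self-consistency sampling. This is not how o1 allocates its thinking budget; the simulated model and its 60% success probability are stand-ins for illustration.

```python
import random
from collections import Counter

def sample_answer(problem: str) -> str:
    """Stand-in for one model rollout: imagine a full chain of thought
    ending in a final answer. Simulated here with noise."""
    return "42" if random.random() < 0.6 else str(random.randint(0, 99))

def answer_with_budget(problem: str, num_samples: int) -> str:
    """Spend more inference compute by sampling more chains of thought
    and taking a majority vote over the final answers."""
    votes = Counter(sample_answer(problem) for _ in range(num_samples))
    return votes.most_common(1)[0][0]

random.seed(0)
for budget in (1, 8, 64):
    print(budget, answer_with_budget("hard problem", budget))
```

With a single sample the answer is unreliable, but as the budget grows the majority vote converges on the answer the model gets right most often. Longer single chains of thought exploit a similar effect without the explicit voting.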
Both of these new scaling laws offered a way out of the data wall problem: continually scaling pre-training requires far more data than is available today, since we are already at the limits of publicly available internet data.
Improvements and Challenges with o1
o1 sparked a wave of new research, and in the months after its announcement almost every new frontier model adopted this approach, leading to rapid advancements in model capabilities across coding, maths, agentic workflows, multimodal understanding, writing, and more.
o3 and o4-mini brought a wave of improvements, including function calls within the chain of thought, allowing models to, for example, look up information on the web or execute Python code as part of their thinking process.
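Conceptually, this means a tool request emitted mid-reasoning gets executed and its result is fed back into the model's context before thinking continues. The toy loop below illustrates the idea; the `CALL python(...)` syntax and the `run_tool` dispatcher are invented for this sketch and do not reflect any real tool-call format.

```python
def run_tool(name: str, argument: str) -> str:
    """Hypothetical tool dispatcher for the two tools mentioned above."""
    if name == "python":
        return str(eval(argument))            # execute a short expression
    if name == "search":
        return f"<results for '{argument}'>"  # stand-in for a web lookup
    return "unknown tool"

# A fragment of thinking that requests a tool call mid-reasoning:
thought = "I should check 37 * 91 before concluding. CALL python(37 * 91)"
if "CALL python(" in thought:
    expr = thought.split("CALL python(")[1].rstrip(")")
    thought += f" -> {run_tool('python', expr)}"  # result appended to the context
print(thought)
```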
New models such as o3, Gemini 2.5 Pro, Claude 4, Grok 4 and the latest DeepSeek R1 are all improved versions of the original o1 approach, with a core focus on moving beyond easy-to-verify domains, thinking for longer, and enabling agents that can work for long periods without losing context.
However, despite all these improvements, these models work best in domains where answers are easy to verify, and because their reward is based on the final answer being correct, they sometimes reach the right answer without actually following a sound approach. This is very evident in formal proof competitions such as the IMO, where scores are not based solely on the final answer but on the correctness of every step leading to it. Solving such problems requires being able to train reasoning models on domains that are hard to verify. For example, after the recent IMO 2025 contest, the publicly available frontier LLMs, including Gemini 2.5 Pro, o3, Grok 4 and o4-mini, were evaluated on the problems and performed poorly, failing to achieve even a bronze medal. Then, in a dramatic turn of events, a few hours after this finding was published, OpenAI announced that a breakthrough in how they train reasoning models had enabled them to build a model that achieved gold: just an autoregressive LLM, thinking and working in plain English, with no formal proof language, no tool calls, no internet access, the same time limit as humans, and just one submission. It was an earth-shattering announcement, altering every timeline on when AGI and superintelligence will arrive. While details are scarce at the moment on how it was achieved, I will explain a bit of what we know so far from public posts by OpenAI researchers.
A New Research Breakthrough
At the core of the IMO gold result is a new research breakthrough by a small team of OpenAI researchers: Alexander Wei, Sheryl Hsu and Noam Brown. They invented a new approach focused on training the model to reason in hard-to-verify domains. This is important because while some problems can be easily verified, a great many intelligent tasks cannot. For example, you can verify that a piece of software passes unit tests, but that is just a small part of software development: you also want the software to provide a good UX, be built in an easy-to-maintain way, follow best practices in the codebase, be easy to onboard onto, and so on. The same is true for most work in law, finance, scientific research, etc. A training algorithm that enables an AI model to learn in hard-to-verify domains opens the floodgates to making LLMs work well in a wide range of scenarios where they currently fail.
Notably, the model used in the IMO competition is not a model specifically trained for mathematical proofs; rather, it is a general-purpose LLM that can also write proofs at the level of human experts.
Furthermore, while the IMO result got the most attention, the same breakthrough was used to train a model with a level of coding ability significantly ahead of anything publicly available today. Last week at the AtCoder World Tour Finals Heuristic in Tokyo, Japan, OpenAI's model competed live against top human coders in a tough coding contest, with no human intervening in the model's actions. The model worked autonomously for 10 hours, beating most of the humans and finishing in second place, beaten only by Psyho, who had to battle the model tirelessly, with very little sleep, holding the line for humans.
Very recently, OpenAI released its ChatGPT Agent. This used an early version of the new algorithm and an older base model. It is a powerful agent that can perform most tasks a human can do on a computer. It effectively combines Operator and Deep Research and can work for a much longer time.
In summary, OpenAI has made a new breakthrough that:
Enables general purpose RL on domains that are hard to verify
Enables models to think for several hours to solve a task
This breakthrough is quite recent and will not be present in GPT-5, which will be released soon. However, in the next several months (hopefully by year's end) OpenAI will release a new model (some GPT-5.x) that incorporates this advancement. We will hopefully learn more about the exact algorithm behind this breakthrough in the coming months.
Final Thoughts
Superintelligence has never been closer. The advancement from GPT-4o in May 2024 to a general-purpose LLM winning IMO gold in July 2025 is straight out of science fiction. RL is the key to unlocking superintelligence, and much of the remaining work lies in building great simulations of many real-world tasks. Anything that we can simulate can be solved. Autoregressive GPT models might have appeared to be basic text-completion engines to many in the days of GPT-2, but as LLMs evolve, it is clear that what is derived from next-token prediction is something much more powerful, akin to the process that occurs in the human brain.
To finish, I will leave you with this great quote from Ilya Sutskever.
“What does it mean to predict the next token well enough? It is a deeper question than it seems. Predicting the next token well means you understand the underlying reality that led to the creation of that token…”