OpenAI’s math gold hints that AI may soon tackle even longer and harder tasks

An unreleased AI model from OpenAI has reportedly solved five out of six problems from the International Mathematical Olympiad (IMO) under competition conditions. The more significant part is not what it solved, but how.

OpenAI says an experimental language model scored 35 out of 42 possible points in an internal IMO-style test – enough for a gold medal. Three former IMO winners independently graded the model’s natural language proofs, which were evaluated just like submissions from human contestants. According to the company, the test mirrored real IMO rules: two four-and-a-half-hour sessions, no internet, no external tools or code – just text.
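The reported figures can be checked with some quick arithmetic. The sketch below is only an illustration, not anything OpenAI published: it assumes the standard IMO marking of up to 7 points per problem and that the five solved problems earned full marks while the sixth earned none. The function name `imo_score` is made up for this example.

```python
# Illustrative arithmetic only; the constants reflect standard IMO marking,
# and the full-marks-or-zero assumption is ours, not OpenAI's.
POINTS_PER_PROBLEM = 7   # each IMO problem is graded from 0 to 7
NUM_PROBLEMS = 6
SESSION_HOURS = 4.5      # per the article: two sessions of four and a half hours
NUM_SESSIONS = 2


def imo_score(solved: int) -> tuple[int, int]:
    """Return (score, maximum), assuming full marks per solved problem and zero otherwise."""
    return solved * POINTS_PER_PROBLEM, NUM_PROBLEMS * POINTS_PER_PROBLEM


score, maximum = imo_score(5)
total_hours = SESSION_HOURS * NUM_SESSIONS
print(f"{score}/{maximum} points in {total_hours} hours")
```

Under those assumptions, five complete solutions account for exactly the 35 of 42 points OpenAI reports, produced within nine hours of total working time.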

OpenAI claims the model wasn’t specifically trained on IMO tasks. Instead, it was developed as a general-purpose reasoning model, drawing on recent advances in reinforcement learning and using substantial compute during inference. Researcher Alexander Wei emphasized in an X post that this was not a task-specific system, but one capable of autonomously generating complex, multi-page proofs. There are hints it might even be a multi-agent system.

Sustained reasoning without tools

What makes this achievement stand out is that the model reasoned consistently for hours at a time without any symbolic tools like code interpreters or mathematical software. That sets it apart from other high-performing systems such as DeepMind’s AlphaProof, which rely on hybrid neuro-symbolic approaches.

Until recently, it was widely believed that language models couldn’t sustain consistent mathematical reasoning over long sessions. As recently as June, mathematician Terence Tao said on the Lex Fridman Podcast that IMO-level problems were too difficult for AI to solve in real time. “You can’t hire enough humans to grade those,” Tao said, referring to the labor-intensive verification of long proofs in reinforcement learning training.

The result came as a surprise, even to prediction markets, which put the odds of an AI winning IMO gold before the end of 2025 at under 20 percent. (These forecasts used slightly stricter criteria.)

Both the markets and Tao seemed to assume that a reasoning model like o3 would need to be trained explicitly for IMO proofs, receiving expert feedback at every step. OpenAI, however, appears to have found a more general method for eliciting this behavior. Wei also highlighted that the model wasn’t tailored for the task, but instead was a generalist reasoning system.

OpenAI researcher Jerry Tworek says the reinforcement learning system used here also helped train ChatGPT Agent and the model that recently took second place at the Heuristics World Finals on AtCoder, where it generated code non-stop for nearly ten hours.

Transparency questions

As usual, OpenAI’s claims have sparked criticism. Gary Marcus called the achievement impressive but raised a list of questions in an X post: How is the model architecturally different from its predecessors? What were the costs per problem? Was the model trained on raw text or preprocessed data? And how transferable are these results to other scientific domains? So far, OpenAI has kept all those details under wraps.

OpenAI has faced similar criticism before, notably for a lack of transparency around the ARC-AGI benchmark test. The ARC Prize Foundation found that the final o3 model performed worse than a previously tested preview version. And only after OpenAI had posted a record result on the supposedly independent FrontierMath benchmark did it come to light that the company had funded it.

A scalable approach to reasoning?

In a recent essay, “How o3 and Grok 4 accidentally vindicated neurosymbolic AI,” Marcus argued that modern AI models are increasingly relying on symbolic tools like code interpreters to overcome the limits of pure language models.

OpenAI’s IMO system, on the other hand, worked entirely in text – no tools – which, if the results hold up, would be a notable exception. If the model’s ability to generalize is confirmed, it could call Marcus’s thesis into question, at least in part. Still, his main criticism remains: without methodological transparency, it’s hard to interpret these achievements.

OpenAI seems to have built a language model that can reason consistently for hours without any external tools. That would have been hard to imagine just a short time ago. The generalist reasoning approach appears to scale, at least for now. According to OpenAI, the next step is reasoning sessions that last several days.
