Key Points
- British mathematician Timothy Gowers used OpenAI’s ChatGPT 5.5 Pro model to tackle open problems in number theory, with the AI producing complete scientific papers in under two hours, without any mathematical guidance from Gowers himself.
- According to Gowers, the AI’s output reached “PhD-level” and managed to improve upon existing mathematical bounds, demonstrating a remarkable degree of independent mathematical reasoning.
- Isaac Rajagopal, a young researcher whose earlier work the model built on, called its key idea “completely original,” the kind of idea he says he would be proud to come up with after a week or two of pondering.
Fields Medalist Timothy Gowers writes on his blog that ChatGPT 5.5 Pro has produced a piece of doctoral-level mathematical research, and that his own mathematical contribution was zero. The model did all the work in under two hours. “I didn’t even do anything clever with the prompts,” Gowers writes.
The mathematician, who holds the combinatorics chair at the Collège de France and is a Fellow of Trinity College, Cambridge, fed the model open problems from a paper by number theorist Mel Nathanson. The paper investigates the possible sizes of certain sets of integer sums and how efficiently sets with prescribed properties can be constructed.
ChatGPT 5.5 Pro cracked an open math problem in 17 minutes
Nathanson had proved an exponential bound for one of the problems and asked whether it could be improved. According to Gowers, ChatGPT 5.5 Pro thought for 17 minutes and 5 seconds, then delivered a construction with a quadratic bound, the best possible result. The core idea: the model swapped out a component of Nathanson’s proof for a more efficient variant that is well known in combinatorics but whose application to this particular problem wasn’t obvious.
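Neither the blog post nor this article spells out Nathanson’s exact statement, so the comparison below is only a schematic sketch of what such an improvement means; the quantity f(n) and the constants c, C > 1 are hypothetical placeholders, not notation from the paper:

\[
  f(n) \;\le\; C \cdot c^{\,n} \ \text{(exponential)}
  \qquad \longrightarrow \qquad
  f(n) \;\le\; C \cdot n^{2} \ \text{(quadratic)}
\]

The gap is dramatic: at n = 100 with c = 2, an exponential bound of the first form only guarantees a value below 2^100, roughly 10^30, while the quadratic form guarantees 10^4. “Best possible” here means the quadratic growth rate cannot be improved any further.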
When asked, ChatGPT rewrote the argument as a LaTeX preprint in 2 minutes and 23 seconds. Gowers checked it for correctness, then had the model solve a related variant, which it handled without any issues. Both results are available as a preprint.
A generalized version of the problem proved much harder. Here there was prior work by Isaac Rajagopal, an MIT student who had established an exponential bound. Gowers gave ChatGPT Rajagopal’s paper and asked for an improvement.
What followed was a gradual escalation: after 16 minutes and 41 seconds, the model delivered a first improvement. Rajagopal judged this step correct but called it a routine modification of his own work. Gowers then got, as he puts it, “greedy” and asked ChatGPT to try for a much stronger bound.
After 13 minutes and 33 seconds, the model reported optimism but said two technical statements still needed checking. Nine minutes and 12 seconds later, the check was done. The finished preprint was ready in 31 minutes and 40 seconds. The model had improved the bound from exponential to polynomial.
“The sort of idea I would be very proud to come up with after a week or two of pondering”
According to Gowers, Rajagopal declared the results “almost certainly correct,” both at the level of the individual proof steps and of the underlying ideas.
Rajagopal’s assessment is nuanced: the first improvement was a “routine modification” of his own work. The improvement to a polynomial bound, though, was “quite impressive.”
Rajagopal calls the model’s key idea “quite ingenious.” It found a counterintuitive way to compress certain algebraic structures so they fit into a much smaller number range without losing their crucial combinatorial properties.
“It is the sort of idea I would be very proud to come up with after a week or two of pondering, and it took ChatGPT less than an hour to find and prove, using similar methods to those in my own proof,” Rajagopal writes. As far as he could tell, the idea was “completely original.”
The bar for mathematicians is now proving what LLMs cannot prove
Gowers puts the result at the level of “a perfectly reasonable chapter in a combinatorics PhD.” It’s not an “amazing result,” he says, since it builds heavily on Rajagopal’s ideas, but it is “definitely a non-trivial extension.” A PhD student, Gowers adds, would have needed considerable time to work through Rajagopal’s paper, identify its weaknesses, and adapt the techniques.
He draws far-reaching conclusions: “The lower bound for contributing to mathematics will now be to prove something that LLMs can’t prove, rather than simply to prove something that nobody has proved up to now and that at least somebody finds interesting.” He does qualify this, though: PhD students could use LLMs as a tool. The real task will then be to create something in collaboration with LLMs that the models can’t do alone.
Gowers poses a thought experiment: “Suppose that a mathematician solved a major problem by having a long exchange with an LLM in which the mathematician played a useful guiding role but the LLM did all the technical work and had the main ideas. Would we regard that as a major achievement of the mathematician? I don’t think we would.”
Still, he sees value in the struggle of doing math yourself. Those who have solved difficult problems on their own gain insights into the problem-solving process that can’t be picked up from reading alone. “Just as very good coders are better at vibe coding than not such good coders,” Gowers writes, experienced problem-solvers will get more out of the models. His prediction: anyone starting a doctorate today and finishing in 2029 at the earliest will see mathematical research “changed out of all recognition” by then.
This echoes the vision of star mathematician Terence Tao, who described an “industrial-scale mathematics” powered by AI tools, where large teams with AI support conduct broad-based research instead of lone wolves working on narrow problems for years.
At the time, though, Tao compared AI models to “mediocre, but not completely incompetent” research assistants. Gowers’ experience with ChatGPT 5.5 Pro suggests that assessment may already be outdated. Tao’s latest comments have also been far more positive.
Generative AI keeps pushing deeper into mathematics
An early example of AI in mathematical research was the use of GPT-5 as a research tool: OpenAI researchers claimed a GPT model had “found” the solution to an Erdős problem. In reality, the AI had merely tracked down an existing solution in the literature rather than developing its own proof.
A clear leap came when GPT-5.2 Pro solved Erdős problem #728 “more or less autonomously,” according to Tao; no corresponding solution could be found in the existing literature. GPT-5.4 Pro then went further, solving another long-standing open Erdős problem.
Progress showed up in other fields, too. In December 2025, a physicist published a paper whose central idea came from GPT-5. The author expects hybrid human-AI collaborations to become standard in mathematics, physics, and other formal sciences before long. As large language models grow more precise, they could increasingly function as autonomous research agents.
Why jumping to conclusions is risky
Google DeepMind has seen both breakthroughs and sobering failure rates with its AI agent Aletheia. The system, built on Gemini Deep Think, independently wrote a math paper, disproved a decades-old conjecture, and uncovered an error in a cryptography paper. But when researchers systematically tested it on 700 open math problems, only 6.5 percent of its answers turned out to be usable.
Tao has been making a similar point for some time: Erdős problems vary in difficulty by “several orders of magnitude,” he notes. Just because a problem is 50 years old and an AI solves it doesn’t mean it resisted human efforts for half a century. Often, no one had seriously tackled it.