A Stanford professor has spent the past year testing the same unsolved math problem on OpenAI’s models, unintentionally tracking their progress in self-assessment along the way.
“I’ve actually been emailing with the Stanford mathematics professor. He emailed me about a year ago before we announced o1 and said, ‘Hey, do you want to do a collaboration on solving hard math problems?’ Basically, I told him I think we just have to advance general reasoning capabilities, and eventually they’re going to be able to help you with your hard math problems. I think that’s actually the most promising route to getting there. He was a little skeptical, but with every model release, every reasoning model release, he emails me with a follow-up and asks, ‘Can it solve this problem now?’ I plug them in and send him the output, and he says, ‘Yeah, that’s wrong,’” recalls Noam Brown of OpenAI.
But after OpenAI’s recent breakthrough at the International Mathematical Olympiad, something important has changed: “He emailed me a follow-up this time with the same problem, asking, ‘Hey, can it solve it now?’ It still can’t solve it, but at least this time it recognizes that it can’t solve it, so I think that’s a big step.” Instead of hallucinating, the model simply said “no answer” to this year’s hardest IMO problem. As Brown puts it, “I think it was good to see the model doesn’t try to hallucinate or just make up some solution, but instead will say ‘no answer.’”
This accidental long-term study reveals a kind of progress that standard benchmarks have missed: the models might be getting a little better at recognizing their own limitations, rather than generating confident but wrong answers.
A Spanish research team draws a similar conclusion from the much-discussed results of Apple’s reasoning study. There, too, reasoning models like o3 stopped generating output prematurely. While Apple’s researchers read this as a simple failure, the Spanish team argues it’s evidence of a learned strategy: the models recognize they’ve hit a wall and stop.
It will likely be some time before we’re fully protected from AI-generated bullshit. OpenAI plans to make its IMO model available to mathematicians for testing, but the core improvements behind this progress are not expected to appear in commercial models for several more months.