Alibaba’s Qwen2.5 only excels at math thanks to memorized training data

by The Decoder
20 July 2025
Summary

A new study finds that Alibaba’s Qwen2.5 models achieve high math scores mainly by memorizing training data rather than through genuine reasoning.

Researchers discovered that what appears to be progress in mathematical reasoning is largely due to data contamination. When tested on “clean” benchmarks that the model had not seen during training, Qwen2.5’s performance dropped sharply.

To test this, the team gave Qwen2.5 only the first 60 percent of each problem from the MATH-500 benchmark and asked it to complete the rest. Qwen2.5-Math-7B reconstructed the missing 40 percent with 54.6 percent accuracy and answered correctly 53.6 percent of the time; Llama3.1-8B managed just 3.8 and 2.4 percent, respectively. This suggests Qwen2.5 had already encountered these problems during training.

Figure: EM and ROUGE-L results for three models on six math datasets at 80, 60, and 40 percent prompt lengths. Qwen2.5-Math-7B can accurately reconstruct missing sections of MATH-500 benchmark problems, indicating it has likely seen the data before. | Image: Wu et al.
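
The probe itself is simple to picture in code. Below is a minimal sketch assuming a Hugging Face-style causal LM; the helper names and the naive prefix-match check (a stand-in for the EM and ROUGE-L scoring the study actually uses) are my own illustration, not the authors' code.

```python
# Minimal sketch of a completion-based contamination probe.
# Assumptions: transformers is installed and the model fits in memory;
# the prefix-match check below stands in for the study's EM/ROUGE-L scoring.
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Qwen/Qwen2.5-Math-7B"  # illustrative; any causal LM works here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def completion_probe(problem: str, visible_fraction: float = 0.6):
    """Show the model the first `visible_fraction` of a problem and ask it
    to continue; return (generated continuation, true hidden remainder)."""
    ids = tokenizer.encode(problem)
    cut = int(len(ids) * visible_fraction)
    prefix, hidden = tokenizer.decode(ids[:cut]), tokenizer.decode(ids[cut:])
    inputs = tokenizer(prefix, return_tensors="pt").to(model.device)
    out = model.generate(**inputs, do_sample=False,  # greedy decoding
                         max_new_tokens=len(ids) - cut + 16)
    generated = tokenizer.decode(out[0][inputs["input_ids"].shape[1]:],
                                 skip_special_tokens=True)
    return generated, hidden

def reproduces_hidden_part(generated: str, hidden: str) -> bool:
    # A model that memorized the benchmark reproduces the unseen remainder
    # nearly token for token; an uncontaminated model cannot.
    return generated.strip().startswith(hidden.strip())
```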

The researchers then tested the model on LiveMathBench (version 202505), a “clean” benchmark released after Qwen2.5. On this dataset, Qwen2.5’s completion rate dropped to zero, matching Llama’s, and its answer accuracy fell to just two percent.

The likely reason is that Qwen2.5 was pre-trained on large online datasets, including GitHub repositories containing benchmark problems and their solutions. As a result, even random or incorrect reward signals during training could improve its results on MATH-500 because of its prior exposure to the data.
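
A standard way to quantify that kind of exposure is verbatim n-gram overlap between benchmark problems and the training corpus. The sketch below is a generic contamination heuristic of that kind, not the paper's method; the 13-gram window follows common decontamination practice (GPT-3's training-data filtering, for example, used 13-gram matching).

```python
def ngram_overlap(benchmark_text: str, corpus_docs: list[str], n: int = 13) -> float:
    """Fraction of the benchmark's word n-grams found verbatim in the corpus.
    A generic contamination heuristic; high overlap suggests the benchmark
    leaked into the training data."""
    def ngrams(text: str) -> set:
        toks = text.split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}

    bench = ngrams(benchmark_text)
    if not bench:
        return 0.0
    corpus = set()
    for doc in corpus_docs:
        corpus |= ngrams(doc)
    return len(bench & corpus) / len(bench)
```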

Figure: MATH-500 accuracy of Qwen2.5 and Llama-3.1 models under greedy and Average@16 decoding, with and without a response template. Qwen2.5 models show steep performance drops on MATH-500 when response templates are changed, while Llama-3.1-8B remains almost unaffected. | Image: Wu et al.

To address this, the team created the RandomCalculation dataset, containing fully synthetic arithmetic problems generated after Qwen2.5’s release. On these new problems, Qwen2.5’s accuracy declined as problem complexity increased. Only correct reward signals improved performance, while random rewards made training unstable and inverted rewards degraded math skills.

Figure: accuracy versus number of calculation steps for Qwen2.5-Math-7B and -7B-Instruct, with and without template, under greedy and Avg@16 decoding. All Qwen2.5 variants lose accuracy as the number of calculation steps increases on synthetic math problems. | Image: Wu et al.
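
The article doesn't reproduce the RandomCalculation data, but the core idea (multi-step arithmetic generated fresh, so it cannot exist in any pre-training corpus) can be sketched roughly as follows. The generator below is an illustrative stand-in, not the authors' code.

```python
import random

def random_calculation(steps: int, seed=None):
    """Generate one synthetic multi-step arithmetic problem and its answer.
    Illustrative stand-in for the paper's RandomCalculation dataset."""
    rng = random.Random(seed)
    value = rng.randint(1, 100)
    expr = str(value)
    for _ in range(steps):  # each step adds one more operation to carry out
        op = rng.choice(["+", "-", "*"])
        operand = rng.randint(1, 100)
        expr = f"({expr} {op} {operand})"
        value = (value + operand if op == "+" else
                 value - operand if op == "-" else value * operand)
    return f"Compute {expr}.", value

# Freshly generated, so a correct answer requires computation, not recall.
question, answer = random_calculation(steps=5, seed=42)
```

Because every problem is new, accuracy on this data reflects actual calculation ability, which is why it falls off as the number of steps grows.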

Controlled RLVR (Reinforcement Learning with Verifiable Rewards) experiments confirmed these results: only correct rewards led to stable improvement, while random or inverted rewards failed to boost performance or actively degraded it.
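
Since the reward in RLVR is just a programmatic check on the final answer, the three experimental conditions reduce to three tiny reward functions, sketched below under my own naming (this is not the paper's training code):

```python
import random

def correct_reward(predicted: str, ground_truth: str) -> float:
    # Standard verifiable reward: 1 if the final answer matches, else 0.
    return 1.0 if predicted.strip() == ground_truth.strip() else 0.0

def random_reward(predicted: str, ground_truth: str) -> float:
    # Ignores the answer entirely. On a contaminated benchmark this can still
    # "lift" scores by surfacing memorized solutions; on clean data it only
    # destabilizes training.
    return float(random.random() < 0.5)

def inverted_reward(predicted: str, ground_truth: str) -> float:
    # Rewards wrong answers; on clean data this actively degrades math skill.
    return 1.0 - correct_reward(predicted, ground_truth)
```

On an uncontaminated model, only the first of these carries a real learning signal.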

These findings call into question the idea that Qwen2.5’s math abilities reflect real reasoning. Instead, the results show the model relies heavily on memorized data.

Alibaba launched Qwen2.5 in September 2024, followed by the Qwen3 series. Whether these findings also apply to Qwen3 remains to be seen.

The study’s authors warn that contaminated benchmarks can lead to misleading conclusions about AI progress. They recommend future research rely on clean, uncontaminated benchmarks and evaluate multiple model series for more reliable results.

Benchmark gaming isn’t new

The results highlight how difficult it is to separate true reasoning from memorization in large language models, and why rigorous, clean evaluation methods are essential for trustworthy AI research. Previous work has shown that benchmarks can be manipulated or “gamed.”

For example, Meta submitted a version of Llama 4 specifically tuned to perform well on the LMArena benchmark by using customized response formats. Other studies show that models like Gemini 2.5 Pro and Claude 3.5 Sonnet can identify test scenarios with up to 95 percent accuracy and adjust their responses, raising even broader questions about the validity of current evaluation methods.

Read the full article on The-Decoder.com