Metaverse Media Group
A Chinese firm has just launched a constantly changing set of AI benchmarks

by Technology Review
7 July 2025

When testing an AI model, it’s hard to tell if it is reasoning or just regurgitating answers from its training data. Xbench, a new benchmark developed by the Chinese venture capital firm HSG, or HongShan Capital Group, might help to sidestep that issue. That’s thanks to the way it evaluates models not only on the ability to pass arbitrary tests, like most other benchmarks, but also on the ability to execute real-world tasks, which is more unusual. It will be updated on a regular basis to try to keep it evergreen. 

This week the company is making part of its question set open-source and letting anyone use it for free. The team has also released a leaderboard comparing how mainstream AI models stack up when tested on Xbench. (ChatGPT o3 ranked first across all categories, though ByteDance’s Doubao, Gemini 2.5 Pro, and Grok all still did pretty well, as did Claude Sonnet.)

Development of the benchmark at HongShan began in 2022, following ChatGPT’s breakout success, as an internal tool for assessing which models are worth investing in. Since then, led by partner Gong Yuan, the team has steadily expanded the system, bringing in outside researchers and professionals to help refine it. As the project grew more sophisticated, they decided to release it to the public.

Xbench approaches the problem with two different systems. One is similar to traditional benchmarking: an academic test that gauges a model’s aptitude on various subjects. The other is more like a technical interview round for a job, assessing how much real-world economic value a model might deliver.

Xbench’s methods for assessing raw intelligence currently include two components: Xbench-ScienceQA and Xbench-DeepResearch. ScienceQA isn’t a radical departure from existing postgraduate-level STEM benchmarks like GPQA and SuperGPQA. It includes questions spanning fields from biochemistry to orbital mechanics, drafted by graduate students and double-checked by professors. Scoring rewards not only the right answer but also the reasoning chain that leads to it.

DeepResearch, by contrast, focuses on a model’s ability to navigate the Chinese-language web. Ten subject-matter experts created 100 questions in music, history, finance, and literature—questions that can’t just be googled but require significant research to answer. Scoring favors breadth of sources, factual consistency, and a model’s willingness to admit when there isn’t enough data. A question in the publicized collection is “How many Chinese cities in the three northwestern provinces border a foreign country?” (It’s 12, and only 33% of models tested got it right, if you are wondering.)

On the company’s website, the researchers said they want to add more dimensions to the test—for example, aspects like how creative a model is in its problem solving, how collaborative it is when working with other models, and how reliable it is.

The team has committed to updating the test questions once a quarter and to maintaining a half-public, half-private data set.
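To make the half-public, half-private idea concrete, here is a minimal Python sketch of one way such a split could be produced. It is not Xbench's actual procedure, which the article does not describe. The function name `split_questions`, the `salt` value, and the question IDs are all hypothetical, chosen only for illustration.

```python
import hashlib


def split_questions(question_ids, salt="2025-Q3"):
    """Split question IDs into a public half and a private half.

    Hypothetical helper: the salt stands in for a per-quarter label, so
    changing it each quarter reshuffles which questions land in each half.
    """
    # Order IDs by a salted SHA-256 digest: stable across runs, but
    # unrelated to the order in which the questions were written.
    ranked = sorted(
        question_ids,
        key=lambda qid: hashlib.sha256(f"{salt}:{qid}".encode("utf-8")).hexdigest(),
    )
    midpoint = len(ranked) // 2
    return ranked[:midpoint], ranked[midpoint:]


# Illustrative IDs only; the real question set is not reproduced here.
question_ids = [f"q{i:03d}" for i in range(100)]
public_ids, private_ids = split_questions(question_ids)
```

Keeping one half private means a model cannot score well on it simply by having seen the published questions during training, which is the contamination problem the benchmark is trying to sidestep.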

To assess models’ real-world readiness, the team worked with experts to develop tasks modeled on actual workflows, initially in recruitment and marketing. For example, one task asks a model to source five qualified battery engineer candidates and justify each pick. Another asks it to match advertisers with appropriate short-video creators from a pool of over 800 influencers.

The website also teases upcoming categories, including finance, legal, accounting, and design. The question sets for these categories have not yet been open-sourced.

ChatGPT o3 again ranks first in both of the current professional categories. For recruiting, Perplexity Search and Claude 3.5 Sonnet take second and third place, respectively. For marketing, Claude, Grok, and Gemini all perform well.

“It is really difficult for benchmarks to include things that are so hard to quantify,” says Zihan Zheng, the lead researcher on a new benchmark called LiveCodeBench Pro and a student at NYU. “But Xbench represents a promising start.”

Read the full article on TechnologyReview.com
© 2022 Metaverse Media Group – The Metaverse Mecca
