Metaverse Media Group
Researchers used 1,600 YouTube fail videos to show AI models struggle with surprises

by The Decoder
13 July 2025

YouTube fail videos reveal a major blind spot for leading AI models: they struggle with surprises and rarely reconsider their first impressions. Even advanced systems like GPT-4o stumble over simple plot twists.

Researchers from the University of British Columbia, the Vector Institute for AI, and Nanyang Technological University put top AI models through their paces using more than 1,600 YouTube fail videos from the Oops! dataset.

The team created a new benchmark called BlackSwanSuite to test how well these systems handle unexpected events. Like people, the AI models are fooled by surprising moments—but unlike people, they refuse to change their minds, even after seeing what really happened.

One example: a man swings a pillow near a Christmas tree. The AI assumes he’s aiming at someone nearby. In reality, the pillow knocks ornaments off the tree, which then hit a woman. Even after watching the whole video, the AI sticks to its original, incorrect guess.

The videos span a range of categories, with most featuring traffic accidents (24 percent), children’s mishaps (24 percent), or pool accidents (16 percent). What unites them all is an unpredictable twist that even people often miss.

Three types of tasks

Each video is split into three segments: the setup, the surprise, and the aftermath. The benchmark poses a different task for each stage. In the “Forecaster” task, the model sees only the start of the video and must predict what comes next. The “Detective” task shows only the beginning and end and asks the model to explain what happened in between. The “Reporter” task shows the full video and checks whether the model can update its assumptions after seeing the whole story.

Diagram of the Forecaster, Detective, and Reporter video tasks with pre-, main, and post-event phases and hidden video segments
The benchmark includes 15,469 questions across all three video-based tasks. | Image: Chinchure et al.
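
The three task setups can be sketched in a few lines of Python. This is only an illustration of which segments each task reveals, based on the description above; the names `FailVideo`, `VISIBLE_SEGMENTS`, and `visible_clips`, as well as the file names, are hypothetical and do not reflect the benchmark's actual data format or code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailVideo:
    """A fail video split into three segments (hypothetical structure)."""
    pre: str   # setup, before the surprise
    main: str  # the surprising event itself
    post: str  # aftermath


# Which segments the model is shown in each task, per the article's description.
VISIBLE_SEGMENTS = {
    "forecaster": ("pre",),               # predict what happens next
    "detective": ("pre", "post"),         # explain the hidden middle
    "reporter": ("pre", "main", "post"),  # revise assumptions with full context
}


def visible_clips(video: FailVideo, task: str) -> list[str]:
    """Return the clips a model would be shown for the given task."""
    return [getattr(video, name) for name in VISIBLE_SEGMENTS[task]]


# Placeholder file names, for illustration only.
video = FailVideo(pre="pre.mp4", main="main.mp4", post="post.mp4")
for task in VISIBLE_SEGMENTS:
    print(task, visible_clips(video, task))
```

In this sketch, the surprise segment is hidden in the Forecaster and Detective tasks, so only the Reporter task can reveal whether a model revises its first impression once it has seen everything.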

The tests covered closed models like GPT-4o and Gemini 1.5 Pro as well as open-source systems such as LLaVA-Video, VILA, VideoChat2, and VideoLLaMA 2. The results highlight glaring weaknesses. On the detective task, GPT-4o answered correctly just 65 percent of the time. By comparison, humans got 90 percent right.

Table with MCQ and yes/no values for detective and reporter tasks performed by GPT-4o, Gemini, open-source models, and humans
The table compares closed and open models with human performance on multiple-choice and yes/no versions of the detective and reporter tasks. | Image: Chinchure et al.

The gap widened even further when models needed to reconsider their initial guesses. When asked to revisit their predictions after seeing the entire video, GPT-4o managed only 60 percent accuracy – 32 percentage points behind humans (92 percent). The systems tended to double down on their first impressions, ignoring new evidence.
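
The gaps quoted here are plain differences between human and GPT-4o accuracy. A minimal sketch of that arithmetic, using only the percentages reported in this article (the dictionary layout and the function name `gap_to_humans` are illustrative assumptions):

```python
# Accuracy in percent, as reported in the article for GPT-4o and humans.
ACCURACY = {
    "detective": {"gpt4o": 65, "human": 90},
    "reporter": {"gpt4o": 60, "human": 92},
}


def gap_to_humans(task: str) -> int:
    """Percentage-point gap between human and GPT-4o accuracy on a task."""
    scores = ACCURACY[task]
    return scores["human"] - scores["gpt4o"]


for task in ACCURACY:
    print(f"{task}: {gap_to_humans(task)} percentage points behind humans")
```

The detective gap comes out to 25 percentage points and the reporter gap to 32, which is why the reporter task is described as the wider of the two.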

Other models, like Gemini 1.5 Pro and LLaVA-Video, showed the same pattern. According to the researchers, performance dropped sharply on videos that even people found tricky the first time through.

Garbage trucks don’t drop trees, do they?

The root of the problem lies in how these AI models are trained. They learn by spotting patterns in millions of videos and expect those patterns to repeat. So when a garbage truck drops a tree instead of picking up trash, the AI gets confused—it has no pattern for that.

Screenshot with sequence images (V_pre, V_main, V_post), question, and GPT-4o reasoning, which incorrectly selects option A instead of the correct option B.
GPT-4o follows its initial hunches and picks the wrong answer. | Image: Chinchure et al.

To pinpoint the issue, the team tried swapping out the AI’s video perception for detailed human-written descriptions of the scenes. This boosted LLaVA-Video’s performance by 6.4 percent. Adding even more explanations bumped it up by another 3.6 percent, for a total gain of 10 percent.

Ironically, this only underscores the models’ weakness: If the AI performs well only when humans do the perceptual heavy lifting, it fails at “seeing” and “understanding” before any real reasoning starts.

Humans, by contrast, are quick to rethink their assumptions when new information appears. Current AI models lack this mental flexibility.

This flaw could have serious consequences for real-world applications like self-driving cars and autonomous systems. Life is full of surprises: children dash into the street, objects fall off trucks, and other drivers do the unexpected.

The research team has made the benchmark available on GitHub and Hugging Face. They hope others will use it to test and improve their own AI models. As long as leading systems are tripped up by simple fail videos, they’re not ready for the unpredictability of the real world.

Read the full article on The-Decoder.com