Metaverse Media Group
Yet another study finds that overloading LLMs with information leads to worse results

by The Decoder
21 July 2025

Summary

Large language models are supposed to handle millions of tokens – the fragments of words and characters that make up their inputs – at once. But the longer the context, the worse their performance gets.

That’s the takeaway from a new study by Chroma Research. Chroma, which makes a vector database for AI applications, stands to benefit when models need help pulling in information from outside sources. Still, the scale and methodology of the study make it noteworthy: the researchers tested 18 leading AI models, including GPT, Claude, Gemini, and Qwen, across four task types, including semantic search, repetition challenges, and question-answering over lengthy documents.

Beyond word matching

The research builds on the familiar “needle in a haystack” benchmark, where a model must pick out a specific sentence hidden inside a long block of irrelevant text. The Chroma team criticized this test for only measuring literal string matching, so they modified the test to require true semantic understanding.

Specifically, they moved beyond simple keyword recognition in two key ways. First, instead of asking a question that used the same words as the hidden sentence, they posed questions that were only semantically related. For example, in a setup inspired by the NoLiMa benchmark, a model might be asked “Which character has been to Helsinki?” when the text only states that “Yuki lives next to the Kiasma museum.” To answer, the model must make an inference based on world knowledge, not just keyword matching.
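In code, the modified setup might look something like the following sketch. This is an illustrative harness, not Chroma's actual test code; the filler text, the `depth` parameter, and the model call are placeholders:

```python
def build_haystack(needle: str, filler: list[str], depth: float) -> str:
    """Embed a 'needle' sentence at a relative depth inside filler text."""
    idx = int(len(filler) * depth)
    return " ".join(filler[:idx] + [needle] + filler[idx:])

needle = "Yuki lives next to the Kiasma museum."

# Lexical variant: the question reuses the needle's exact words.
lexical_question = "Who lives next to the Kiasma museum?"

# Semantic variant: answering requires world knowledge (Kiasma is in
# Helsinki), so literal string matching is not enough.
semantic_question = "Which character has been to Helsinki?"

filler = [f"This is unrelated filler sentence number {i}." for i in range(1000)]
prompt = build_haystack(needle, filler, depth=0.5) + "\n\nQuestion: " + semantic_question
# Each model under test would receive `prompt` and be scored on whether its
# answer names Yuki; context length scales with the amount of filler.
```

Scaling the filler up and down is what lets the researchers measure how accuracy changes with context length.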


The models found this much more difficult; performance dropped sharply on these semantic questions, and the problem grew worse as the context got longer.

Second, the study looked at distractors: statements similar in content but incorrect. Adding even a single distractor noticeably reduced success rates, with different impacts depending on the distractor. With four distractors, the effect was even stronger. Claude models often refused to answer, while GPT models tended to give wrong but plausible-sounding responses.
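The distractor condition can be sketched the same way. The distractor sentences below are invented examples in the spirit of the study, not the actual test items:

```python
import random

def build_context(filler: list[str], needle: str,
                  distractors: list[str], n: int, seed: int = 0) -> str:
    """Scatter the needle and n distractor sentences through the filler."""
    rng = random.Random(seed)
    sentences = filler + [needle] + distractors[:n]
    rng.shuffle(sentences)
    return " ".join(sentences)

needle = "Yuki lives next to the Kiasma museum."
# Topically similar statements that do not answer the question.
distractors = [
    "Yuki thought about visiting the Kiasma museum, but never went.",
    "Kenji lives next to a famous museum in Oslo.",
    "Yuki's sister once lived near a gallery in Paris.",
    "A museum near Yuki's old school closed years ago.",
]
filler = [f"Unrelated sentence number {i}." for i in range(200)]

# Conditions analogous to the study's: no distractors, one, then four.
for n in (0, 1, 4):
    context = build_context(filler, needle, distractors, n)
    # context + question goes to each model; accuracy falls as n grows.
```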

Image: Hong et al.

Structure matters (but not how you’d expect)

Structure also played a surprising role. Models actually did better when the sentences in a text were randomly mixed, compared to texts organized in a logical order. The reasons aren’t clear, but the study found that context structure, not just content, is a major factor for model performance.

The researchers also tested more practical scenarios using LongMemEval, a benchmark with chat histories over 100,000 tokens long. The same pattern appeared: performance fell when models had to work with the full conversation history, compared to when they were given only the relevant sections.

The study’s recommendation: use targeted “context engineering” – picking and arranging the most relevant information in a prompt – to help large language models stay reliable in real-world scenarios. Full results are available on Chroma Research, and a toolkit for replicating the results is available for download on GitHub.
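As a rough illustration of the idea, the sketch below uses a toy word-overlap score standing in for real embedding search; the names and chat snippets are made up:

```python
def relevance(query: str, chunk: str) -> float:
    """Toy score: word overlap. A real system would use embeddings."""
    q, c = set(query.lower().split()), set(chunk.lower().split())
    return len(q & c) / max(len(q), 1)

def engineer_context(query: str, chunks: list[str], top_k: int = 2) -> str:
    """Keep only the top_k most relevant chunks instead of everything."""
    ranked = sorted(chunks, key=lambda ch: relevance(query, ch), reverse=True)
    return "\n".join(ranked[:top_k])

history = [
    "User prefers window seats on long flights.",
    "User discussed a sourdough starter recipe at length.",
    "User is traveling to Helsinki in March.",
    "User asked about Python decorators last week.",
]
context = engineer_context("plan the trip to Helsinki", history, top_k=2)
# Only the most relevant chunks reach the prompt; the rest is trimmed,
# keeping the model's attention focused.
```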


Other labs find similar problems

Chroma’s results line up with findings from other research groups. In May 2025, Nikolay Savinov of Google DeepMind explained that when a model receives a large number of tokens, it has to divide its attention across the entire input. Trimming irrelevant content and keeping the context focused is therefore always worthwhile: concentrating attention on what matters helps the model perform better.

A study from LMU Munich and Adobe Research found much the same thing. On the NoLiMa benchmark, which avoids literal keyword matches, even reasoning-focused models suffered major performance drops as context length increased.

Microsoft and Salesforce reported similar instability in longer conversations. In multi-turn dialogues where users spell out their requirements step by step, accuracy rates fell from 90 percent all the way down to 51 percent.

One of the most striking examples is Meta’s Llama 4 Maverick. While Maverick can technically handle up to ten million tokens, it struggles to make meaningful use of that capacity. In a benchmark designed to reflect real-world scenarios, Maverick achieved just 28.1 percent accuracy with 128,000 tokens – far below its technical maximum and well under the average for current models. In these tests, OpenAI’s o3 and Gemini 2.5 currently deliver the strongest results.

Read the full article on The-Decoder.com


© 2022 Metaverse Media Group – The Metaverse Mecca