The infinite context illusion in LLMs

April 29, 2025 · 3 minute read

Large language models with very large context windows are impressive. A model that can accept hundreds of thousands, or even millions, of tokens opens up workflows that were not practical before.

But bigger does not automatically mean better.

It is tempting to treat a long context window as a place to dump everything: notes, documents, logs, drafts, screenshots, and half-related background information. The expectation is that the model will absorb all of it and use the right parts at the right time.

That assumption breaks down quickly. Models can lose precision when the context gets crowded, especially when the prompt contains similar, redundant, or competing pieces of information. The model may still respond fluently, but it can miss a detail buried deep in the prompt or combine the wrong pieces of context.

Why large context is tricky

Longer prompts create a few practical problems:

  • Retrieval gets harder: The model has to find the right detail inside a larger pool of information.
  • Similar items compete: If the prompt contains many near-duplicate facts or requests, the model can mix them up.
  • Costs increase: Large context windows can make API usage more expensive, especially when every request carries the same bulky context.
  • Confidence becomes misleading: A fluent answer can make it look like the model used all of the context correctly, even when it did not.

Benchmarks that show the challenge

Two examples help explain the problem.

Fiction.liveBench: This benchmark measures long-context comprehension, testing how models perform when they must track and reason over large bodies of narrative text.

OpenAI Multi-Request Context Retrieval (MRCR): This benchmark tests whether a model can retrieve and distinguish between many similar requests inside a large context.

Example scenario

The user submits many variations of a creative writing request in the same conversation (e.g., "write a poem about tapirs," "write a short story about tapirs," "write a poem about frogs"). The model is then asked to retrieve one specific instance (e.g., "return the third poem about tapirs").

The challenge is not that the model cannot read the words. The challenge is that subtle variations and repeated patterns make the retrieval task fragile. If the prompt contains many similar poems, stories, and topic variations, the model can return the wrong item while still sounding confident.
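The shape of the task can be sketched in a few lines. The topics and wording below are illustrative, not the benchmark's actual data; the point is that the prompt ends up full of near-duplicate requests competing for retrieval:

```python
# Build an MRCR-style prompt: every (form, topic) pair appears several
# times, so near-duplicate requests compete when one must be retrieved.
forms = ["poem", "short story"]
topics = ["tapirs", "frogs"]

items = []
for copy in range(1, 4):  # three copies of each pair
    for form in forms:
        for topic in topics:
            items.append(f"Write a {form} about {topic}. (instance {copy})")

context = "\n".join(items)
question = "Return the third poem about tapirs."
prompt = f"{context}\n\n{question}"

# Twelve requests in total, three of which match "poem about tapirs".
matches = [it for it in items if it.startswith("Write a poem about tapirs")]
print(len(items), len(matches))  # 12 3
```

A human scanning twelve lines gets this right trivially; scale the same structure to hundreds of instances across a huge window and the retrieval becomes fragile in exactly the way the benchmark measures.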

The psychological trap

Large context windows can create a false sense of safety. They make it feel like the conversation is nearly infinite and that any amount of information can be added without consequences.

In practice, the opposite often happens. The more unfocused the context becomes, the more work the model has to do to identify what matters. A shorter, cleaner prompt with the right context can outperform a much longer prompt filled with loosely related material.

What works better

Instead of relying on brute-force context, it is usually better to:

  • Summarize older conversation state into clear decisions and constraints.
  • Retrieve only the documents or sections that are relevant to the current task.
  • Remove stale assumptions when the task changes.
  • Separate competing requests into smaller prompts.
  • Keep the model's immediate goal explicit.

Conclusion

Large context windows are a powerful capability, but they do not replace context management. The goal is not to maximize the number of tokens in the prompt. The goal is to give the model the information it needs, in a form it can reliably use.

Careful context design is often more effective than simply expanding the window.