The infinite context illusion in LLMs

April 29, 2025 · 3 minute read

Large language models with very large context windows are impressive. A model that can accept hundreds of thousands, or even millions, of tokens opens up workflows that were not practical before.

But bigger does not automatically mean better.

It is tempting to treat a long context window as a place to dump everything: notes, documents, logs, drafts, screenshots, and half-related background information. The expectation is that the model will absorb all of it and use the right parts at the right time.

That assumption breaks down quickly. Models can lose precision when the context gets crowded, especially when the prompt contains similar, redundant, or competing pieces of information. The model may still respond fluently, but it can miss a detail buried deep in the prompt or combine the wrong pieces of context.

Why large context is tricky

Longer prompts create a few practical problems:

  • Retrieval gets harder: The model has to find the right detail inside a larger pool of information.
  • Similar items compete: If the prompt contains many near-duplicate facts or requests, the model can mix them up.
  • Costs increase: Large context windows can make API usage more expensive, especially when every request carries the same bulky context.
  • Confidence becomes misleading: A fluent answer can make it look like the model used all of the context correctly, even when it did not.

Benchmarks that show the challenge

Two examples help explain the problem.

Fiction.liveBench: This benchmark measures long-context comprehension, testing how models perform when they must track and reason over large bodies of narrative text.

OpenAI Multi-Request Context Retrieval (MRCR): This benchmark tests whether a model can retrieve and distinguish between many similar requests inside a large context.

Example scenario

The user submits many variations of a creative writing request in the same conversation (e.g., "write a poem about tapirs," "write a short story about tapirs," "write a poem about frogs"). The model is then asked to retrieve one specific instance (e.g., "return the third poem about tapirs").

The challenge is not that the model cannot read the words. The challenge is that subtle variations and repeated patterns make the retrieval task fragile. If the prompt contains many similar poems, stories, and topic variations, the model can return the wrong item while still sounding confident.
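The shape of the task can be sketched in a few lines. The topics and wording below are illustrative, not the benchmark's actual data; the point is that the prompt ends up full of near-duplicate requests competing for retrieval:

```python
# Build an MRCR-style prompt: every (form, topic) pair appears several
# times, so near-duplicate requests compete when one must be retrieved.
forms = ["poem", "short story"]
topics = ["tapirs", "frogs"]

items = []
for copy in range(1, 4):  # three copies of each pair
    for form in forms:
        for topic in topics:
            items.append(f"Write a {form} about {topic}. (instance {copy})")

context = "\n".join(items)
question = "Return the third poem about tapirs."
prompt = f"{context}\n\n{question}"

# Twelve requests in total, three of which match "poem about tapirs".
matches = [it for it in items if it.startswith("Write a poem about tapirs")]
print(len(items), len(matches))  # 12 3
```

A human scanning twelve lines gets this right trivially; scale the same structure to hundreds of instances across a huge window and the retrieval becomes fragile in exactly the way the benchmark measures.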

The psychological trap

Large context windows can create a false sense of safety. They make it feel like the conversation is nearly infinite and that any amount of information can be added without consequences.

In practice, the opposite often happens. The more unfocused the context becomes, the more work the model has to do to identify what matters. A shorter, cleaner prompt with the right context can outperform a much longer prompt filled with loosely related material.

What works better

Instead of relying on brute-force context, it is usually better to:

  • Summarize older conversation state into clear decisions and constraints.
  • Retrieve only the documents or sections that are relevant to the current task.
  • Remove stale assumptions when the task changes.
  • Separate competing requests into smaller prompts.
  • Keep the model's immediate goal explicit.

Conclusion

Large context windows are a powerful capability, but they do not replace context management. The goal is not to maximize the number of tokens in the prompt. The goal is to give the model the information it needs, in a form it can reliably use.

Careful context design is often more effective than simply expanding the window.