Context Window Impact


Purpose

This benchmarking report evaluates the impact of context window length on GPT-4o’s response consistency. Two configurations are compared:

Version B: GPT-4o with a 30,000-token context window
Version D: GPT-4o with a 124,000-token context window

The goal is to assess whether expanding the context window improves consistency across repeated runs using the same prompts and retrieval setup.

Performance Overview – Model Consistency Check

Metric (rounded)	30K Context (B)	124K Context (D)
Obvious differences, but same meaning	63%	63%
Very slight difference (e.g., word choice)	11%	14%
Identical answers (sources can vary)	26%	21%
One or more answers differ in actual meaning	0%	1%

This comparison shows that increasing the context window from 30,000 to 124,000 tokens did not improve consistency. The 30K version produced identical or near-identical responses slightly more often: combining the “identical” and “very slight difference” rows gives 37% for 30K versus 35% for 124K. Answers that differed in actual meaning were rare in both configurations (0% versus 1%).
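The combined figures can be checked with a few lines of arithmetic. The sketch below copies the rounded percentages from the table; the dictionary keys and the `summarize` helper are names chosen here for illustration and are not part of the benchmark tooling.

```python
# Rounded percentages copied from the table above.
# Key names are illustrative labels, not identifiers from the benchmark.
results = {
    "30K (B)": {
        "obvious_same_meaning": 63,
        "very_slight": 11,
        "identical": 26,
        "meaning_differs": 0,
    },
    "124K (D)": {
        "obvious_same_meaning": 63,
        "very_slight": 14,
        "identical": 21,
        "meaning_differs": 1,
    },
}


def summarize(row):
    """Return combined shares for one configuration, in percentage points."""
    identical_or_near = row["identical"] + row["very_slight"]
    return {
        "identical_or_near": identical_or_near,
        "same_meaning": identical_or_near + row["obvious_same_meaning"],
        "total": sum(row.values()),
    }


summary = {name: summarize(row) for name, row in results.items()}
for name, combined in summary.items():
    print(name, combined)
```

Because the inputs are rounded, the 124K column sums to 99 rather than 100, so a gap of one or two percentage points between the configurations is best read as "roughly equal" rather than as a firm ranking.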

Example

Question: “What are the ESG integration criteria?”

30K: One version used paragraph form; another used bullet points.
124K: Similar shifts in structure and emphasis were observed.
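A structural shift like this is enough to move a pair of answers out of the “identical” category even when the meaning is unchanged. The sketch below shows one possible way to detect only that first category, using an exact match after whitespace normalization. It is an assumption about how such a check could be written, not the method used in this benchmark; the helper names and the sample strings are made up for illustration, and the remaining categories (slight, obvious, or meaning-level differences) would still need human or model judgment.

```python
def normalize(text):
    """Collapse runs of whitespace so spacing differences are ignored."""
    return " ".join(text.split())


def all_identical(answers):
    """Return True if every answer matches exactly after normalization."""
    return len({normalize(answer) for answer in answers}) <= 1


# Placeholder answers, invented for illustration (not benchmark output).
paragraph = "Criterion A, criterion B, and criterion C apply."
bullets = "- Criterion A\n- Criterion B\n- Criterion C"

print(all_identical([paragraph, paragraph]))  # True
print(all_identical([paragraph, bullets]))    # False
```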

Conclusion

Expanding the context window to 124,000 tokens did not enhance output consistency in this benchmark. The 30,000-token setup offered slightly better repeatability in terms of identical and closely matching responses. These results suggest that, for short to medium prompts, increasing the context limit does not necessarily improve stability; its benefits may lie more in handling long, complex documents.

RAG Configuration Details

Context window:
- Version B: 30,000 tokens
- Version D: 124,000 tokens
Chunk relevancy sorting:
Search method:
LLM seed:
Temperature: