Context Window Impact
Purpose
This benchmarking report evaluates the impact of context window length on GPT-4o’s response consistency.
Version B: GPT-4o with a 30,000-token context window
Version D: GPT-4o with a 124,000-token context window
The goal is to assess whether expanding the context window improves consistency across repeated runs using the same prompts and retrieval setup.
Performance Overview – Model Consistency Check
| Metric (rounded) | 30K Context (B) | 124K Context (D) |
|---|---|---|
| Obvious differences, but same meaning | 63% | 63% |
| Very slight difference (e.g., word choice) | 11% | 14% |
| Identical answers (sources can vary) | 26% | 21% |
| One or more answers differ in actual meaning | 0% | 1% |
This comparison shows that increasing the context window from 30,000 to 124,000 tokens did not improve consistency. The 30K version returned identical or near-identical responses slightly more often (37% versus 35%, combining the identical and very-slight-difference rows). Answers that differed in actual meaning remained rare in both configurations (0% versus 1%).
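The combined figures can be reproduced from the rounded table values. The following is a minimal sketch, not part of the original benchmark tooling; the dictionary keys and the `summarize` helper are illustrative names chosen here, and the inputs are the rounded percentages from the table, so totals may not sum to exactly 100.

```python
# Rounded percentages copied from the table (B = 30K context, D = 124K context).
# Key names are hypothetical labels for the four table rows.
results = {
    "30K (B)": {
        "obvious_diff_same_meaning": 63,
        "very_slight_diff": 11,
        "identical": 26,
        "meaning_differs": 0,
    },
    "124K (D)": {
        "obvious_diff_same_meaning": 63,
        "very_slight_diff": 14,
        "identical": 21,
        "meaning_differs": 1,
    },
}


def summarize(rates):
    """Combine table rows into the aggregate figures discussed in the text."""
    return {
        # Identical plus very slight differences (word choice only).
        "identical_or_near": rates["identical"] + rates["very_slight_diff"],
        # Every category in which the meaning of the answers is preserved.
        "same_meaning": (
            rates["identical"]
            + rates["very_slight_diff"]
            + rates["obvious_diff_same_meaning"]
        ),
        # Sum of all rows; may deviate from 100 because inputs are rounded.
        "total": sum(rates.values()),
    }


summary = {name: summarize(rates) for name, rates in results.items()}

for name, stats in summary.items():
    print(name, stats)
```

The 124K column sums to 99 rather than 100, which is consistent with the table header noting that the metrics are rounded.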
Example
Question: “What are the ESG integration criteria?”
30K: One run answered in paragraph form; another used bullet points.
124K: Similar shifts in structure and emphasis were observed between runs.
Conclusion
Expanding the context window to 124,000 tokens did not enhance output consistency in this benchmark. The 30,000-token setup offered slightly better repeatability in terms of identical and closely matching responses. These results suggest that, for short to medium prompts, increasing the context limit does not necessarily improve stability; its benefits may lie more in handling long, complex documents.
RAG Configuration Details
Context window:
Version B: 30,000 tokens
Version D: 124,000 tokens
Chunk relevancy sorting:
Search method:
LLM seed:
Temperature: