Elasticsearch Benchmarking

Purpose

This benchmarking report compares two GPT-based response spaces, Version E and Version F, on consistency, accuracy, and overall answer quality. The comparison was performed in two phases:

Version E (vector search) vs Version F (vector search + elastic search)

Version F internal consistency

The goal is to identify which space delivers more reliable, complete, and contextually accurate results, using both automated and manual flag evaluation methods.

How It Works

Each comparison run evaluates new answers against reference answers along four dimensions:

Contradiction – Do the answers conflict in meaning?
Extent – Is one answer significantly longer or shorter than the other?
Hallucination – Does the answer include unsupported content?
Sources – Are sources missing compared to the human-evaluated ground truth?

A human first evaluates and confirms the ground truth before comparisons take place. The scoring uses an LLM as a judge to assess answer differences across these dimensions.
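
One way to make the judge's output machine-checkable is to have it return one boolean per dimension and parse the reply into a small record. The sketch below is an assumption, not the pipeline used for this report: the JSON shape, the `Verdict` class, and `parse_verdict` are hypothetical names chosen for illustration.

```python
import json
from dataclasses import dataclass

# The four dimensions listed above; the key names are illustrative.
DIMENSIONS = ("contradiction", "extent", "hallucination", "missing_sources")


@dataclass(frozen=True)
class Verdict:
    """One judge verdict for a (reference answer, new answer) pair."""
    contradiction: bool
    extent: bool
    hallucination: bool
    missing_sources: bool


def parse_verdict(raw: str) -> Verdict:
    """Parse a judge reply that is assumed to be a JSON object of booleans."""
    data = json.loads(raw)
    missing = [d for d in DIMENSIONS if d not in data]
    if missing:
        raise ValueError(f"judge reply lacks dimensions: {missing}")
    for d in DIMENSIONS:
        if not isinstance(data[d], bool):
            raise ValueError(f"dimension {d!r} must be true or false")
    return Verdict(**{d: data[d] for d in DIMENSIONS})


# Hypothetical judge reply for a single answer pair.
reply = '{"contradiction": false, "extent": true, "hallucination": false, "missing_sources": true}'
verdict = parse_verdict(reply)
```

Rejecting replies with missing or non-boolean fields keeps a malformed judge response from silently counting as "no deviation".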

Performance Overview – Version Consistency Check

E vs F Comparison

Metric (rounded)	E	F
Contradiction	26%	34%
Extent	30%	38%
Hallucination	10%	5%
Missing Sources	62%	63%

Difference category	% of answers
One or more answers differ in actual meaning	0%
Obvious differences, but same meaning	48.23%
Very slight difference (e.g., word choice)	24.82%
Identical answers (sources can vary)	26.95%

In this comparison, 51.77% of the answers were flagged for further review. A flag is triggered when the new answer differs from the benchmark in at least one aspect: contradiction, extent, hallucination, or source deviation.
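
The flag rule amounts to a logical OR over the four dimensions. A minimal sketch, assuming each evaluated answer is represented as a dictionary of per-dimension booleans (a hypothetical format). The sample values are invented for illustration and are not the report's data. The report does not state which denominator its per-dimension percentages use; this sketch divides by all evaluated answers.

```python
from typing import Dict, List

DIMENSIONS = ("contradiction", "extent", "hallucination", "missing_sources")


def is_flagged(verdict: Dict[str, bool]) -> bool:
    """An answer is flagged when at least one dimension deviates."""
    return any(verdict[d] for d in DIMENSIONS)


def rates(verdicts: List[Dict[str, bool]]) -> Dict[str, float]:
    """Percentage of answers flagged overall and per dimension."""
    total = len(verdicts)
    if total == 0:
        raise ValueError("no verdicts to aggregate")
    out = {d: 100 * sum(v[d] for v in verdicts) / total for d in DIMENSIONS}
    out["flagged"] = 100 * sum(is_flagged(v) for v in verdicts) / total
    return out


# Invented toy data: four answers, two of which trip at least one dimension.
sample = [
    {"contradiction": False, "extent": False, "hallucination": False, "missing_sources": False},
    {"contradiction": True, "extent": True, "hallucination": False, "missing_sources": False},
    {"contradiction": False, "extent": False, "hallucination": False, "missing_sources": True},
    {"contradiction": False, "extent": False, "hallucination": False, "missing_sources": False},
]
summary = rates(sample)
print(summary["flagged"])  # 50.0
```

Because one answer can trip several dimensions at once, the per-dimension percentages need not add up to the overall flagged share.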

F vs F (Internal Consistency)

Metric	%
Contradiction	29.17%
Extent	30.56%
Hallucination	8.33%
Missing Sources	61.11%

Difference category	% of answers
One or more answers differ in actual meaning	0%
Obvious differences, but same meaning	48.23%
Very slight difference (e.g., word choice)	24.82%
Identical answers (sources can vary)	26.95%

In the internal comparison, 51.06% of the answers were flagged for potential deviations.
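
An internal comparison of this kind pairs two sets of answers to the same questions. As a rough sketch, a textual pre-screen could separate pairs that are identical after normalization from pairs that still need the judge or a human reviewer. Everything here is an assumption for illustration: the report does not describe how answers were paired or categorized, and `normalize`, `split_pairs`, and the question IDs are hypothetical.

```python
import re
from typing import Dict, List, Tuple


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting is ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()


def split_pairs(run_a: Dict[str, str], run_b: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (identical IDs, differing IDs) for questions present in both runs."""
    identical: List[str] = []
    differing: List[str] = []
    for qid in sorted(run_a.keys() & run_b.keys()):
        if normalize(run_a[qid]) == normalize(run_b[qid]):
            identical.append(qid)
        else:
            differing.append(qid)
    return identical, differing


# Invented toy answers keyed by a hypothetical question ID.
first_run = {"q1": "Quarterly reports are used.", "q2": "Yes."}
second_run = {"q1": "quarterly  reports are used.", "q2": "No."}
same, diff = split_pairs(first_run, second_run)
```

String equality only identifies the "identical" bucket; deciding whether a differing pair keeps the same meaning still requires the judge or manual review.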

Qualitative Evaluation (Version Comparison)

Version E tends to produce very concise and direct answers, often omitting additional context. While efficient, this approach can reduce clarity and completeness.

Version F delivers more complete and informative responses. It strikes a better balance between brevity and clarity, and it was flagged for hallucination less often than Version E (5% vs 10% in the E vs F comparison).

Additionally, the consistency observed in the internal comparison supports Version F's reliability for production use.

Example

“Describe the ongoing monitoring process for private market investments.”

Version E Answer:
“The company employs a comprehensive approach to monitoring.”
Version F Answer:
“Monitoring includes quarterly reports, performance tracking, fund manager engagement, and risk reassessments. The company uses internal Versions to flag underperformance and conducts regular check-ins to ensure alignment with investment strategy.”

Conclusion

Version F consistently delivers the more reliable and well-rounded performance of the two, offering high-quality answers with strong clarity and consistency. While Version E excels at concise and direct responses, it often lacks the depth and flexibility found in Version F. The strong internal consistency observed in Version F further reinforces its robustness and suitability for production environments.