Retrieval Performance and Scalability Evaluation


Key Takeaways

Accuracy: The system retrieves the correct document chunk within the top 200 results 97.6–98.7% of the time across all tested datasets (Hybrid search, baseline KB).
Token Budget: A 50k token context window captures 97%+ of relevant chunks for well-structured documents. Complex financial documents benefit from a 75k token budget.
Hybrid Search: Combining vector and keyword search consistently improves ranking quality by +5 to +9 Mean Reciprocal Rank (MRR) points over vector-only, with no meaningful latency penalty. Recommended as default.
Scalability: Retrieval quality remains stable as the knowledge base grows from approximately 136k chunks (representing ~8,000–15,000 financial documents) to 400k chunks (representing ~15,000–25,000 financial documents).
Latency: A single search query completes in ~0.9–1.2 seconds end-to-end (including embedding generation and network round-trip).

Introduction

Retrieval is the foundation of Retrieval-Augmented Generation (RAG). Before the language model can generate an answer, the retrieval layer must find the right document chunks from a potentially large knowledge base. The quality of the AI's response depends directly on the quality of this retrieval step — if the relevant information is not retrieved, no amount of language model sophistication can compensate.

We conducted a systematic evaluation of our retrieval layer to answer five practical questions:

Does the system find the right documents? — How accurately does retrieval identify relevant content?
How much context window do we need? — What token budget is sufficient to capture relevant information?
Should we enable Hybrid search? — Does combining keyword and vector search improve results?
Can we grow the knowledge base without losing accuracy? — How does retrieval quality scale?
How fast is it? — What latency can users expect for a single search query?

We benchmarked across three datasets representing different document types commonly found in financial services, using four search modes and four knowledge base sizes ranging from ~136,000 to ~400,000 chunks.

Methodology

Datasets

We benchmarked across three datasets representing different document types and complexity levels relevant to the financial services industry:

Dataset	Description	Questions	Documents	Pages	Chunks
HR	Human resources policies and employment guidelines. Semi-structured corporate documents typical of internal knowledge bases.	82	16	370	588
Legal	EU regulatory documents — directives, regulations, and compliance texts. Highly structured with formal language, standard in regulatory and compliance workflows.	77	56	1,108	2,484
Due Diligence	Financial reports, investment documents, and mixed-format content including tables and charts. Representative of research and due diligence processes.	94	9	299	393

These datasets cover the spectrum of document types typically encountered in enterprise knowledge bases: highly structured regulatory text, semi-structured corporate policies, and complex mixed-format financial documents.

Independent Variables

Variable	Levels
Dataset type	HR, Legal, Due Diligence
Search mode	Hybrid (DBSF), Vector Only, Elasticsearch Only, Postgres FTS
Knowledge base size	~136k chunks, ~200k, ~300k, ~400k chunks

Hybrid: Combines dense vector search with keyword search (BM25), merging results using Distribution-Based Score Fusion (DBSF). Both search components run in parallel, so hybrid latency is driven by the slower of the two (typically the vector component). The Appendix discusses the fusion strategies we evaluated and why we chose DBSF.
Vector Only: Pure semantic search using dense vector embeddings and approximate nearest neighbor (HNSW) graph traversal.
Elasticsearch Only: Keyword search using BM25 ranking, with no vector component.
Knowledge Base sizes:
- The 136k-chunk tier serves as the “baseline” and represents the initial state of the knowledge base, populated with a core set of financial documents.
- For each subsequent tier (~200k, ~300k, and ~400k chunks), additional financial documents of varying formats and structures were ingested to progressively expand the knowledge base. These tiers correspond approximately to 12,000–18,000, 14,000–22,000, and 15,000–25,000 financial documents, respectively. While the primary purpose of these additions was to introduce noise and evaluate how retrieval performance scales with increasing corpus size, the ingested content remained representative of the document types commonly found in Financial Services Industry (FSI) knowledge bases. This ensured that evaluation conditions remained realistic and operationally relevant across all corpus sizes.

Measures

Measure	Definition
Recall@k	The fraction of relevant (expected) chunks found within the top k retrieved results. Recall@200 = 0.95 means 95% of relevant chunks appear in the top 200.
MRR (Mean Reciprocal Rank)	Average of 1/rank, where rank is the position of the first relevant chunk. MRR = 0.50 means the first relevant chunk appears, on average, at position 2. Higher is better.
Latency	End-to-end wall-clock time for a single search query, including embedding generation (via cloud API), vector search, keyword search (for Hybrid mode), result merging, and full HTTP round-trip.
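The two quality measures above can be sketched in a few lines of Python. This is a minimal illustration of the definitions in the table, not the evaluation harness itself; the chunk IDs and the three sample questions are hypothetical.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked list, relevant set) pairs."""
    runs = list(runs)
    return sum(reciprocal_rank(ranked, rel) for ranked, rel in runs) / len(runs)


# Hypothetical data: one golden chunk per question.
questions = [
    (["c7", "c2", "c9"], {"c7"}),  # golden chunk at rank 1 -> RR = 1
    (["c4", "c1", "c3"], {"c3"}),  # golden chunk at rank 3 -> RR = 1/3
    (["c5", "c6", "c8"], {"c0"}),  # golden chunk not retrieved -> RR = 0
]
mrr = mean_reciprocal_rank(questions)
recall_at_2 = sum(recall_at_k(r, rel, 2) for r, rel in questions) / len(questions)
print(f"MRR = {mrr:.3f}, Recall@2 = {recall_at_2:.3f}")
```

Note that a question whose golden chunk is never retrieved contributes 0 to MRR, so MRR penalizes both late and missing results.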

Experiment Setup

Benchmarking environment: Dedicated test infrastructure matching production configuration
Search configuration: Production-aligned search parameters, listed below.

Parameter	Value
`limit`	200 (chunks returned per search call)
`hnsw_ef` (Qdrant specific)	256 (HNSW search beam width)
`oversampling` (Qdrant specific)	2.0 (quantization pre-fetch multiplier)
`rescore` (Qdrant specific)	off

For a detailed analysis of how limit and hnsw_ef interact and their impact on latency, see Appendix.
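As an illustration, the parameters in the table could be expressed as the body of a Qdrant REST search request. This is a sketch under assumptions: the placement of `hnsw_ef` and the `quantization` options inside `params` follows Qdrant's search API, while the query vector is a made-up placeholder and the actual request used in the benchmark is not shown in this report.

```python
import json

# Hypothetical query vector; in practice it comes from the embedding model.
query_vector = [0.12, -0.03, 0.58]

# Body for a Qdrant search request, using the values from the table above.
search_request = {
    "vector": query_vector,
    "limit": 200,  # chunks returned per search call
    "params": {
        "hnsw_ef": 256,  # HNSW search beam width
        "quantization": {
            "rescore": False,     # rescoring switched off
            "oversampling": 2.0,  # quantization pre-fetch multiplier
        },
    },
}

print(json.dumps(search_request, indent=2))
```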

Runs: Each configuration tested once per question. Determinism verified in prior studies (Jaccard similarity >= 0.99 across repeated runs)
Golden chunks: Each benchmark question has a human-verified "correct" chunk that the system should retrieve
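Total configurations: 3 datasets x 4 search modes x 4 KB sizes = 48 experiment runs. Each run covers the questions of one dataset, so every combination of search mode and KB size is evaluated on all 253 benchmark questions (82 + 77 + 94).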
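The experiment grid and the determinism check can be sketched as follows. The Jaccard function is the standard set-overlap definition; the two sample result lists are hypothetical and only show how the score is computed, not the data behind the >= 0.99 figure.

```python
from itertools import product

datasets = ["HR", "Legal", "Due Diligence"]
search_modes = ["Hybrid (DBSF)", "Vector Only", "Elasticsearch Only", "Postgres FTS"]
kb_sizes = ["~136k", "~200k", "~300k", "~400k"]

# Every (dataset, search mode, KB size) combination is one experiment run.
configurations = list(product(datasets, search_modes, kb_sizes))
print(len(configurations))


def jaccard(a, b) -> float:
    """Size of the intersection divided by size of the union of two ID sets."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # two empty result sets are treated as identical
    return len(a & b) / len(a | b)


# Hypothetical retrieved chunk IDs from two runs of the same query.
run_1 = ["c1", "c2", "c3", "c4"]
run_2 = ["c1", "c2", "c3", "c5"]
print(jaccard(run_1, run_2))  # 3 shared IDs out of 5 distinct IDs
```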

Multi-search-string retrieval and Recall@k

Agentic search does not issue a single search call per tool call: each tool call generates 4–6 search strings, and each search string triggers its own API call with limit=200. Results are interleaved and deduplicated into a single ranked list. Because each search string contributes up to 200 results and the result sets overlap only partially, the final pool per question is substantially larger than 200:

Dataset	Avg. Search Strings per tool call	Avg. Unique Chunks Retrieved
HR	4.6	429
Legal	5.6	502
Due Diligence	5.4	328

This is why recall curves in this report extend to k=500: limit controls each individual search call, not the total candidate pool. Recall@k is computed on the full aggregated list — the same list the production system feeds to the language model.
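A minimal sketch of the aggregation step, assuming a simple round-robin interleave that keeps the first occurrence of each chunk. The report only states that results are "interleaved and deduplicated", so the exact ordering rule here is an assumption, and the chunk IDs are hypothetical.

```python
from itertools import zip_longest
from typing import List, Sequence


def interleave_and_dedupe(result_lists: Sequence[Sequence[str]]) -> List[str]:
    """Round-robin merge of ranked lists, keeping each chunk ID once."""
    merged: List[str] = []
    seen = set()
    # zip_longest yields rank 1 of every list, then rank 2, and so on,
    # padding exhausted lists with None.
    for same_rank in zip_longest(*result_lists):
        for chunk_id in same_rank:
            if chunk_id is not None and chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


# Three hypothetical search strings with partially overlapping results.
per_string_results = [
    ["a", "b", "c"],
    ["b", "d"],
    ["a", "e", "f"],
]
pool = interleave_and_dedupe(per_string_results)
print(pool)  # ['a', 'b', 'd', 'e', 'c', 'f']
```

With eight returned results but only six unique chunks, the pool is larger than any single list yet smaller than the sum of all lists, which is the same effect the table above shows at production scale.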

Results

Each sub-section below varies one independent variable while keeping the others fixed. This isolates the effect of each variable on our three measures (Recall@k, MRR, Latency).

Effect of Search Mode

Fixed: Baseline KB (~136k chunks), averaged across dataset types | Varies: Search mode (Hybrid vs Vector Only vs Elasticsearch Only vs Postgres FTS)

Retrieval Accuracy (Recall@k)

Averaged across all three datasets, Hybrid (DBSF) achieves the highest recall from k=5 onwards, reaching 98.0% by k=200. Vector Only follows at 95.7%. Elasticsearch BM25 plateaus lower at 93.2%. The gap between modes is widest in the mid-range (k=50–150): at k=50, Hybrid leads at 92.9% compared to 86.8% for Elasticsearch BM25, a 6.1 percentage point difference. By k=200, Hybrid and Vector Only both exceed 95%, while Elasticsearch BM25 remains approximately 5 points behind. Notably, Vector Only continues climbing beyond k=200, reaching 96.9% by k=500, suggesting it finds the relevant chunks reliably — just not as near the top of the ranking.

Ranking Quality (MRR)

Elasticsearch BM25 ranks the relevant chunk highest on average (MRR = 0.595), driven by strong performance on HR (0.745) where keyword matching excels on structured policies. Hybrid (DBSF) follows closely (0.584), with consistent ranking quality across HR (0.712) and Legal (0.669). Vector Only (0.509) ranks lowest across all datasets. The narrow gap between Elasticsearch BM25 and Hybrid reflects that both benefit from lexical precision — Hybrid's combined signal yields comparable ranking quality while also providing stronger recall coverage.

Latency

Elasticsearch BM25 is the fastest mode at 215 ms on average, roughly 4× faster than other search modes. Vector Only (844 ms) and Hybrid (DBSF) (837 ms) are virtually identical in latency — Hybrid is marginally faster in these measurements, confirming that keyword and vector search run in parallel and wall-clock time is governed by the slower Qdrant component. Enabling Hybrid search therefore carries no meaningful latency cost over Vector Only.
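The latency behaviour of parallel execution can be illustrated with a small asyncio sketch. The two search functions are stand-ins that only sleep; their names, delays, and return values are hypothetical and do not reflect the production code.

```python
import asyncio
import time


async def vector_search(query: str) -> list:
    await asyncio.sleep(0.3)  # stand-in for the slower vector component
    return ["v1", "v2"]


async def keyword_search(query: str) -> list:
    await asyncio.sleep(0.2)  # stand-in for the faster BM25 component
    return ["k1", "v1"]


async def hybrid_search(query: str):
    start = time.perf_counter()
    # Both components start together; gather returns once both have finished.
    vector_hits, keyword_hits = await asyncio.gather(
        vector_search(query), keyword_search(query)
    )
    elapsed = time.perf_counter() - start
    return vector_hits, keyword_hits, elapsed


vector_hits, keyword_hits, elapsed = asyncio.run(hybrid_search("notice period"))
print(f"hybrid wall-clock: {elapsed:.2f}s (sequential would be about 0.5s)")
```

Wall-clock time tracks the slower call rather than the sum of both, which is why adding the keyword component leaves Hybrid latency close to Vector Only.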

Effect of Dataset Type

Fixed: Baseline KB (~136k chunks) | Varies: Dataset (HR, Legal, Due Diligence)

Retrieval Accuracy (Recall@k)

All three datasets reach high recall (>97%) by k=200, but differ in how quickly they climb. HR is fastest, reaching 96.3% by k=50 thanks to its clear policy structure. Legal follows closely, jumping from 84.4% at k=10 to 96.1% at k=50 as its formal, structured language is well-suited to retrieval. Due Diligence starts lower (25.5% at k=1, 86.2% at k=50) but converges to 97.9% by k=200 — the mixed-format content (tables, charts) requires more candidates before the relevant chunk surfaces. All three plateau beyond k=200.

Ranking Quality (MRR)

MRR reveals larger differences between datasets than recall. HR leads with MRR = 0.712, meaning the relevant chunk typically appears at position 1–2. Legal follows at 0.669 (also position 1–2), and Due Diligence at 0.371 (around position 3). The pattern reflects document structure: well-organized policies (HR) are easiest to rank correctly, while diverse financial documents with mixed formats (Due Diligence) present a greater ranking challenge.
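The position estimates above come from inverting MRR. A short sketch of that arithmetic, using the MRR values reported in this section; note that 1/MRR is the harmonic mean of the first-relevant rank, so it is a rough indicator of typical position rather than an arithmetic average.

```python
def implied_rank(mrr: float) -> float:
    """Rank whose reciprocal equals the given MRR (harmonic-mean rank)."""
    return 1.0 / mrr


mrr_by_dataset = {"HR": 0.712, "Legal": 0.669, "Due Diligence": 0.371}
for name, mrr in mrr_by_dataset.items():
    print(f"{name}: MRR {mrr:.3f} -> implied rank {implied_rank(mrr):.2f}")
```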

Latency

Search latency varies modestly by dataset, with all three completing in the 826–872 ms range per search string. HR is the slowest at 872 ms (16 documents), followed by Legal at 850 ms (56 documents), and Due Diligence at 826 ms (9 documents). The HR and Legal corpora tend to contain longer, denser documents, which may result in slightly longer embedding and retrieval times compared to the more concise Due Diligence summaries. All latencies are end-to-end, including embedding generation, vector search, keyword search, result merging, and HTTP round-trip.

Effect of Knowledge Base Size

Fixed: Hybrid (DBSF) search | Varies: KB size (Baseline ~136k to ~200k to ~300k to ~400k)
Note: The KB scaling results in this section were produced in a separate evaluation run from the baseline results reported in the Effect of Search Mode and Effect of Dataset Type sections. Both runs used identical configuration. MRR is sensitive to the exact rank position of the first relevant chunk, so small shifts in ranking order between runs can produce minor differences in absolute MRR values between this section and those two sections. Recall@200 is consistent across runs; small differences at lower k values are within normal run-to-run variance.

Retrieval Accuracy (Recall@k)

Recall remains stable as the knowledge base grows. Across all three datasets, Recall@200 holds between 97.4% and 98.7% from the baseline (~136k chunks) through to the largest tier (~400k chunks). Within each dataset, the change is less than 0.5 percentage points while the collection grows to roughly three times its baseline size. The pattern is consistent across lower retrieval depths as well, with all tiers converging from k=50 onwards. The system reliably surfaces the relevant chunk within the top 200 results regardless of how much the knowledge base has grown.

Ranking Quality (MRR)

MRR shows a minimal decline across all datasets as the KB grows, dropping on average from 0.566 at baseline to 0.555 at ~400k, a difference of 0.011 over the full scaling range. The additional documents introduced at each KB tier do not meaningfully push relevant results further down the ranking: the hybrid scoring continues to place them near the top with the same consistency as at baseline. For HR and Legal, the relevant chunk typically appears within the first 1–2 positions; for Due Diligence it appears around position 3 (MRR ≈ 0.36), reflecting the greater retrieval challenge of mixed-format financial documents rather than any KB scaling effect.

Dataset	MRR Change (baseline → ~400k)
HR	-0.016
Legal	-0.013
Due Diligence	-0.003

None of the datasets show meaningful degradation. For context on absolute MRR levels and per-dataset ranking characteristics, see the Effect of Dataset Type section.

Latency

Latency increases gradually as the knowledge base grows, rising from ~849 ms per search string at baseline (~136k chunks) to ~892 ms at ~400k chunks — a total increase of approximately 5% across three tiers. The growth is consistent and monotonic, with each tier adding roughly 1–3% overhead over the previous.

The pattern is consistent across all three datasets (HR, Legal, Due Diligence), with absolute per-search-string latencies remaining in the 750–950 ms range throughout.

Conclusion

We return to the five questions that motivated this evaluation:

Does the system find the right documents?

Yes. With Hybrid search, the system retrieves the correct document chunk within the top 200 results 97–99% of the time across all three datasets. Even with Vector-Only search, recall exceeds 95%. The retrieval layer reliably identifies relevant content regardless of document type.

How much context window do we need?

50k tokens for standard documents, 75k for complex ones. HR and Legal plateau at 50k tokens (97.6–98.7% recall), with no gain from increasing the budget further. Due Diligence requires 75k to reach comparable levels (96.8%), as its mixed-format content tends to rank the relevant chunk lower, requiring more context before it is reached. A 75k budget covers all document types without significant overhead.
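To make the notion of a token budget concrete, the sketch below cuts a ranked list at a budget and checks whether the golden chunk survives. The report does not describe how the budget is applied, so the greedy cut-off rule, the function name, and the chunk sizes are all assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def chunks_within_budget(
    ranked_chunks: Sequence[Tuple[str, int]], token_budget: int
) -> List[str]:
    """Take chunks in rank order until the next one would exceed the budget."""
    selected: List[str] = []
    used = 0
    for chunk_id, n_tokens in ranked_chunks:
        if used + n_tokens > token_budget:
            break
        selected.append(chunk_id)
        used += n_tokens
    return selected


# Hypothetical ranked list of (chunk ID, token count); golden chunk is "c3".
ranked = [("c1", 400), ("c2", 350), ("c3", 500), ("c4", 300)]

small = chunks_within_budget(ranked, token_budget=1000)
large = chunks_within_budget(ranked, token_budget=1500)
print("c3" in small, "c3" in large)  # False True
```

A chunk that ranks lower needs a larger budget before it is included, which matches the observation that Due Diligence benefits from 75k tokens.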

Should we enable Hybrid search?

Yes, recommended as the default. Hybrid search provides consistent quality gains over Vector Only, particularly in ranking quality, where improvements range from +5 to +9 MRR points depending on the dataset. It adds minimal latency because keyword and vector search run in parallel. There is no meaningful downside to enabling it.

Can we grow the knowledge base without losing accuracy?

Yes. Recall remains stable across all KB sizes. All three datasets maintain Recall@200 of at least 97.4% as the knowledge base grows from ~136k to ~400k chunks, with no meaningful degradation. MRR shows a small average decline of 0.011 across the full scaling range, which is expected as more candidates compete for top ranking positions. The system scales gracefully without sacrificing retrieval quality.

How fast is it?

~0.9–1.2 seconds per search query at a limit of 200 chunks, end-to-end. This includes embedding generation via cloud API, vector search, keyword search (for Hybrid), result merging, and full HTTP round-trip. Latency scales gradually with KB size, adding approximately 5% overhead when growing from ~136k to ~400k chunks.

Practical Recommendations

Recommendation	Detail
Default search mode	Enable Hybrid (DBSF) — better quality, negligible latency cost
Token budget	50k tokens for standard documents, 75k for complex/diverse content
Scalability	KB can grow to 400k+ chunks with stable recall on well-structured documents

Appendix

How `limit` and `hnsw_ef` Interact

Qdrant internally computes the effective HNSW beam width as:

effective_ef = max(hnsw_ef, limit)

This means hnsw_ef only controls search quality when it exceeds limit. If limit is larger, it silently becomes the beam width — an HNSW search cannot return more results than candidates it explored.

hnsw_ef	limit	effective_ef	hnsw_ef controls?
256	100	256	Yes
256	200	256	Yes
256	500	500	No — limit dominates
256	1000	1000	No — limit dominates
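The rule and the table above can be reproduced with a one-line function. This is a sketch of the relationship as described in this appendix, not Qdrant's internal code.

```python
def effective_ef(hnsw_ef: int, limit: int) -> int:
    """Beam width actually used: the larger of hnsw_ef and limit."""
    return max(hnsw_ef, limit)


HNSW_EF = 256
for limit in (100, 200, 500, 1000):
    ef = effective_ef(HNSW_EF, limit)
    controls = "hnsw_ef controls" if limit <= HNSW_EF else "limit dominates"
    print(f"hnsw_ef={HNSW_EF} limit={limit} effective_ef={ef} ({controls})")
```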

Latency impact. The following single-variable benchmarks use the Legal dataset (134K chunks) with full end-to-end latency (HTTP round-trip, embedding, search, serialization).

Varying limit (hnsw_ef fixed at 128):

limit	effective_ef	mean latency
10	128	1,091 ms
100	128	1,340 ms
200	200	2,954 ms
500	500	6,097 ms
1000	1000	9,495 ms

While limit <= hnsw_ef, latency stays roughly constant. Once limit exceeds hnsw_ef, latency jumps: relative to limit=100, limit=200 takes 2.2x as long and limit=1000 takes 7.1x as long.

Varying hnsw_ef (limit fixed at 100):

hnsw_ef	effective_ef	mean latency
128	128	1,459 ms
256	256	2,214 ms
512	512	3,534 ms
1024	1024	5,599 ms

Each doubling of the beam width adds roughly 50–60% latency (52%, 60%, and 58% across the three steps).
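The ratios quoted for both tables can be re-derived from the reported mean latencies. The sketch below uses only the numbers from the two tables above.

```python
# Mean latency in ms, keyed by limit (hnsw_ef fixed at 128).
latency_by_limit = {10: 1091, 100: 1340, 200: 2954, 500: 6097, 1000: 9495}
# Mean latency in ms, keyed by hnsw_ef (limit fixed at 100).
latency_by_ef = {128: 1459, 256: 2214, 512: 3534, 1024: 5599}

# Slowdown of each limit relative to limit=100.
slowdown = {
    limit: ms / latency_by_limit[100] for limit, ms in latency_by_limit.items()
}
for limit, ratio in slowdown.items():
    print(f"limit={limit}: {ratio:.2f}x the latency of limit=100")

# Percentage increase for each doubling of hnsw_ef.
ef_values = sorted(latency_by_ef)
step_increase = {
    (lo, hi): (latency_by_ef[hi] / latency_by_ef[lo] - 1) * 100
    for lo, hi in zip(ef_values, ef_values[1:])
}
for (lo, hi), pct in step_increase.items():
    print(f"hnsw_ef {lo} -> {hi}: +{pct:.1f}%")
```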

Parallel search string overhead. The system sends multiple search strings per question concurrently. The resulting overhead is sub-linear: wall-clock time grows with the number of parallel searches, but much less than proportionally.

Parallel searches	Overhead vs single
1–3	5–21%
4–5	51–63%
7–10	85–156%

With 4–5 search strings per question (our typical range), the wall-clock time for the full retrieval step is roughly 1.5–1.6x a single search call, not 4–5x.

Choice of Fusion Strategy for Hybrid Search

We evaluated three strategies for merging vector and keyword search results in Hybrid mode:

Strategy	Description
Interleaved (ZIP)	Alternates results from each search type based on a weight ratio
Reciprocal Rank Fusion (RRF)	Assigns scores based on rank position in each list
Distribution-Based Score Fusion (DBSF)	Normalizes raw scores from each search type and computes a weighted average

All three strategies produce comparable Recall@200 — the choice of merge strategy does not significantly affect whether the relevant chunk is found in the top 200 results. Differences are within 1 percentage point across all datasets.

DBSF provides a modest advantage in ranking quality (MRR), meaning the relevant chunk tends to appear slightly earlier in the result list. This is why DBSF is our recommended default. However, the fusion strategy is not a critical lever — retrieval quality is driven primarily by the underlying search engines, not by how their results are merged.
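To illustrate the difference between rank-based and score-based fusion, here is a minimal sketch of RRF and DBSF. Several details are assumptions rather than the production implementation: the RRF constant k=60 is a common convention, the DBSF normalization uses mean ± 3 standard deviations as the score range, missing documents contribute 0, and all chunk IDs and scores are hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, Sequence


def rrf(rankings: Sequence[Sequence[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: sum 1/(k + rank) over every list."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores


def normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Map raw scores to [0, 1] using mean +/- 3 standard deviations (assumed)."""
    mu, sigma = mean(scores.values()), pstdev(scores.values())
    if sigma == 0:
        return {doc_id: 0.5 for doc_id in scores}
    low, high = mu - 3 * sigma, mu + 3 * sigma
    return {
        doc_id: min(1.0, max(0.0, (s - low) / (high - low)))
        for doc_id, s in scores.items()
    }


def dbsf(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    vector_weight: float = 0.5,
) -> Dict[str, float]:
    """Weighted average of per-engine normalized scores."""
    v, kw = normalize(vector_scores), normalize(keyword_scores)
    return {
        doc_id: vector_weight * v.get(doc_id, 0.0)
        + (1 - vector_weight) * kw.get(doc_id, 0.0)
        for doc_id in set(v) | set(kw)
    }


# Hypothetical raw scores: cosine similarity and BM25 live on different scales.
vector_scores = {"c1": 0.82, "c2": 0.79, "c3": 0.40}
keyword_scores = {"c2": 14.1, "c4": 9.3, "c3": 8.7}

fused = dbsf(vector_scores, keyword_scores)
ranking = sorted(fused, key=fused.get, reverse=True)
print(ranking[0])  # c2 scores well in both engines

rrf_scores = rrf([["c1", "c2", "c3"], ["c2", "c4", "c3"]])
print(max(rrf_scores, key=rrf_scores.get))
```

RRF looks only at positions, while DBSF keeps the size of the score gaps within each engine after putting both engines on a common scale.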
