Retrieval Performance and Scalability Evaluation

Key Takeaways

  • Accuracy: The system retrieves the correct document chunk within the top 200 results 97.6–98.7% of the time across all tested datasets (Hybrid search, baseline KB).

  • Token Budget: A 50k token context window captures 97%+ of relevant chunks for well-structured documents. Complex financial documents benefit from a 75k token budget.

  • Hybrid Search: Combining vector and keyword search consistently improves ranking quality by +5 to +9 Mean Reciprocal Rank (MRR) points over vector-only, with no meaningful latency penalty. Recommended as default.

  • Scalability: Retrieval quality remains stable as the knowledge base grows from approximately 136k chunks (representing ~8,000–15,000 financial documents) to 400k chunks (representing ~15,000–25,000 financial documents).

  • Latency: A single search query completes in ~0.9–1.2 seconds end-to-end (including embedding generation and network round-trip).


Introduction

Retrieval is the foundation of Retrieval-Augmented Generation (RAG). Before the language model can generate an answer, the retrieval layer must find the right document chunks from a potentially large knowledge base. The quality of the AI's response depends directly on the quality of this retrieval step — if the relevant information is not retrieved, no amount of language model sophistication can compensate.

We conducted a systematic evaluation of our retrieval layer to answer five practical questions:

  1. Does the system find the right documents? — How accurately does retrieval identify relevant content?

  2. How much context window do we need? — What token budget is sufficient to capture relevant information?

  3. Should we enable Hybrid search? — Does combining keyword and vector search improve results?

  4. Can we grow the knowledge base without losing accuracy? — How does retrieval quality scale?

  5. How fast is it? — What latency can users expect for a single search query?

We benchmarked across three datasets representing different document types commonly found in financial services, using four search modes and four knowledge base sizes ranging from ~136,000 to ~400,000 chunks.


Methodology

Datasets

We benchmarked across three datasets representing different document types and complexity levels relevant to the financial services industry:

| Dataset | Description | Questions | Documents | Pages | Chunks |
|---|---|---|---|---|---|
| HR | Human resources policies and employment guidelines. Semi-structured corporate documents typical of internal knowledge bases. | 82 | 16 | 370 | 588 |
| Legal | EU regulatory documents: directives, regulations, and compliance texts. Highly structured with formal language, standard in regulatory and compliance workflows. | 77 | 56 | 1,108 | 2,484 |
| Due Diligence | Financial reports, investment documents, and mixed-format content including tables and charts. Representative of research and due diligence processes. | 94 | 9 | 299 | 393 |

These datasets cover the spectrum of document types typically encountered in enterprise knowledge bases: highly structured regulatory text, semi-structured corporate policies, and complex mixed-format financial documents.

Independent Variables

| Variable | Levels |
|---|---|
| Dataset type | HR, Legal, Due Diligence |
| Search mode | Hybrid (DBSF), Vector Only, Elasticsearch Only, Postgres FTS |
| Knowledge base size | ~136k chunks, ~200k chunks, ~300k chunks, ~400k chunks |

  • Hybrid: Combines dense vector search with keyword search (BM25), merging results using Distribution-Based Score Fusion (DBSF). Both search components run in parallel, so hybrid latency is driven by the slower of the two (typically the vector component). See the Appendix for a note on fusion strategies and why we chose DBSF.

  • Vector Only: Pure semantic search using dense vector embeddings and approximate nearest neighbor (HNSW) graph traversal.

  • Elasticsearch Only: Pure keyword search using BM25 ranking.

  • Postgres FTS: Keyword search using PostgreSQL full-text search.

  • Knowledge Base sizes:

    • The 136k-chunk tier serves as the “baseline” and represents the initial state of the knowledge base, populated with a core set of financial documents.

    • For each subsequent tier (~200k, ~300k, and ~400k chunks), additional financial documents of varying formats and structures were ingested to progressively expand the knowledge base. These tiers correspond approximately to 12,000–18,000, 14,000–22,000, and 15,000–25,000 financial documents, respectively. While the primary purpose of these additions was to introduce noise and evaluate how retrieval performance scales with increasing corpus size, the ingested content remained representative of the document types commonly found in Financial Services Industry (FSI) knowledge bases. This ensured that evaluation conditions remained realistic and operationally relevant across all corpus sizes.

Measures

| Measure | Definition |
|---|---|
| Recall@k | The fraction of relevant (expected) chunks found within the top k retrieved results. Recall@200 = 0.95 means 95% of relevant chunks appear in the top 200. |
| MRR (Mean Reciprocal Rank) | Average of 1/rank, where rank is the position of the first relevant chunk. MRR = 0.50 means the first relevant chunk appears, on average, at position 2. Higher is better. |
| Latency | End-to-end wall-clock time for a single search query, including embedding generation (via cloud API), vector search, keyword search (for Hybrid mode), result merging, and full HTTP round-trip. |
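
The two quality measures defined above can be sketched in a few lines. This is a minimal illustration, not the benchmark harness itself: the function names and the toy chunk IDs are assumptions made for the example.

```python
from typing import Iterable, Sequence, Set, Tuple


def recall_at_k(ranked: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(ranked[:k]) & relevant) / len(relevant)


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk_id in enumerate(ranked, start=1):
        if chunk_id in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Iterable[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked, relevant) pairs, one per question."""
    runs = list(runs)
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


# Toy data with made-up chunk IDs (not benchmark data).
questions = [
    (["c1", "c7", "c3"], {"c1"}),  # first relevant chunk at rank 1
    (["c9", "c2", "c5"], {"c5"}),  # first relevant chunk at rank 3
]
print(mean_reciprocal_rank(questions))               # (1 + 1/3) / 2
print(recall_at_k(["c9", "c2", "c5"], {"c5"}, k=2))  # relevant chunk is outside the top 2
```

Because MRR averages reciprocals, it rewards a rank-1 hit far more than it penalizes a miss deep in the list, which is why it is reported alongside Recall@k rather than instead of it.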

Experiment Setup

  • Benchmarking environment: Dedicated test infrastructure matching production configuration

  • Search configuration: production-aligned search parameters:

| Parameter | Value |
|---|---|
| limit | 200 (chunks returned per search call) |
| hnsw_ef (Qdrant specific) | 256 (HNSW search beam width) |
| oversampling (Qdrant specific) | 2.0 (quantization pre-fetch multiplier) |
| rescore (Qdrant specific) | off |

For a detailed analysis of how limit and hnsw_ef interact and their impact on latency, see Appendix.

  • Runs: Each configuration tested once per question. Determinism verified in prior studies (Jaccard similarity >= 0.99 across repeated runs)

  • Golden chunks: Each benchmark question has a human-verified "correct" chunk that the system should retrieve

  • Total configurations: 3 datasets x 4 search modes x 4 KB sizes = 48 experiment runs, covering 253 benchmark questions (82 + 77 + 94) across the three datasets
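
As a rough illustration of the determinism check and the run count listed above, the sketch below computes a Jaccard similarity between two result sets and enumerates the configuration grid. The helper name and the two example runs are hypothetical; only the dataset, mode, and size labels come from this report.

```python
from itertools import product


def jaccard(a, b) -> float:
    """Jaccard similarity |A & B| / |A | B| between two collections of chunk IDs."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


datasets = ["HR", "Legal", "Due Diligence"]
modes = ["Hybrid (DBSF)", "Vector Only", "Elasticsearch Only", "Postgres FTS"]
kb_sizes = ["~136k", "~200k", "~300k", "~400k"]

configurations = list(product(datasets, modes, kb_sizes))
print(len(configurations))  # 48

# Two hypothetical runs of the same query that differ in one chunk out of 200.
run_a = [f"chunk-{i}" for i in range(200)]
run_b = run_a[:-1] + ["chunk-999"]
print(round(jaccard(run_a, run_b), 3))  # 199 shared / 201 total
```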

Multi-search-string retrieval and Recall@k

Agentic search does not issue a single search call per tool call: each tool call generates 4–6 search strings, and each search string triggers its own API call with limit=200. Results are interleaved and deduplicated into a single ranked list. Because several calls contribute results, the final pool per question is substantially larger than 200, though smaller than the number of search strings times 200 because the result sets partially overlap:

| Dataset | Avg. Search Strings per Tool Call | Avg. Unique Chunks Retrieved |
|---|---|---|
| HR | 4.6 | 429 |
| Legal | 5.6 | 502 |
| Due Diligence | 5.4 | 328 |

This is why recall curves in this report extend to k=500: limit controls each individual search call, not the total candidate pool. Recall@k is computed on the full aggregated list — the same list the production system feeds to the language model.
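
One way to picture the aggregation step is a round-robin merge that keeps the first occurrence of each chunk. The report only states that results are "interleaved and deduplicated", so the exact interleaving order used here is an assumption, and the function name and chunk IDs are hypothetical.

```python
from itertools import zip_longest
from typing import List, Sequence


def interleave_and_dedupe(result_lists: Sequence[Sequence[str]]) -> List[str]:
    """Round-robin across per-search-string result lists, keeping only the
    first occurrence of each chunk ID."""
    seen = set()
    merged: List[str] = []
    # zip_longest yields one "tier" per rank position; shorter lists are
    # padded with None, which is skipped below.
    for tier in zip_longest(*result_lists):
        for chunk_id in tier:
            if chunk_id is None or chunk_id in seen:
                continue
            seen.add(chunk_id)
            merged.append(chunk_id)
    return merged


# Three search strings with partially overlapping results (toy IDs).
per_string_results = [
    ["a", "b", "c"],
    ["b", "d"],
    ["a", "e", "f"],
]
print(interleave_and_dedupe(per_string_results))  # ['a', 'b', 'd', 'e', 'c', 'f']
```

With this kind of merge, Recall@k for k above 200 is well defined: k indexes into the merged list, not into any single search call's results.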


Results

Each sub-section below varies one independent variable while keeping the others fixed. This isolates the effect of each variable on our three measures (Recall@k, MRR, Latency).

Effect of Search Mode

Fixed: Baseline KB (~136k chunks), all three dataset types | Varies: Search mode (Hybrid vs Vector Only vs Elasticsearch Only vs Postgres FTS)

Retrieval Accuracy (Recall@k)

by_mode_recall_at_k_no_fts-20260513-142936.png

Averaged across all three datasets, Hybrid (DBSF) achieves the highest recall from k=5 onwards, reaching 98.0% by k=200. Vector Only follows at 95.7%. Elasticsearch BM25 plateaus lower at 93.2%. The gap between modes is widest in the mid-range (k=50–150): at k=50, Hybrid leads at 92.9% compared to 86.8% for Elasticsearch BM25, a 6.1 percentage point difference. By k=200, Hybrid and Vector Only both exceed 95%, while Elasticsearch BM25 remains approximately 5 points behind. Notably, Vector Only continues climbing beyond k=200, reaching 96.9% by k=500, suggesting it finds the relevant chunks reliably — just not as near the top of the ranking. 

Ranking Quality (MRR)

by_mode_mrr_agg_no_fts-20260513-154143.png

Elasticsearch BM25 ranks the relevant chunk highest on average (MRR = 0.595), driven by strong performance on HR (0.745) where keyword matching excels on structured policies. Hybrid (DBSF) follows closely (0.584), with consistent ranking quality across HR (0.712) and Legal (0.669). Vector Only (0.509) ranks lowest across all datasets. The narrow gap between Elasticsearch BM25 and Hybrid reflects that both benefit from lexical precision — Hybrid's combined signal yields comparable ranking quality while also providing stronger recall coverage. 

Latency

image-20260519-074819.png

Elasticsearch BM25 is the fastest mode at 215 ms on average, roughly 4× faster than other search modes. Vector Only (844 ms) and Hybrid (DBSF) (837 ms) are virtually identical in latency — Hybrid is marginally faster in these measurements, confirming that keyword and vector search run in parallel and wall-clock time is governed by the slower Qdrant component. Enabling Hybrid search therefore carries no meaningful latency cost over Vector Only.


Effect of Dataset Type

Fixed: Baseline KB (~136k chunks) | Varies: Dataset (HR, Legal, Due Diligence)

Retrieval Accuracy (Recall@k)

by_dataset_recall-20260513-125941.png

All three datasets reach high recall (>97%) by k=200, but differ in how quickly they climb. HR is fastest, reaching 96.3% by k=50 thanks to its clear policy structure. Legal follows closely, jumping from 84.4% at k=10 to 96.1% at k=50 as its formal, structured language is well-suited to retrieval. Due Diligence starts lower (25.5% at k=1, 86.2% at k=50) but converges to 97.9% by k=200 — the mixed-format content (tables, charts) requires more candidates before the relevant chunk surfaces. All three plateau beyond k=200.

Ranking Quality (MRR)

by_dataset_mrr-20260513-155902.png

MRR reveals larger differences between datasets than recall. HR leads with MRR = 0.712, meaning the relevant chunk typically appears at position 1–2. Legal follows at 0.669 (also position 1–2), and Due Diligence at 0.371 (around position 3). The pattern reflects document structure: well-organized policies (HR) are easiest to rank correctly, while diverse financial documents with mixed formats (Due Diligence) present a greater ranking challenge.
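
The "typical position" readings above follow from inverting MRR. A small sketch of that arithmetic, using the MRR values quoted in this section; note that 1/MRR is the harmonic mean of the first-relevant ranks, so it is an approximation of the typical position rather than the arithmetic average rank. The helper name is hypothetical.

```python
def typical_position(mrr: float) -> float:
    """Approximate rank of the first relevant chunk, taken as 1/MRR
    (the harmonic mean of the per-question ranks)."""
    return 1.0 / mrr


mrr_by_dataset = {"HR": 0.712, "Legal": 0.669, "Due Diligence": 0.371}

for name, mrr in mrr_by_dataset.items():
    print(f"{name}: MRR={mrr:.3f} -> position ~{typical_position(mrr):.2f}")
```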

Latency

by_dataset_latency_kb_scaling_rerun-20260520-072800.png

Search latency varies modestly by dataset, with all three completing in the 826–872 ms range per search string. HR is the slowest at 872 ms (16 documents), followed by Legal at 850 ms (56 documents), and Due Diligence at 826 ms (9 documents). The HR and Legal corpora tend to contain longer, denser documents, which may result in slightly longer embedding and retrieval times compared to the more concise Due Diligence summaries. All latencies are end-to-end, including embedding generation, vector search, keyword search, result merging, and HTTP round-trip.


Effect of Knowledge Base Size

Fixed: Hybrid (DBSF) search | Varies: KB size (Baseline ~136k to ~200k to ~300k to ~400k)

Note: The KB scaling results in this section were produced in a separate evaluation run from the baseline results reported in Effect of Search Mode (§3.1) and Effect of Dataset Type (§3.2). Both runs used identical configuration. MRR is sensitive to the exact rank position of the first relevant chunk, so small shifts in ranking order between runs can produce minor differences in absolute MRR values between this section and §3.1 and §3.2. Recall@200 is consistent across runs; small differences at lower k values are within normal run-to-run variance.

Retrieval Accuracy (Recall@k)

image-20260520-065256.png

Recall remains stable as the knowledge base grows. Across all three datasets, Recall@200 holds between 97.4% and 98.7% from the baseline (~136k chunks) through to the largest tier (~400k chunks), with each dataset shifting by less than 0.5 percentage points while the collection roughly triples in size. The pattern is consistent at lower retrieval depths as well, with all tiers converging from k=50 onwards. The system reliably surfaces the relevant chunk within the top 200 results regardless of how much the knowledge base has grown.

Ranking Quality (MRR)

image-20260520-065941.png

MRR shows a minimal decline across all datasets as the KB grows, dropping from 0.566 at baseline to 0.555 at ~400k, a difference of 0.011 over the full scaling range. The additional documents introduced at each KB tier do not meaningfully push relevant results further down the ranking: the hybrid scoring continues to place them near the top with the same consistency as at baseline. For HR and Legal, the relevant chunk typically appears within the first 1–2 positions; for Due Diligence it appears around position 3 (MRR ≈ 0.36), reflecting the greater retrieval challenge of mixed-format financial documents rather than any KB scaling effect.

| Dataset | MRR Change (baseline → ~400k) |
|---|---|
| HR | -0.016 |
| Legal | -0.013 |
| Due Diligence | -0.003 |

None of the datasets shows meaningful degradation. For context on absolute MRR levels and per-dataset ranking characteristics, see Effect of Dataset Type (§3.2).

Latency

image-20260520-065151.png

Latency increases gradually as the knowledge base grows, rising from ~849 ms per search string at baseline (~136k chunks) to ~892 ms at ~400k chunks — a total increase of approximately 5% across three tiers. The growth is consistent and monotonic, with each tier adding roughly 1–3% overhead over the previous.

The pattern is consistent across all three datasets (HR, Legal, Due Diligence), with absolute per-search-string latencies remaining in the 750–950 ms range throughout.


Conclusion

We return to the five questions that motivated this evaluation:

Does the system find the right documents?

Yes. With Hybrid search, the system retrieves the correct document chunk within the top 200 results 97–99% of the time across all three datasets. Even with Vector-Only search, recall exceeds 95%. The retrieval layer reliably identifies relevant content regardless of document type.

How much context window do we need?

image-20260520-083814.png

50k tokens for standard documents, 75k for complex ones. HR and Legal plateau at 50k tokens (97.6–98.7% recall), with no gain from increasing the budget further. Due Diligence requires 75k to reach comparable levels (96.8%), as its mixed-format content tends to rank the relevant chunk lower, requiring more context before it is reached. A 75k budget covers all document types without significant overhead.
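
To make the link between ranking position and token budget concrete, the sketch below fills a budget by walking the ranked list in order and checks whether the golden chunk made it in. The truncation rule (stop at the first chunk that would overflow), the function name, and the per-chunk token counts are all assumptions for illustration; the report does not specify how the production system applies the budget.

```python
from typing import Iterable, List, Tuple


def chunks_within_budget(ranked: Iterable[Tuple[str, int]], budget: int) -> List[str]:
    """Walk the ranked list and keep chunks until the next one would exceed
    the token budget. Each item is (chunk_id, token_count)."""
    kept: List[str] = []
    used = 0
    for chunk_id, tokens in ranked:
        if used + tokens > budget:
            break
        kept.append(chunk_id)
        used += tokens
    return kept


# Toy ranked list: the golden chunk sits at rank 3 (made-up token counts).
ranked = [("c1", 400), ("c2", 350), ("golden", 300), ("c4", 500)]

print("golden" in chunks_within_budget(ranked, budget=1000))  # False: cut off after c2
print("golden" in chunks_within_budget(ranked, budget=1500))  # True: fits at 1,050 tokens
```

The same mechanism explains the Due Diligence result: when the relevant chunk ranks lower, more tokens are consumed by the chunks ahead of it before it is reached.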

Should we enable Hybrid search?

Yes, recommended as the default. Hybrid search provides consistent quality gains over Vector Only, particularly in ranking quality (MRR), where improvements range from +5 to +9 points depending on the dataset. It adds minimal latency because keyword and vector search run in parallel. There is no meaningful downside to enabling it.

Can we grow the knowledge base without losing accuracy?

Yes: recall remains stable across all KB sizes. All three datasets maintain Recall@200 at or above 97.4% as the knowledge base grows from ~136k to ~400k chunks, with no meaningful degradation. MRR shows a small but consistent decline of 0.011 across the full scaling range, which is expected as more candidates compete for top ranking positions. The system scales gracefully without sacrificing retrieval quality.

How fast is it?

~0.9–1.2 seconds per search query at a limit of 200 chunks, end-to-end. This includes embedding generation via cloud API, vector search, keyword search (for Hybrid), result merging, and full HTTP round-trip. Latency scales gradually with KB size, adding approximately 5% overhead when growing from ~136k to ~400k chunks.

Practical Recommendations

| Recommendation | Detail |
|---|---|
| Default search mode | Enable Hybrid (DBSF): better quality, negligible latency cost |
| Token budget | 50k tokens for standard documents, 75k for complex/diverse content |
| Scalability | KB can grow to ~400k chunks (the largest size tested) with stable recall on well-structured documents |


Appendix

How limit and hnsw_ef Interact

Qdrant internally computes the effective HNSW beam width as:

effective_ef = max(hnsw_ef, limit)

This means hnsw_ef only controls search quality when it exceeds limit. If limit is larger, it silently becomes the beam width — an HNSW search cannot return more results than candidates it explored.

| hnsw_ef | limit | effective_ef | hnsw_ef controls? |
|---|---|---|---|
| 256 | 100 | 256 | Yes |
| 256 | 200 | 256 | Yes |
| 256 | 500 | 500 | No (limit dominates) |
| 256 | 1000 | 1000 | No (limit dominates) |
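
The table above can be reproduced with a one-line helper. This mirrors the rule as stated in this report; it is a sketch for reasoning about parameter choices, not Qdrant's own code, and the function name is hypothetical.

```python
def effective_ef(hnsw_ef: int, limit: int) -> int:
    """Beam width actually used by the search, per the rule described above."""
    return max(hnsw_ef, limit)


for limit in (100, 200, 500, 1000):
    ef = effective_ef(hnsw_ef=256, limit=limit)
    # hnsw_ef only governs the search while it is at least as large as limit.
    controlling = "hnsw_ef" if limit <= 256 else "limit"
    print(f"hnsw_ef=256 limit={limit:<5} effective_ef={ef:<5} controlled by {controlling}")
```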

Latency impact. The following single-variable benchmarks use the Legal dataset (134K chunks) with full end-to-end latency (HTTP round-trip, embedding, search, serialization).

Varying limit (hnsw_ef fixed at 128):

| limit | effective_ef | Mean latency |
|---|---|---|
| 10 | 128 | 1,091 ms |
| 100 | 128 | 1,340 ms |
| 200 | 200 | 2,954 ms |
| 500 | 500 | 6,097 ms |
| 1000 | 1000 | 9,495 ms |

While limit <= hnsw_ef, latency stays roughly constant. The moment limit exceeds hnsw_ef, latency jumps: limit=200 is 2.2x slower than limit=100, and limit=1000 is 7.1x slower.
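
The slowdown factors quoted above are ratios against the limit=100 measurement. A short sketch of that arithmetic, using the mean latencies from the table; the helper name is hypothetical.

```python
# Mean latencies in ms from the "Varying limit" table (hnsw_ef fixed at 128).
latency_ms = {10: 1091, 100: 1340, 200: 2954, 500: 6097, 1000: 9495}


def ratio_to_reference(limit: int, reference_limit: int = 100) -> float:
    """Latency at `limit` divided by latency at `reference_limit`."""
    return latency_ms[limit] / latency_ms[reference_limit]


for limit in latency_ms:
    print(f"limit={limit:>4}: {ratio_to_reference(limit):.2f}x the limit=100 latency")
```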

Varying hnsw_ef (limit fixed at 100):

| hnsw_ef | effective_ef | Mean latency |
|---|---|---|
| 128 | 128 | 1,459 ms |
| 256 | 256 | 2,214 ms |
| 512 | 512 | 3,534 ms |
| 1024 | 1024 | 5,599 ms |

Doubling the beam width adds roughly 50–60% latency at each step (1,459 → 2,214 → 3,534 → 5,599 ms).
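
The per-step increase can be checked directly from the measurements in the table above. A minimal sketch; the helper name is hypothetical.

```python
# Mean latencies in ms from the "Varying hnsw_ef" table (limit fixed at 100).
latency_ms = {128: 1459, 256: 2214, 512: 3534, 1024: 5599}


def step_increases(latencies: dict) -> dict:
    """Relative latency increase for each doubling of hnsw_ef."""
    efs = sorted(latencies)
    return {
        (prev, cur): latencies[cur] / latencies[prev] - 1
        for prev, cur in zip(efs, efs[1:])
    }


for (prev, cur), increase in step_increases(latency_ms).items():
    print(f"hnsw_ef {prev} -> {cur}: +{increase:.1%}")
```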

Parallel search string overhead. The system sends multiple search strings per question concurrently. The cost of this parallelism is sub-linear: wall-clock time grows with the number of concurrent searches, but much less than linearly:

| Parallel searches | Overhead vs single |
|---|---|
| 1–3 | 5–21% |
| 4–5 | 51–63% |
| 7–10 | 85–156% |

With 4–5 search strings per question (our typical range), the wall-clock time for the full retrieval step is roughly 1.5–1.6x a single search call, not 4–5x.
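
A minimal sketch of issuing search strings concurrently, plus the wall-clock estimate implied by the overhead figures above. The `fake_search` coroutine is a stand-in that sleeps instead of calling a real endpoint, so its sleeps overlap perfectly and it does not reproduce the measured overhead; the function names, the example queries, and the 850 ms single-call figure are illustrative assumptions.

```python
import asyncio
from typing import List


async def fake_search(search_string: str, delay_s: float = 0.05) -> List[str]:
    """Stand-in for one search API call (hypothetical; no network involved)."""
    await asyncio.sleep(delay_s)
    return [f"{search_string}-hit-{i}" for i in range(3)]


async def search_all(search_strings: List[str]) -> List[List[str]]:
    """Issue all search strings concurrently; results keep the input order."""
    return await asyncio.gather(*(fake_search(s) for s in search_strings))


def estimated_wall_clock_ms(single_call_ms: float, overhead: float) -> float:
    """One call's latency scaled by the measured parallel overhead."""
    return single_call_ms * (1 + overhead)


results = asyncio.run(search_all(["parental leave", "leave policy", "maternity"]))
print(len(results), [len(r) for r in results])

# With 4-5 concurrent search strings the table above reports 51-63% overhead.
low = estimated_wall_clock_ms(850, 0.51)
high = estimated_wall_clock_ms(850, 0.63)
print(f"estimated retrieval step: {low:.0f}-{high:.0f} ms")
```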

Fusion Strategies

We evaluated three strategies for merging vector and keyword search results in Hybrid mode:

| Strategy | Description |
|---|---|
| Interleaved (ZIP) | Alternates results from each search type based on a weight ratio |
| Reciprocal Rank Fusion (RRF) | Assigns scores based on rank position in each list |
| Distribution-Based Score Fusion (DBSF) | Normalizes raw scores from each search type and computes a weighted average |

All three strategies produce comparable Recall@200 — the choice of merge strategy does not significantly affect whether the relevant chunk is found in the top 200 results. Differences are within 1 percentage point across all datasets.

DBSF provides a modest advantage in ranking quality (MRR), meaning the relevant chunk tends to appear slightly earlier in the result list. This is why DBSF is our recommended default. However, the fusion strategy is not a critical lever — retrieval quality is driven primarily by the underlying search engines, not by how their results are merged.
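
For intuition, the sketch below contrasts RRF and DBSF on toy data. It is an illustration under stated assumptions, not the production implementation: the mean ± 3 standard deviations normalization bounds, the equal weights, the RRF constant k=60, the treatment of chunks missing from one list (score 0), and all names and scores are assumptions for the example.

```python
from statistics import mean, pstdev
from typing import Dict, List, Sequence


def dbsf_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale raw scores to [0, 1] using mean +/- 3 standard deviations as bounds."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {cid: 0.5 for cid in scores}
    low, high = mu - 3 * sigma, mu + 3 * sigma
    return {cid: min(1.0, max(0.0, (s - low) / (high - low))) for cid, s in scores.items()}


def dbsf_fuse(vector: Dict[str, float], keyword: Dict[str, float],
              vector_weight: float = 0.5) -> List[str]:
    """Weighted average of normalized scores; a missing chunk contributes 0."""
    v, k = dbsf_normalize(vector), dbsf_normalize(keyword)
    fused = {
        cid: vector_weight * v.get(cid, 0.0) + (1 - vector_weight) * k.get(cid, 0.0)
        for cid in set(v) | set(k)
    }
    return sorted(fused, key=fused.get, reverse=True)


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Sum 1 / (k + rank) over every list in which a chunk appears."""
    fused: Dict[str, float] = {}
    for ranked in ranked_lists:
        for rank, cid in enumerate(ranked, start=1):
            fused[cid] = fused.get(cid, 0.0) + 1.0 / (k + rank)
    return sorted(fused, key=fused.get, reverse=True)


# Toy scores on different scales: cosine similarity vs BM25 (made-up values).
vector_scores = {"a": 0.92, "b": 0.85, "c": 0.40}
keyword_scores = {"b": 12.1, "c": 9.3, "d": 3.0}

print(dbsf_fuse(vector_scores, keyword_scores))
print(rrf_fuse([["a", "b", "c"], ["b", "c", "d"]]))
```

The design difference is that RRF looks only at rank positions, while DBSF keeps score magnitudes after putting both engines on a common scale, so a large score gap between two neighbours can still influence the fused order.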
