Retrieval Performance and Scalability Evaluation
Key Takeaways
Accuracy: The system retrieves the correct document chunk within the top 200 results 97.6–98.7% of the time across all tested datasets (Hybrid search, baseline KB).
Token Budget: A 50k token context window captures 97%+ of relevant chunks for well-structured documents. Complex financial documents benefit from a 75k token budget.
Hybrid Search: Combining vector and keyword search consistently improves ranking quality by +5 to +9 Mean Reciprocal Rank (MRR) points over vector-only, with no meaningful latency penalty. Recommended as default.
Scalability: Retrieval quality remains stable as the knowledge base grows from approximately 136k chunks (representing ~8,000–15,000 financial documents) to 400k chunks (representing ~15,000–25,000 financial documents).
Latency: A single search query completes in ~0.9–1.2 seconds end-to-end (including embedding generation and network round-trip).
Introduction
Retrieval is the foundation of Retrieval-Augmented Generation (RAG). Before the language model can generate an answer, the retrieval layer must find the right document chunks from a potentially large knowledge base. The quality of the AI's response depends directly on the quality of this retrieval step — if the relevant information is not retrieved, no amount of language model sophistication can compensate.
We conducted a systematic evaluation of our retrieval layer to answer five practical questions:
Does the system find the right documents? — How accurately does retrieval identify relevant content?
How much context window do we need? — What token budget is sufficient to capture relevant information?
Should we enable Hybrid search? — Does combining keyword and vector search improve results?
Can we grow the knowledge base without losing accuracy? — How does retrieval quality scale?
How fast is it? — What latency can users expect for a single search query?
We benchmarked across three datasets representing different document types commonly found in financial services, using four search modes and four knowledge base sizes ranging from ~136,000 to ~400,000 chunks.
Methodology
Datasets
We benchmarked across three datasets representing different document types and complexity levels relevant to the financial services industry:
| Dataset | Description | Questions | Documents | Pages | Chunks |
|---|---|---|---|---|---|
| HR | Human resources policies and employment guidelines. Semi-structured corporate documents typical of internal knowledge bases. | 82 | 16 | 370 | 588 |
| Legal | EU regulatory documents: directives, regulations, and compliance texts. Highly structured with formal language, standard in regulatory and compliance workflows. | 77 | 56 | 1,108 | 2,484 |
| Due Diligence | Financial reports, investment documents, and mixed-format content including tables and charts. Representative of research and due diligence processes. | 94 | 9 | 299 | 393 |
These datasets cover the spectrum of document types typically encountered in enterprise knowledge bases: highly structured regulatory text, semi-structured corporate policies, and complex mixed-format financial documents.
Independent Variables
| Variable | Levels |
|---|---|
| Dataset type | HR, Legal, Due Diligence |
| Search mode | Hybrid (DBSF), Vector Only, Elasticsearch Only, Postgres FTS |
| Knowledge base size | ~136k, ~200k, ~300k, ~400k chunks |
Hybrid: Combines dense vector search with keyword search (BM25), merging results using Distribution-Based Score Fusion (DBSF). Both search components run in parallel, so hybrid latency is driven by the slower of the two (typically the vector component). See the Appendix for a note on fusion strategies and why we chose DBSF.
Vector Only: Pure semantic search using dense vector embeddings and approximate nearest neighbor (HNSW) graph traversal.
Elasticsearch Only: Keyword search (BM25).
Postgres FTS: Keyword search using PostgreSQL full-text search.
Knowledge Base sizes:
The 136k-chunk tier serves as the “baseline” and represents the initial state of the knowledge base, populated with a core set of financial documents.
For each subsequent tier (~200k, ~300k, and ~400k chunks), additional financial documents of varying formats and structures were ingested to progressively expand the knowledge base. These tiers correspond approximately to 12,000–18,000, 14,000–22,000, and 15,000–25,000 financial documents, respectively. While the primary purpose of these additions was to introduce noise and evaluate how retrieval performance scales with increasing corpus size, the ingested content remained representative of the document types commonly found in Financial Services Industry (FSI) knowledge bases. This ensured that evaluation conditions remained realistic and operationally relevant across all corpus sizes.
Measures
| Measure | Definition |
|---|---|
| Recall@k | The fraction of relevant (expected) chunks found within the top k retrieved results. Recall@200 = 0.95 means 95% of relevant chunks appear in the top 200. |
| MRR (Mean Reciprocal Rank) | Average of 1/rank over all questions, where rank is the position of the first relevant chunk. MRR = 0.50 corresponds to the first relevant chunk appearing at around position 2. Higher is better. |
| Latency | End-to-end wall-clock time for a single search query, including embedding generation (via cloud API), vector search, keyword search (for Hybrid mode), result merging, and full HTTP round-trip. |
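The two quality measures defined above can be sketched in a few lines. This is an illustrative implementation, not the evaluation harness used for this report; the function names and the toy chunk IDs are assumptions.

```python
from typing import Sequence, Set, Tuple


def recall_at_k(ranked_ids: Sequence[str], relevant_ids: Set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k results."""
    if not relevant_ids:
        return 0.0
    top_k = set(ranked_ids[:k])
    return len(top_k & relevant_ids) / len(relevant_ids)


def reciprocal_rank(ranked_ids: Sequence[str], relevant_ids: Set[str]) -> float:
    """1/rank of the first relevant chunk, or 0.0 if none was retrieved."""
    for position, chunk_id in enumerate(ranked_ids, start=1):
        if chunk_id in relevant_ids:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Tuple[Sequence[str], Set[str]]]) -> float:
    """Average reciprocal rank over (ranked_ids, relevant_ids) pairs."""
    return sum(reciprocal_rank(ranked, gold) for ranked, gold in runs) / len(runs)


# Toy data: two questions with one golden chunk each (hypothetical IDs).
runs = [
    (["c7", "c2", "c9"], {"c2"}),  # golden chunk at position 2, reciprocal rank 0.5
    (["c4", "c1", "c3"], {"c4"}),  # golden chunk at position 1, reciprocal rank 1.0
]
print(mean_reciprocal_rank(runs))               # 0.75
print(recall_at_k(runs[0][0], runs[0][1], k=1))  # 0.0
```

Because MRR averages reciprocal ranks rather than ranks, a single question with the golden chunk at position 1 pulls the mean up more than a question at position 10 pulls it down.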
Experiment Setup
Benchmarking environment: Dedicated test infrastructure matching production configuration
Search configuration: production-aligned search parameters:
| Parameter | Value |
|---|---|
| limit | 200 (chunks returned per search call) |
| hnsw_ef | 256 (HNSW search beam width) |
| | 2.0 (quantization pre-fetch multiplier) |
| | off |
For a detailed analysis of how limit and hnsw_ef interact and their impact on latency, see Appendix.
Runs: Each configuration was tested once per question. Determinism was verified in prior studies (Jaccard similarity >= 0.99 across repeated runs).
Golden chunks: Each benchmark question has a human-verified "correct" chunk that the system should retrieve.
Total configurations: 3 datasets x 4 search modes x 4 KB sizes = 48 experiment runs. Each combination of search mode and KB size covers all 253 benchmark questions (82 HR + 77 Legal + 94 Due Diligence).
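The determinism check mentioned above compares the result sets of repeated runs of the same query. A minimal sketch of the Jaccard similarity, assuming results are compared as sets of chunk IDs (the IDs below are made up):

```python
from typing import Iterable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity: size of the intersection divided by size of the union."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty result sets are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


run_1 = ["c1", "c2", "c3", "c4"]
run_2 = ["c2", "c1", "c3", "c5"]  # same query repeated; one chunk differs
print(jaccard(run_1, run_2))  # 0.6
```

Note that this measure ignores ordering: two runs that return the same chunks in a different order still score 1.0.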
Multi-search-string retrieval and Recall@k
Agentic search does not issue a single search call per tool call: each tool call generates 4–6 search strings, and each search string triggers its own API call with limit=200. Results are interleaved and deduplicated into a single ranked list. Because the search strings return only partially overlapping results, the final pool per question is substantially larger than 200:
| Dataset | Avg. Search Strings per Tool Call | Avg. Unique Chunks Retrieved |
|---|---|---|
| HR | 4.6 | 429 |
| Legal | 5.6 | 502 |
| Due Diligence | 5.4 | 328 |
This is why recall curves in this report extend to k=500: limit controls each individual search call, not the total candidate pool. Recall@k is computed on the full aggregated list — the same list the production system feeds to the language model.
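The interleave-and-deduplicate step can be sketched as follows. The exact interleaving order used in production is not specified in this report; the sketch assumes a simple round-robin over the per-search-string result lists, keeping the first occurrence of each chunk.

```python
from itertools import zip_longest
from typing import List, Sequence


def interleave_dedupe(result_lists: Sequence[Sequence[str]]) -> List[str]:
    """Round-robin merge of ranked lists, keeping the first occurrence of each ID."""
    merged: List[str] = []
    seen = set()
    # zip_longest yields rank 1 of every list, then rank 2, and so on,
    # padding shorter lists with None.
    for tier in zip_longest(*result_lists):
        for chunk_id in tier:
            if chunk_id is not None and chunk_id not in seen:
                seen.add(chunk_id)
                merged.append(chunk_id)
    return merged


per_string_results = [
    ["a", "b", "c"],
    ["b", "d"],
    ["a", "e", "f"],
]
print(interleave_dedupe(per_string_results))  # ['a', 'b', 'd', 'e', 'c', 'f']
```

Three lists with eight entries in total yield six unique chunks here, which mirrors why 4–6 calls at limit=200 produce a pool of a few hundred chunks rather than 800–1,200.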
Results
Each sub-section below varies one independent variable while keeping the others fixed. This isolates the effect of each variable on our three measures (Recall@k, MRR, Latency).
Effect of Search Mode
Fixed: Baseline KB (~136k chunks), averaged across dataset types | Varies: Search mode (Hybrid vs. Vector Only vs. Elasticsearch Only vs. Postgres FTS)
Retrieval Accuracy (Recall@k)
Averaged across all three datasets, Hybrid (DBSF) achieves the highest recall from k=5 onwards, reaching 98.0% by k=200. Vector Only follows at 95.7%. Elasticsearch BM25 plateaus lower at 93.2%. The gap between modes is widest in the mid-range (k=50–150): at k=50, Hybrid leads at 92.9% compared to 86.8% for Elasticsearch BM25, a 6.1 percentage point difference. By k=200, Hybrid and Vector Only both exceed 95%, while Elasticsearch BM25 remains approximately 5 points behind. Notably, Vector Only continues climbing beyond k=200, reaching 96.9% by k=500, suggesting it finds the relevant chunks reliably — just not as near the top of the ranking.
Ranking Quality (MRR)
Elasticsearch BM25 ranks the relevant chunk highest on average (MRR = 0.595), driven by strong performance on HR (0.745) where keyword matching excels on structured policies. Hybrid (DBSF) follows closely (0.584), with consistent ranking quality across HR (0.712) and Legal (0.669). Vector Only (0.509) ranks lowest across all datasets. The narrow gap between Elasticsearch BM25 and Hybrid reflects that both benefit from lexical precision — Hybrid's combined signal yields comparable ranking quality while also providing stronger recall coverage.
Latency
Elasticsearch BM25 is the fastest mode at 215 ms on average, roughly 4× faster than other search modes. Vector Only (844 ms) and Hybrid (DBSF) (837 ms) are virtually identical in latency — Hybrid is marginally faster in these measurements, confirming that keyword and vector search run in parallel and wall-clock time is governed by the slower Qdrant component. Enabling Hybrid search therefore carries no meaningful latency cost over Vector Only.
Effect of Dataset Type
Fixed: Baseline KB (~136k chunks), Hybrid (DBSF) search | Varies: Dataset (HR, Legal, Due Diligence)
Retrieval Accuracy (Recall@k)
All three datasets reach high recall (>97%) by k=200, but differ in how quickly they climb. HR is fastest, reaching 96.3% by k=50 thanks to its clear policy structure. Legal follows closely, jumping from 84.4% at k=10 to 96.1% at k=50 as its formal, structured language is well-suited to retrieval. Due Diligence starts lower (25.5% at k=1, 86.2% at k=50) but converges to 97.9% by k=200 — the mixed-format content (tables, charts) requires more candidates before the relevant chunk surfaces. All three plateau beyond k=200.
Ranking Quality (MRR)
MRR reveals larger differences between datasets than recall. HR leads with MRR = 0.712, meaning the relevant chunk typically appears at position 1–2. Legal follows at 0.669 (position 2), and Due Diligence at 0.371 (position 3). The pattern reflects document structure: well-organized policies (HR) are easiest to rank correctly, while diverse financial documents with mixed formats (Due Diligence) present a greater ranking challenge.
Latency
Search latency varies modestly by dataset, with all three completing in the 826–872 ms range per search string. HR is the slowest at 872 ms (16 documents), followed by Legal at 850 ms (56 documents), and Due Diligence at 826 ms (9 documents). The HR and Legal corpora tend to contain longer, denser documents, which may result in slightly longer embedding and retrieval times compared to the more concise Due Diligence summaries. All latencies are end-to-end, including embedding generation, vector search, keyword search, result merging, and HTTP round-trip.
Effect of Knowledge Base Size
Fixed: Hybrid (DBSF) search | Varies: KB size (Baseline ~136k to ~200k to ~300k to ~400k)
Note: The KB scaling results in this section were produced in a separate evaluation run from the baseline results reported in the Effect of Search Mode and Effect of Dataset Type sections. Both runs used identical configuration. MRR is sensitive to the exact rank position of the first relevant chunk, so small shifts in ranking order between runs can produce minor differences in absolute MRR values between this section and those two sections. Recall@200 is consistent across runs; small differences at lower k values are within normal run-to-run variance.
Retrieval Accuracy (Recall@k)
Recall remains stable as the knowledge base grows. Across all three datasets, Recall@200 holds between 97.4% and 98.7% from the baseline (~136k chunks) through to the largest tier (~400k chunks), a difference of less than 0.5 percentage points between the smallest and largest tiers even though the collection nearly triples in size. The pattern is consistent across lower retrieval depths as well, with all tiers converging from k=50 onwards. The system reliably surfaces the relevant chunk within the top 200 results regardless of how much the knowledge base has grown.
Ranking Quality (MRR)
Averaged across datasets, MRR shows a minimal decline as the KB grows, dropping from 0.566 at baseline to 0.555 at ~400k, a difference of 0.011 over the full scaling range. The additional documents introduced at each KB tier do not meaningfully push relevant results further down the ranking: the hybrid scoring continues to place them near the top with the same consistency as at baseline. For HR and Legal, the relevant chunk typically appears within the first 1–2 positions; for Due Diligence it appears around position 3 (MRR ≈ 0.36), reflecting the greater retrieval challenge of mixed-format financial documents rather than any KB scaling effect.
| Dataset | MRR Change (baseline → ~400k) |
|---|---|
| HR | -0.016 |
| Legal | -0.013 |
| Due Diligence | -0.003 |
None of the datasets show meaningful degradation. For context on absolute MRR levels and per-dataset ranking characteristics, see the Effect of Dataset Type section.
Latency
Latency increases gradually as the knowledge base grows, rising from ~849 ms per search string at baseline (~136k chunks) to ~892 ms at ~400k chunks — a total increase of approximately 5% across three tiers. The growth is consistent and monotonic, with each tier adding roughly 1–3% overhead over the previous.
The pattern is consistent across all three datasets (HR, Legal, Due Diligence), with absolute per-search-string latencies remaining in the 750–950 ms range throughout.
Conclusion
We return to the five questions that motivated this evaluation:
Does the system find the right documents?
Yes. With Hybrid search, the system retrieves the correct document chunk within the top 200 results 97–99% of the time across all three datasets. Even with Vector-Only search, recall exceeds 95%. The retrieval layer reliably identifies relevant content regardless of document type.
How much context window do we need?
50k tokens for standard documents, 75k for complex ones. HR and Legal plateau at 50k tokens (97.6–98.7% recall), with no gain from increasing the budget further. Due Diligence requires 75k to reach comparable levels (96.8%), as its mixed-format content tends to rank the relevant chunk lower, requiring more context before it is reached. A 75k budget covers all document types without significant overhead.
Should we enable Hybrid search?
Yes — recommended as the default. Hybrid search provides consistent quality gains, particularly in ranking quality (MRR), where improvements range from +5 to +9 points depending on the dataset. It adds minimal latency because keyword and vector search run in parallel. There is no meaningful downside to enabling it.
Can we grow the knowledge base without losing accuracy?
Yes, recall remains stable across all KB sizes. All three datasets maintain Recall@200 at or above 97.4% as the knowledge base grows from ~136k to ~400k chunks, with no meaningful degradation. MRR shows a small but consistent decline of 0.011 across the full scaling range, which is expected as more candidates compete for top ranking positions. The system scales gracefully without sacrificing retrieval quality.
How fast is it?
~0.9–1.2 seconds per search query with a limit of 200 chunks, end-to-end. This includes embedding generation via cloud API, vector search, keyword search (for Hybrid), result merging, and full HTTP round-trip. Latency scales gradually with KB size, adding approximately 5% overhead when growing from ~136k to ~400k chunks.
Practical Recommendations
| Recommendation | Detail |
|---|---|
| Default search mode | Enable Hybrid (DBSF): better quality, negligible latency cost |
| Token budget | 50k tokens for standard documents, 75k for complex/diverse content |
| Scalability | KB can grow to 400k+ chunks with stable recall on well-structured documents |
Appendix
How limit and hnsw_ef Interact
Qdrant internally computes the effective HNSW beam width as:
effective_ef = max(hnsw_ef, limit)
This means hnsw_ef only controls search quality when it exceeds limit. If limit is larger, it silently becomes the beam width, because an HNSW search cannot return more results than the candidates it explored.
| hnsw_ef | limit | effective_ef | hnsw_ef controls? |
|---|---|---|---|
| 256 | 100 | 256 | Yes |
| 256 | 200 | 256 | Yes |
| 256 | 500 | 500 | No (limit dominates) |
| 256 | 1000 | 1000 | No (limit dominates) |
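The rule behind the table above is a one-liner. The sketch below only restates the max() relationship described in this appendix; the helper names are illustrative and not part of any Qdrant client API.

```python
def effective_ef(hnsw_ef: int, limit: int) -> int:
    """Beam width actually used by the search, following the max() rule above."""
    return max(hnsw_ef, limit)


def hnsw_ef_controls(hnsw_ef: int, limit: int) -> bool:
    """True when hnsw_ef, not limit, determines the beam width."""
    return hnsw_ef >= limit


# Reproduce the rows of the table for hnsw_ef = 256.
for limit in (100, 200, 500, 1000):
    print(limit, effective_ef(256, limit), hnsw_ef_controls(256, limit))
```

With the production settings in this report (limit=200, hnsw_ef=256), hnsw_ef remains the controlling parameter.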
Latency impact. The following single-variable benchmarks use the Legal dataset (134K chunks) with full end-to-end latency (HTTP round-trip, embedding, search, serialization).
Varying limit (hnsw_ef fixed at 128):
| limit | effective_ef | Mean latency |
|---|---|---|
| 10 | 128 | 1,091 ms |
| 100 | 128 | 1,340 ms |
| 200 | 200 | 2,954 ms |
| 500 | 500 | 6,097 ms |
| 1000 | 1000 | 9,495 ms |
As long as limit <= hnsw_ef, latency grows only modestly (1,091 ms to 1,340 ms). Once limit exceeds hnsw_ef, latency jumps: limit=200 takes 2.2x as long as limit=100, and limit=1000 takes 7.1x as long.
Varying hnsw_ef (limit fixed at 100):
| hnsw_ef | effective_ef | Mean latency |
|---|---|---|
| 128 | 128 | 1,459 ms |
| 256 | 256 | 2,214 ms |
| 512 | 512 | 3,534 ms |
| 1024 | 1024 | 5,599 ms |
Each doubling of the beam width adds roughly 50–60% latency (52%, 60%, and 58% across the three steps).
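The ratios quoted for the two benchmark tables above can be recomputed directly from the reported mean latencies. The sketch below is only arithmetic on those published numbers; the dictionary and function names are illustrative.

```python
from typing import Dict, List

# Mean latencies in milliseconds, copied from the two tables above.
limit_latency_ms = {10: 1091, 100: 1340, 200: 2954, 500: 6097, 1000: 9495}
ef_latency_ms = {128: 1459, 256: 2214, 512: 3534, 1024: 5599}


def slowdown(latencies: Dict[int, int], baseline_key: int) -> Dict[int, float]:
    """Latency of each setting relative to the chosen baseline setting."""
    base = latencies[baseline_key]
    return {key: round(value / base, 2) for key, value in latencies.items()}


def step_increase_pct(latencies: Dict[int, int]) -> List[float]:
    """Percentage increase between consecutive settings, in ascending key order."""
    keys = sorted(latencies)
    return [
        round(100 * (latencies[b] / latencies[a] - 1), 1)
        for a, b in zip(keys, keys[1:])
    ]


print(slowdown(limit_latency_ms, baseline_key=100))
print(step_increase_pct(ef_latency_ms))  # [51.7, 59.6, 58.4]
```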
Parallel search string overhead. The system sends multiple search strings per question concurrently. The cost of this parallelism is sub-linear: wall-clock time grows with the number of concurrent searches, but much less than linearly:
| Parallel searches | Overhead vs. single |
|---|---|
| 1–3 | 5–21% |
| 4–5 | 51–63% |
| 7–10 | 85–156% |
With 4–5 search strings per question (our typical range), the wall-clock time for the full retrieval step is roughly 1.5–1.6x a single search call, not 4–5x.
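Issuing the search strings concurrently, as described above, might look like the following sketch. The `search` coroutine is a hypothetical stand-in for the real search API call (it only sleeps and returns fake chunk IDs), and the search strings are made up.

```python
import asyncio
from typing import List


async def search(query: str, limit: int = 200) -> List[str]:
    """Hypothetical stand-in for one search API call; returns fake chunk IDs."""
    await asyncio.sleep(0.01)  # simulated embedding, search, and network time
    return [f"{query}-chunk-{i}" for i in range(3)]


async def search_all(search_strings: List[str]) -> List[List[str]]:
    """Issue every search string concurrently; results keep the input order."""
    return await asyncio.gather(*(search(s) for s in search_strings))


results = asyncio.run(search_all(["parental leave", "leave policy", "maternity"]))
print(len(results))  # 3
```

In this idealized stub the calls do not compete for resources, so wall-clock time stays close to a single call. The measured 51–63% overhead for 4–5 searches reflects contention in the real system that the stub does not model.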
Choice of Fusion Strategy for Hybrid Search
We evaluated three strategies for merging vector and keyword search results in Hybrid mode:
| Strategy | Description |
|---|---|
| Interleaved (ZIP) | Alternates results from each search type based on a weight ratio |
| Reciprocal Rank Fusion (RRF) | Assigns scores based on rank position in each list |
| Distribution-Based Score Fusion (DBSF) | Normalizes raw scores from each search type and computes a weighted average |
All three strategies produce comparable Recall@200 — the choice of merge strategy does not significantly affect whether the relevant chunk is found in the top 200 results. Differences are within 1 percentage point across all datasets.
DBSF provides a modest advantage in ranking quality (MRR), meaning the relevant chunk tends to appear slightly earlier in the result list. This is why DBSF is our recommended default. However, the fusion strategy is not a critical lever — retrieval quality is driven primarily by the underlying search engines, not by how their results are merged.
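To make the difference between rank-based and score-based fusion concrete, here is a minimal sketch of RRF and DBSF. Several details are assumptions rather than facts from this evaluation: the RRF constant k=60 (a commonly used default), normalization bounds of mean ± 3 standard deviations for DBSF, equal weights, and a contribution of zero for chunks missing from one list. The scores are toy values.

```python
from statistics import mean, pstdev
from typing import Dict, List


def rrf(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Reciprocal Rank Fusion: sum of 1 / (k + rank) over the input lists."""
    fused: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return fused


def dbsf_normalize(scores: Dict[str, float]) -> Dict[str, float]:
    """Scale scores to [0, 1] using mean +/- 3 standard deviations as bounds."""
    values = list(scores.values())
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return {cid: 0.5 for cid in scores}  # all scores equal: no ranking signal
    low, high = mu - 3 * sigma, mu + 3 * sigma
    return {
        cid: min(1.0, max(0.0, (s - low) / (high - low)))
        for cid, s in scores.items()
    }


def dbsf(
    vector_scores: Dict[str, float],
    keyword_scores: Dict[str, float],
    vector_weight: float = 0.5,
) -> Dict[str, float]:
    """Weighted average of normalized scores; a missing chunk contributes 0."""
    v, kw = dbsf_normalize(vector_scores), dbsf_normalize(keyword_scores)
    return {
        cid: vector_weight * v.get(cid, 0.0) + (1 - vector_weight) * kw.get(cid, 0.0)
        for cid in set(v) | set(kw)
    }


vector_scores = {"a": 0.82, "b": 0.79, "c": 0.40}   # similarity scores (toy)
keyword_scores = {"b": 12.1, "c": 9.3, "d": 2.0}    # BM25 scores (toy)
fused = dbsf(vector_scores, keyword_scores)
print(sorted(fused, key=fused.get, reverse=True))   # ['b', 'c', 'a', 'd']
```

RRF discards score magnitudes and uses only positions, while DBSF keeps them: a chunk that scores far above the rest in one list retains that margin after normalization, which is one plausible reason for a ranking-quality difference between the two.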