AI Verify Foundation
Executive Summary
We teamed up with QuantPi in the AI Verify pilot to test our RAG-based Investment Research Assistant. We focused on the big risks: making sure answers are grounded in the right documents, the search pulls the right info, and results stay reliable (even with typos or tricky questions), plus staying within policy and protecting data. We ran about 200 realistic tests in a secure setup and measured both the search and the AI’s answers with simple stats and an AI judge checked by domain experts. The results showed where retrieval quality drives answer quality, helped us tweak search and prompts, and gave us a repeatable way to keep the assistant accurate and trustworthy in real use.
Background: Unique at the AI Verify Foundation
Unique has been a proud member of the AI Verify Foundation (https://aiverifyfoundation.sg/) since October 2024. The foundation is the global open‑source initiative by IMDA Singapore to build trustworthy AI through practical testing norms and tools.
From February to May 2025, the AI Verify Foundation (AIVF) ran the Global AI Assurance Pilot (https://assurance.aiverifyfoundation.sg/), pairing 17 real‑world GenAI applications with 16 specialist testing firms to codify “what to test” and “how to test” for application reliability beyond model safety.
Our Pairing and Use Case
Pairing: Unique (deployer) × QuantPi (https://www.quantpi.com/) (tester), one of the pilot’s official matches.
Use case: Investment Research Assistant, an LLM‑enabled assistant for bank relationship managers to query stock universes, analyze fact sheets, and draft tailored investment rationales and follow‑up emails. The system uses Retrieval‑Augmented Generation (RAG) and supports enterprise deployments (cloud/on‑prem) with strict compliance and data governance.
Risks We Prioritized
From a broader risk list, we selected the risks most relevant to our financial research context:
Accuracy/faithfulness of generated outputs to the provided context; minimizing hallucinations.
Retrieval layer reliability (inaccurate or irrelevant results) impacting downstream answer quality.
Robustness across query difficulty, domain bias, and tolerance to typos; consistency of advice quality across segments.
Non‑adherence to internal policy or regulatory expectations in recommendations (duty of care, restricted products).
Data prudence risks such as oversharing client data and potential advisor misuse; downstream poor outcomes if grounding is weak.
How We Measured Reliability
We combined metrics suited to our RAG pipeline with qualitative calibration:
For generator faithfulness and hallucination:
Cosine similarity between embeddings of predicted responses and ground truth, checked against thresholds of 0.4 and 0.8, to assess semantic closeness.
Faithfulness metric: claim‑level grounding of responses against retrieved context (higher faithfulness → lower hallucination).
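As a rough illustration of the cosine similarity check, the sketch below scores a response against a ground-truth answer and sorts the score into bands at 0.4 and 0.8. The word-count "embedding" is a toy stand-in for a real embedding model, and the band names and the reading of the two thresholds are assumptions made for illustration, not the configuration used in the pilot.

```python
import math
from collections import Counter


def bow_vector(text):
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine_similarity(a, b):
    """Cosine similarity between two sparse vectors stored as Counters."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty text has no direction to compare
    return dot / (norm_a * norm_b)


def similarity_band(score, low=0.4, high=0.8):
    """Map a score to a band; band names are hypothetical labels."""
    if score >= high:
        return "close"
    if score >= low:
        return "partial"
    return "distant"


response = bow_vector("The fund invests in global equities")
truth = bow_vector("The fund invests mainly in global equities")
print(similarity_band(cosine_similarity(response, truth)))
```

With a real embedder, only `bow_vector` would change; the similarity and banding logic stay the same.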
For search/retrieval quality:
Word Overlap Rate: overlap of ground‑truth words with retrieved context chunks.
Mean Reciprocal Rank (MRR): how quickly the first relevant item appears in ranked results.
Lenient Retrieval Accuracy: whether relevant context is present in the retrieved set (binary).
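The three retrieval metrics above can be sketched in a few lines. This is a minimal version under stated assumptions: words are split on whitespace, relevance is given as one list of booleans per query in ranked order, and all function names are hypothetical rather than taken from the testing platform.

```python
def word_overlap_rate(ground_truth, chunks):
    """Share of distinct ground-truth words found in the retrieved chunks."""
    truth_words = set(ground_truth.lower().split())
    if not truth_words:
        return 0.0
    context_words = set(" ".join(chunks).lower().split())
    return len(truth_words & context_words) / len(truth_words)


def reciprocal_rank(relevance):
    """1/rank of the first relevant result, or 0.0 if none is relevant."""
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(relevance_per_query):
    """Average reciprocal rank over all queries."""
    if not relevance_per_query:
        return 0.0
    total = sum(reciprocal_rank(r) for r in relevance_per_query)
    return total / len(relevance_per_query)


def lenient_retrieval_accuracy(relevance_per_query):
    """Share of queries with at least one relevant result retrieved."""
    if not relevance_per_query:
        return 0.0
    hits = sum(1 for r in relevance_per_query if any(r))
    return hits / len(relevance_per_query)


# Three queries: first hit at rank 2, at rank 1, and no hit at all.
runs = [[False, True, False], [True, False], [False, False]]
print(mean_reciprocal_rank(runs))        # 0.5
print(lenient_retrieval_accuracy(runs))  # 2 of 3 queries have a hit
```

MRR rewards ranking the relevant chunk early, while the lenient accuracy only asks whether it was retrieved at all, so comparing the two helps separate ranking problems from recall problems.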
Evaluators used:
Rule‑based logic and surface/semantic metrics, plus LLM‑as‑judge where appropriate, calibrated by human SMEs to align automated scores with domain expectations.
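One common way to check how well an LLM judge lines up with SME labels is a chance-corrected agreement score such as Cohen's kappa. The sketch below is an illustration only: the pilot write-up does not state which agreement measure was used, and the labels are invented.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two raters on the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Fraction of items on which both raters gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance, from each rater's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = count_a.keys() | count_b.keys()
    expected = sum(count_a[lab] * count_b[lab] for lab in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Invented verdicts from an LLM judge and an SME on the same four answers.
judge = ["pass", "pass", "fail", "fail"]
sme = ["pass", "pass", "fail", "pass"]
print(cohens_kappa(judge, sme))  # 0.5
```

A low kappa on a calibration batch would suggest revising the judge prompt or rubric before trusting its scores at scale.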
Test Design and Implementation
Testing approach:
Two complementary scopes: the Investment Research Assistant (RAG) and a Document Search subsystem, isolating search layer risks vs. generator faithfulness.
Scenarios varied across query types, domains, lengths, complexity, and injected typos to probe robustness and bias.
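Typo injection can be approximated with a small, seeded perturber. The sketch below swaps adjacent letters at a configurable rate. It is a hypothetical stand-in written for illustration and does not reflect how QuantPi's perturbers are implemented.

```python
import random


def inject_typos(text, rate=0.1, seed=0):
    """Swap adjacent letters with probability `rate`; seeded for repeatability."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        both_letters = chars[i].isalpha() and chars[i + 1].isalpha()
        if both_letters and rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair so it is not swapped back
        else:
            i += 1
    return "".join(chars)


query = "Which funds have exposure to European equities?"
print(inject_typos(query, rate=0.2, seed=42))
```

Fixing the seed keeps each perturbed variant reproducible, so a score change between runs can be attributed to the system under test and not to a different set of typos.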
Environment and data:
Secure staging environment with strict access controls; asynchronous collaboration with NDA in place.
Approx. 200 samples total: ~100 provided by Unique + ~100 perturbed/adversarial variants introduced by QuantPi; anonymized but realistic financial data.
Effort and cost:
Unique: ~12 hours engineering integration + ~4 hours SME review; QuantPi: ~60 hours setup/execution/analysis; minimal direct model cost in pilot scope.
Platform and tooling:
QuantPi’s PiCrystal testing engine (embedders, perturbers, metrics) to assemble repeatable scenarios across inputs/outputs for black‑box assessment, with scalable automation and visualized testing logic.
Read more in our Blog Post about how Unique’s Use Cases are tested
What Have We Learnt
Scoped risks and selected metrics for RAG reliability and content faithfulness aligned to financial research tasks and SME judgement.
Ran structured tests over our assistant and document search, using rule/statistical evaluators and LLM‑as‑judge where needed; applied typos and difficulty perturbations to test robustness.
Documented insights on where retrieval ranking and evidence selection materially affect downstream correctness, informing changes to search strategies and prompt scaffolding.
Challenges We Faced
Difficulty obtaining fit‑for‑purpose financial test data and “golden” ground truth; constraints on sharing sensitive attributes and internal traces.
Small sample sizes within pilot timelines limited the statistical confidence of the results; interpreting scores without comparable baselines is non‑trivial.
Limited ability (by design) to expose all internal agent/RAG subsystems due to IP protection and time constraints, even though subsystem testing improves diagnostics.
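The sample-size challenge in the list above can be made concrete with a confidence interval on a pass rate. The sketch below uses the Wilson score interval at roughly 95% confidence (z = 1.96). The choice of interval and the example counts are assumptions for illustration, not figures from the pilot.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical counts: the same 50% pass rate at two sample sizes.
for n in (50, 200):
    low, high = wilson_interval(n // 2, n)
    print(n, round(low, 3), round(high, 3))
```

At around 200 samples and a 50% pass rate, the interval still spans roughly 7 percentage points on either side, which is why small differences between configurations are hard to interpret without more data or a baseline.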
Key Insights We Took Away
Test what matters: context and use‑case drive risk selection; for RAG apps, retrieval quality is foundational to output reliability.
Don’t expect perfect datasets: plan for synthetic augmentation, and for adversarial and simulation testing, to cover edge cases.
Look under the hood: evaluate interim pipeline touchpoints where feasible to triangulate outcomes and aid debugging; agentic flows benefit strongly from granular tests.
Use LLM‑as‑judge with caution: design prompts carefully, calibrate with SMEs, and complement with simpler statistical/rule‑based measures where possible.