Benchmarking


Overview

The benchmarking service of Unique is designed to evaluate and ensure the quality of responses from language models and virtual assistants. It allows users to:

  • Test Accuracy: Automatically generate and compare responses to a set of benchmark questions to assess accuracy and performance.

  • Monitor Consistency: Detect deviations or drifts in model behavior over time to maintain consistent output quality.

  • Refine and Improve: Utilize detailed metrics to pinpoint areas for enhancement and validate the impact of updates or changes to the system.

This tool is critical for organizations looking to optimize the effectiveness and reliability of their AI-powered solutions.

Generate your Benchmarking ground truth

If you have not yet created a benchmarking set with typical questions and expected answers, there are two ways to do so. One very intuitive way (see Option 1) is to prompt the questions directly in the chat interface and rate the answers with the feedback option. The other way (see Option 2) is to generate the first answers with the benchmarking template in the benchmarking interface.

Option 1: Directly generate your benchmarking answers yourself

You can gather your benchmarking answers by prompting your questions directly in individual chat conversations and rating the answers.

Step 1: Gather typical questions per space and prompt them

  • Gather typical user questions for the individual spaces.

  • Prompt each question in a new chat conversation, one question at a time. (This can also be done in batches via the benchmarking interface. Please contact your responsible CS contact for further information.)

  • Check the answer in every conversation and rate it:

    • Give a 👍 if the answer is satisfactory. Leaving a comment is optional.

    • Give a 👎 if the answer is unsatisfactory (e.g., no information found, incomplete information, incorrect information). In this case, use the free-text field to state what the answer was missing and which source you would have expected the module to choose.

Step 2: Pull Feedback and apply first improvements

  • Once all the questions have been entered, go to the Feedback interface and pull the consolidated feedback (sortable by space within the Excel file).

  • Check the answers with your in-house specialist or the DS lead from Unique’s side.

  • Implement first measures to improve answers and re-run them

Step 3: Enter the questions and answers in the benchmarking template

  • Pull a final version of the feedback output from the feedback interface

  • Sort for the relevant space

  • Copy and paste the following information into the benchmarking Excel template (also include questions whose answers are still not satisfactory even though measures are in place): Question (Column B), Assistant (Column C), Correct Benchmark (Column D), Answer (Column E) and Sources (Column F).

Option 2: Automatically generating benchmark answers

You can also let the benchmarking service generate the answers for you automatically.

Step 1: Gather typical questions per space

  • You must be a member of the space.

  • If you are not assigned to a space included in the benchmark, the benchmark will not run for that space.

Fill in the following columns:

  • Column A – “id”: Number the rows in ascending order. These numbers make it easier to locate the questions when you compare answers.

  • Column B – “Question”: Add user questions collected from the individual spaces.

  • Column C – “Assistant”: Specify the space from which each question was taken.

  • Column D – “correct_benchmark”: This column indicates whether a reference answer exists to compare with the new run. Since this is the first run, enter “No.”

You only need to complete these four columns. The remaining columns will be filled automatically once the benchmarking process is executed.

Template: Benchmark_Input_template.xlsx (Excel spreadsheet)
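Before uploading, it can help to sanity-check the rows you plan to enter. The following is a minimal sketch, assuming the rows are available as Python dictionaries keyed by the four column names above. The function name `check_input_rows` and the sample data are hypothetical and not part of the Unique platform.

```python
def check_input_rows(rows):
    """Return a list of human-readable problems found in the input rows."""
    problems = []
    previous_id = None
    for row in rows:
        row_id = row.get("id")
        # Column A: ids are expected in ascending order.
        if previous_id is not None and row_id is not None and row_id <= previous_id:
            problems.append(f"id {row_id}: not in ascending order")
        if row_id is not None:
            previous_id = row_id
        # Columns B and C: a row needs both a question and an assistant.
        for column in ("Question", "Assistant"):
            if not str(row.get(column) or "").strip():
                problems.append(f"id {row_id}: missing {column}")
        # Column D: on the first run there is no reference answer yet.
        if str(row.get("correct_benchmark") or "").strip().lower() != "no":
            problems.append(f"id {row_id}: correct_benchmark should be 'No' on the first run")
    return problems


# Hypothetical sample rows for illustration only.
rows = [
    {"id": 1, "Question": "What is the notice period?", "Assistant": "HR Space", "correct_benchmark": "No"},
    {"id": 2, "Question": "", "Assistant": "HR Space", "correct_benchmark": "No"},
]
print(check_input_rows(rows))
```

Here the second row would be reported because its question is empty. According to the error code list in this article, rows with a missing question or assistant are skipped during the run (Benchmark_01).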

Step 2: Upload the Excel File to the Benchmarking

  • Drag and drop the Excel file in the benchmarking section of the Unique AI Platform

  • If the upload worked, you will see an “In Progress” tag next to the file name.

Step 3: Download file with automatically generated answers and review and classify it

Once all questions have been processed, you will see a green “ready” tag next to the file.

  • You can now download the file by clicking the download icon next to the file.

  • If you open the file, you should see the generated answers in column J “answer”.

  • Manually review all the generated answers and indicate in column D “correct_benchmark” whether each answer is as expected or not (e.g., no information found, incomplete information, incorrect information).

    • Add a “yes” if the answer is as expected

    • Add a “no” if the answer is not as expected

Tip: Including negative examples in your benchmarking set enables you to compare the quality of answers over time. Example: with GPT-3.5, 80% of the answers were correct (“yes”), and with GPT-4, 90% of the answers were correct.
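To track this share across runs, you can compute it from the “correct_benchmark” labels. This is a minimal sketch under the assumption that the labels of each run have been read into a Python list; `share_correct` and the sample labels are hypothetical.

```python
def share_correct(labels):
    """Fraction of reviewed answers labelled 'yes' (case-insensitive)."""
    reviewed = [label.strip().lower() for label in labels if label and label.strip()]
    if not reviewed:
        return 0.0
    return sum(label == "yes" for label in reviewed) / len(reviewed)


# Hypothetical labels from two runs over the same five questions.
run_a = ["yes", "yes", "no", "yes", "no"]
run_b = ["yes", "yes", "yes", "yes", "no"]
print(f"{share_correct(run_a):.0%} -> {share_correct(run_b):.0%}")
```

Empty cells are ignored, so unreviewed rows do not lower the share.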

Compare Answers with Benchmark

Step 1: Upload benchmarking file

  • Make sure your benchmarking file includes the following information

    • Question (Column B)

    • Assistant (Column C)

    • correct_benchmark (Column D) and answer_benchmark (Column E)

    • sources_used_benchmark (column F): optional, if you want to compare the sources

  • Drag and drop the Excel file in the benchmarking section of the Unique AI Platform

  • If the upload worked, you will see an “In Progress” tag next to the file name.

Step 2: Download the benchmarking file

  • Once all questions have been processed, you will see a green “ready” tag next to the file.

  • You can now download the file by clicking the download icon next to the file.

Step 3: Review Flags within file (incl. column description)

It is recommended to filter final_flag (column Z) for TRUE and to manually evaluate how the new answers differ from the benchmark answers.

The columns with “flags” in their name perform an automated test using GPT to evaluate if the benchmark answer and the newly generated answer match.

  • FALSE: means the test did NOT find a significant deviation in the results

  • TRUE: means the test found a significant deviation in the results. These results should be checked manually by a human.

Column Z (final_flag) is a summary of all the tests: if a deviation was found in any one of the tests (TRUE), the final flag is always TRUE.

Explanation of the columns:

  • Answer (Column J): automatically generated answer in the comparison run. These answers are compared to the benchmark answers (column E: answer_benchmark)

  • Sources (Column K): automatically generated list of the sources used in the comparison run. These are compared to the benchmark sources, if available (column F: sources_used_benchmark)

  • Modules (Column L): Coming soon - will indicate which module has been selected (e.g., search, follow-up, etc.)

  • Followup (Column M): Coming soon - will indicate if it is a follow-up question or not

  • ChatMessages (Column N): debug information that can be used to debug a problem after identifying one.

  • emb_text (Column O): coming soon (ignore for now). This field will contain the cosine similarity of the embeddings of the reference answer and generated answer. The closer that value is to 1, the larger the overlap between the answers.

  • emb_flag (Column P): coming soon (ignore for now). TRUE if the similarity is below the threshold of 0.92

  • contra_text (Column Q): Explanation of why the contra_flag was set to TRUE

  • contra_flag (Column R): TRUE if the two answers contradict each other or have a significantly different meaning.

  • ext_text (Column S): Explanation of why the ext_flag was set to TRUE

  • ext_flag (Column T): TRUE if the two answers differ in their extent (e.g., one includes significantly more or less information than the other)

  • halluzination_text (Column U): Explanation of why the halluzination_flag was set to TRUE

  • halluzination_flag (Column V): TRUE if the answer indicates hallucinations. This is tested by comparing the generated answer with the content of the referenced sources. If the answer contains any information that is not present in the sources, the hallucination flag is TRUE.

  • source_flag (Column W): TRUE, if the newly generated answer is missing at least one reference from the benchmark answer.

  • module_flag (Column X): coming soon (ignore for now)

  • relation_flag (Column Y): coming soon (ignore for now)

  • final_flag (Column Z): Summary of all conducted tests (columns O-Y). If any test in columns O-Y found a deviation (TRUE), final_flag is set to TRUE. It is recommended to filter final_flag for TRUE.

  • explanation (Column AA): Explanation of why the final_flag was set to TRUE
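If you process the downloaded file programmatically, the column letters above can be mapped to positions in a row. The sketch below assumes each spreadsheet row has been read into a plain Python list of cells (column A first) and that flags are stored as TRUE/FALSE values. The helper names are hypothetical.

```python
def column_index(letters):
    """Convert an Excel column letter such as 'Z' or 'AA' to a zero-based index."""
    index = 0
    for char in letters.upper():
        index = index * 26 + (ord(char) - ord("A") + 1)
    return index - 1


# Flag columns as listed above.
FLAG_COLUMNS = {
    "emb_flag": "P",
    "contra_flag": "R",
    "ext_flag": "T",
    "halluzination_flag": "V",
    "source_flag": "W",
    "module_flag": "X",
    "relation_flag": "Y",
}


def raised_flags(row):
    """Names of the individual flags that are TRUE in one spreadsheet row."""
    return [
        name
        for name, letter in FLAG_COLUMNS.items()
        if str(row[column_index(letter)]).strip().upper() == "TRUE"
    ]


# Hypothetical row with 27 cells (columns A to AA), only contra_flag raised.
row = [""] * 27
row[column_index("R")] = "TRUE"
row[column_index("Z")] = "TRUE"
print(raised_flags(row))
```

Listing the raised flags per row shows at a glance which test caused final_flag to be TRUE.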

Step 4: Add a manual evaluation of flagged answers

We recommend adding an additional column AB to record the result of your human review. You can name it “human_review_answer_correct” and enter “yes” or “no”. As with column D (correct_benchmark), you evaluate here whether the new answers are correct.

For every answer with final_flag = TRUE, we recommend reviewing it manually and evaluating whether the answer differs from the benchmark but is still correct. If it is correct, add a “yes” in the column “human_review_answer_correct”.

Based on the automated tests, we estimate that answers with final_flag = FALSE are very likely correct, so you could set all of these rows to “yes” after reviewing a few samples.
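The review workflow above could be prepared as follows. This is only a sketch, assuming each row is a dictionary with an `id` and a `final_flag` value. The function name `prefill_human_review` is hypothetical, and spot-checking some of the pre-filled rows remains necessary.

```python
def prefill_human_review(rows):
    """Pre-fill 'yes' for unflagged rows and return the ids that need manual review."""
    needs_review = []
    for row in rows:
        flagged = str(row["final_flag"]).strip().upper() == "TRUE"
        if flagged:
            # Leave empty so that a human decides between "yes" and "no".
            row["human_review_answer_correct"] = ""
            needs_review.append(row["id"])
        else:
            row["human_review_answer_correct"] = "yes"
    return needs_review


# Hypothetical result rows for illustration only.
rows = [
    {"id": 1, "final_flag": "FALSE"},
    {"id": 2, "final_flag": "TRUE"},
    {"id": 3, "final_flag": False},
]
print(prefill_human_review(rows))
```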

Note: This is the first version of the automated tests. Please let us know if you notice that some flags tend to produce many false positives, meaning the result is TRUE although the answers are correct and comparable.

Definition of benchmark metrics/scores

The evaluation of whether a generated response is considered equivalent to the benchmark run is carried out by combining numerous metrics. Even if a single metric shows a possible anomaly, a possible deviation is signaled and noted for manual analysis. This section explains the different metrics in detail.

Embedding Comparison

This metric assesses the degree of similarity between the embeddings of the reference answer and the new benchmark answer. A high similarity score indicates substantial content overlap between the two answers. The threshold is set at 0.92. If the similarity score falls below this threshold, the new benchmark answer is considered to diverge considerably from the reference answer. In this case, the system marks the result of this test as TRUE, otherwise as FALSE.
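As an illustration of this rule, the following sketch computes the cosine similarity of two embedding vectors and applies the 0.92 threshold. It assumes the embeddings are already available as lists of floats. How Unique computes the embeddings is not described here, and the toy vectors are hypothetical.

```python
import math

EMBEDDING_THRESHOLD = 0.92  # threshold stated in this section


def cosine_similarity(a, b):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def emb_flag(reference_embedding, new_embedding, threshold=EMBEDDING_THRESHOLD):
    """TRUE if the similarity falls below the threshold (possible deviation)."""
    return cosine_similarity(reference_embedding, new_embedding) < threshold


# Toy two-dimensional vectors; real embeddings have many more dimensions.
reference = [1.0, 0.0]
print(emb_flag(reference, [0.9, 0.1]))  # similar direction
print(emb_flag(reference, [1.0, 1.0]))  # 45 degrees apart
```

The first comparison stays above the threshold and is not flagged. The second has a similarity of about 0.71 and is flagged.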

Contradiction Comparison

This metric evaluates the consistency between the reference response and the response from a new benchmark test by checking for contradictions. Both responses are submitted to a GPT model for analysis. If the model detects any contradictory statements between the two, it will return TRUE, indicating inconsistency. If no contradictions are found, it will return FALSE, confirming that the responses are consistent.

Extent Comparison

This metric evaluates the comprehensiveness and overlap of the two answers (the reference answer and the answer from the new benchmark run) in relation to the benchmark question. The reference answer is assumed to contain the expected information. The objective is to determine whether one of the answers addresses the question more thoroughly than the other. The outcome is binary: if either the new benchmark answer or the reference answer provides a more comprehensive response to the question, the metric is set to TRUE, because the two answers do not address the user’s question to the same extent. Conversely, if both answers are equally comprehensive, the metric is FALSE.

Hallucination

The purpose of this metric is to determine if all information contained in the response is purely taken from the provided sources, meaning that the model is not hallucinating. A GPT-4o call evaluates whether the answer is (a) fully, (b) partially, or (c) not at all supported by the provided sources.

  • Fully supported: the generated answer is fully consistent with the sources. No additional information is contained in the answer that is not part of the sources

  • Partially supported: the output is consistent with the sources but contains some unsupported elements

  • No support: the information in the answer is not at all taken from the sources

If the generated answer is only partially or not at all supported by the provided sources, this indicates hallucination, and the metric is set to TRUE, else to FALSE.
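The mapping from the three support levels to the flag can be sketched as below. The label strings and the function name are hypothetical. The actual output format of the GPT-4o evaluation is not documented here.

```python
# Hypothetical labels for the three support levels described above.
SUPPORT_LEVELS = ("fully", "partially", "none")


def halluzination_flag(support_level):
    """TRUE unless the answer is fully supported by the provided sources."""
    if support_level not in SUPPORT_LEVELS:
        raise ValueError(f"unknown support level: {support_level!r}")
    return support_level != "fully"


for level in SUPPORT_LEVELS:
    print(level, halluzination_flag(level))
```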

Reference (Source) Comparison

This metric compares the sources referenced by the reference answer with those referenced by the new answer. The purpose is to check whether the same documents were used to generate the answer, which indicates consistent answer content. If all sources contained in the reference answer are also part of the new answer, this metric is FALSE, otherwise TRUE.
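This rule is a subset check. A minimal sketch, assuming the sources of each answer are available as lists of document identifiers (the file names are hypothetical):

```python
def source_flag(reference_sources, new_sources):
    """TRUE if at least one reference source is missing from the new answer."""
    return not set(reference_sources).issubset(new_sources)


# Additional sources in the new answer do not raise the flag.
print(source_flag(["policy.pdf", "faq.pdf"], ["policy.pdf", "faq.pdf", "guide.pdf"]))
# A missing reference source does.
print(source_flag(["policy.pdf", "faq.pdf"], ["policy.pdf"]))
```

Note that the check is one-directional: the new answer may cite additional documents without being flagged.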

Module Comparison

If an assistant contains multiple modules, a module selector chooses the most suitable module for a user input. The choice of module has a big impact on the structure and quality of the answer, as each module is optimized for a different use case (e.g., knowledge search or translation). It is therefore crucial that the module choice is consistent for the same user input. This metric compares the module chosen in the reference run with the one chosen in the new benchmark run. If the chosen modules match, the metric is FALSE, otherwise TRUE.

Final Flag

The final assessment of whether a generated response is considered equivalent to the benchmark run is made by combining all of the above metrics. Only if all metrics are marked as FALSE is the new response considered equivalent to the reference response. If at least one metric is TRUE, this response is marked as potentially deviating and must be analyzed manually.
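In other words, the final flag is a logical OR over the individual metrics. A minimal sketch, assuming the individual results are available as a dictionary of booleans (the function name and sample values are hypothetical):

```python
def final_flag(metric_flags):
    """Return (final_flag, names of the metrics that are TRUE)."""
    raised = [name for name, value in metric_flags.items() if value]
    return bool(raised), raised


# Hypothetical results for one benchmark entry.
flags = {
    "emb_flag": False,
    "contra_flag": False,
    "ext_flag": True,
    "halluzination_flag": False,
    "source_flag": False,
    "module_flag": False,
}
print(final_flag(flags))  # prints (True, ['ext_flag'])
```

A single TRUE metric is enough to mark the entry for manual analysis.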

Error Codes

Error codes and their descriptions:

  • Benchmark_01: Skipping row because of missing data (question or assistant)

  • Benchmark_02: Benchmark object of BenchmarkEntry not found

  • Benchmark_03: Provided Assistant not found

  • Benchmark_04: User message (question) not found after creation of message.

  • Benchmark_05: Assistant message (answer) not found after creation of message or not marked as completed (External Modules)

  • Benchmark_06: Assistant message (answer) has no originalText for further processing

  • Benchmark_07: Error while doing the benchmark of an entry

  • Benchmark_08: Missing result of the comparison of a benchmark entry (correct_benchmark not set?)

  • Benchmark_09: MessageCreate Failed - Could not create a new chat and send the message

  • Benchmark_10: Error while validating the results of a benchmark entry

  • Benchmark_98: Benchmarking: Run Aborted

  • Benchmark_99: General Error

Adopt benchmark set

After switching a space to a new version of prompts or LLMs, you should also update your benchmarking set, because the answers have usually improved compared to the original benchmark set. To do so, create a new file and copy columns J-N from your last run to columns E-I.
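The column copy can be sketched as follows, assuming each row of the last run has been read into a Python list of cells with column A first. Carrying over columns A-D unchanged is an assumption of this sketch, and `adopt_last_run` is a hypothetical helper name.

```python
def adopt_last_run(row):
    """Build a new benchmark row: columns A-D kept, columns J-N copied into E-I."""
    if len(row) < 14:
        raise ValueError("row must contain at least columns A to N")
    # Zero-based positions: A-D are 0..3, J-N are 9..13.
    return list(row[0:4]) + list(row[9:14])


# Placeholder row whose cells are named after their column letters.
last_run_row = list("ABCDEFGHIJKLMN")
print(adopt_last_run(last_run_row))
```

After this step you would still review column D (correct_benchmark) so that it reflects your assessment of the new reference answers.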
