Hallucination Evaluation

The service is integrated into the following spaces and modules:

Question Answerer

Introduction

In Retrieval-Augmented Generation (RAG) systems, language models generate responses by combining retrieved documents with their pre-trained knowledge. This hybrid approach aims to improve the relevance and accuracy of generated content by grounding it in real data. However, even in RAG settings, hallucinations — instances where the model produces information not supported by the retrieved sources — can still occur. Evaluating and mitigating these hallucinations is crucial for ensuring the reliability of the generated outputs.

Customizable Hallucination Warning Messages

The platform lets you customize hallucination warning messages. The standard hallucination alert can be replaced with a tailored message, such as “Be aware, hallucination detected.” A tailored message gives users guidance that fits the context in which the response is shown.
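
As a rough illustration, a custom message could be merged over the default titles as sketched below. This assumes the warning text is controlled through the `scoreToTitle` mapping from the Default Configuration section, which this page does not state explicitly. The helper `with_custom_titles` is hypothetical and not part of the platform.

```python
# Default titles, copied from the Default Configuration section.
DEFAULT_SCORE_TO_TITLE = {
    "LOW": "No Hallucination Detected",
    "MEDIUM": "Hallucination Warning",
    "HIGH": "High Hallucination",
}


def with_custom_titles(overrides: dict[str, str]) -> dict[str, str]:
    """Return a copy of the default titles with the given levels replaced.

    Hypothetical helper: rejects keys other than LOW, MEDIUM, and HIGH.
    """
    unknown = set(overrides) - set(DEFAULT_SCORE_TO_TITLE)
    if unknown:
        raise ValueError(f"Unknown hallucination levels: {sorted(unknown)}")
    titles = dict(DEFAULT_SCORE_TO_TITLE)  # copy so the defaults stay untouched
    titles.update(overrides)
    return titles


# Replace only the HIGH title; LOW and MEDIUM keep their defaults.
custom_titles = with_custom_titles({"HIGH": "Be aware, hallucination detected."})
```

The resulting dictionary could then be supplied as the `scoreToTitle` value of the configuration, if that assumption holds for your deployment.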

Importance of Hallucination Evaluation

Context:
In a RAG-based system, generating accurate and reliable responses is critical, especially in high-stakes fields such as finance and legal services. Hallucinations in this context can spread misinformation, cause harm, or lead to incorrect decisions. Evaluating hallucination levels is therefore essential to maintain the integrity of the responses generated by the system.

Challenges:
Hallucinations can arise from various factors:

Model Bias: The language model may introduce biases or assumptions that are not present in the retrieved documents.
Incomplete Retrieval: If the retrieved documents do not fully cover the topic, the model might extrapolate or generate additional information to fill gaps.
Complex Queries: For complex or ambiguous queries, the model might generate more speculative responses that go beyond the content of the retrieved documents.

Hallucination Level Metric

To systematically evaluate and mitigate these hallucinations, a hallucination level metric can be applied to the responses generated by RAG systems. This metric categorizes responses into three levels based on their adherence to the retrieved sources:

Low Hallucination: The response is almost entirely grounded in the retrieved documents, with minimal or no deviation. This level indicates that the generated content is highly reliable and closely aligned with the source material.
Medium Hallucination: The response generally follows the retrieved documents but includes some additional details or interpretations that are not explicitly supported by the sources. While still useful, these responses require some caution, especially in critical decision-making scenarios.
High Hallucination: The response significantly deviates from the retrieved documents, introducing information that is not supported by the source material. This level suggests that the response may be unreliable and should be cross-verified or treated with skepticism.
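
The three levels above can be turned into a display label and title using the `scoreToLabel` and `scoreToTitle` mappings from the Default Configuration section. The following is a minimal sketch, not the platform's implementation. The `HallucinationBadge` class and `to_badge` function are hypothetical names, and the case normalization is an assumption based on the default prompts returning lowercase values while the mappings use uppercase keys.

```python
from dataclasses import dataclass

# Values copied from the Default Configuration section.
SCORE_TO_LABEL = {"LOW": "GREEN", "MEDIUM": "YELLOW", "HIGH": "RED"}
SCORE_TO_TITLE = {
    "LOW": "No Hallucination Detected",
    "MEDIUM": "Hallucination Warning",
    "HIGH": "High Hallucination",
}


@dataclass(frozen=True)
class HallucinationBadge:
    """Hypothetical container for what a UI might display."""
    level: str
    label: str
    title: str


def to_badge(level: str) -> HallucinationBadge:
    """Map an evaluated level ("low", "medium", "high") to label and title."""
    key = level.strip().upper()  # assumption: mapping keys are uppercase
    if key not in SCORE_TO_LABEL:
        raise ValueError(f"Unexpected hallucination level: {level!r}")
    return HallucinationBadge(key, SCORE_TO_LABEL[key], SCORE_TO_TITLE[key])


badge = to_badge("medium")
```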

Implementation in RAG Systems

Workflow:

Document Retrieval: The system first retrieves relevant documents based on the user's query.
Response Generation: The language model generates a response by synthesizing information from these documents.
Hallucination Evaluation: The hallucination level metric is applied to the generated response to assess how closely it aligns with the retrieved content.
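
The three workflow steps can be sketched as a small pipeline. All names here (`answer_with_evaluation` and the `fake_*` functions) are hypothetical. The placeholder evaluator uses a plain substring check purely so the example runs on its own; the platform performs this step with a language model, as described in the Configuration section.

```python
from typing import Callable

Retriever = Callable[[str], list[str]]
Generator = Callable[[str, list[str]], str]
Evaluator = Callable[[str, list[str], str], str]


def answer_with_evaluation(
    query: str, retrieve: Retriever, generate: Generator, evaluate: Evaluator
) -> dict[str, object]:
    """Run retrieval, generation, and hallucination evaluation in order."""
    documents = retrieve(query)                   # 1. Document Retrieval
    response = generate(query, documents)         # 2. Response Generation
    level = evaluate(query, documents, response)  # 3. Hallucination Evaluation
    return {"response": response, "documents": documents, "hallucinationLevel": level}


# Placeholder components for demonstration only.
def fake_retrieve(query: str) -> list[str]:
    return ["Invoices are due within 30 days."]


def fake_generate(query: str, documents: list[str]) -> str:
    return documents[0]


def fake_evaluate(query: str, documents: list[str], response: str) -> str:
    # Stand-in for the LLM-based check: "low" only if the response
    # appears verbatim in one of the retrieved documents.
    return "low" if any(response in doc for doc in documents) else "high"


result = answer_with_evaluation(
    "When are invoices due?", fake_retrieve, fake_generate, fake_evaluate
)
```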

Benefits:

Enhanced Trust: By providing a clear indication of the hallucination level, users can better gauge the trustworthiness of the generated response.
Improved Decision-Making: In fields requiring high accuracy, the ability to identify and assess hallucination levels can lead to better, more informed decisions.
Continuous Improvement: Feedback from hallucination evaluations can be used to refine the user prompt.

Configuration

Default Configuration

json

{
    "enabled": false,
    "name": "hallucination",
    "languageModel": "AZURE_GPT_4o_2024_1120",
    "additionalLlmOptions": {},
    "customPrompts": {
        "systemPrompt": "\nYou will receive a question, references, a conversation between a user and an agent, and an output. \nThe output is the answer to the question. \nYour task is to evaluate if the output is fully supported by the information provided in the references and conversation, and provide explanations on your judgement in 2 sentences.\n\nUse the following entailment scale to generate a score:\n[low] - All information in output is supported by the references/conversation, or extractions from the references/conversation.\n[medium] - The output is supported by the references/conversation to some extent, but there is at least some information in the output that is not discussed in the references/conversation. For example, if an instruction asks about two concepts and the references/conversation only discusses either of them, it should be considered a [medium] hallucination level.\n[high] - The output contains information that is not part of the references/conversation, is unrelated to the references/conversation, or contradicts the references/conversation.\n\nMake sure to not use any external information/knowledge to judge whether the output is true or not. Only check whether the output is supported by the references/conversation, and not whether the output is correct or not. Also do not evaluate if the references/conversation contain further information that is not part of the output but could be relevant to the question. If the output mentions a plot or chart, ignore this information in your evaluation.\n\nYour answer must be in JSON format:\n{\n \"reason\": Your explanation of your judgement of the evaluation,\n \"value\": decision, must be one of the following: [\"high\", \"medium\", \"low\"]\n}                                                  \n",
        "userPrompt": "\nHere is the data:\n\nInput:\n'''\n$input_text\n'''\n\nReferences:\n'''\n$contexts_text\n'''\n\nConversation:\n'''\n$history_messages_text\n'''\n\nOutput:\n'''\n$output_text\n'''\n\nAnswer as JSON:\n",
        "systemPromptDefault": "\nYou will receive a question and an output. \nThe output is the answer to the question. \nThe situation is that no references could be found to answer the question. Your task is to evaluate if the output contains any information to answer the question,\nand provide a short explanations of your reasoning in 2 sentences. Also mention in your explanation that no references were provided to answer the question.\n\nUse the following entailment scale to generate a score:\n[low] - The output does not contain any information to answer the question.\n[medium] - The output contains some information to answer the question, but does not answer the question entirely. \n[high] - The output answers the question.\n\nIt is not considered an answer when the output relates to the questions subject. Make sure to not use any external information/knowledge to judge whether the output is true or not. Only check that the output does not answer the question, and not whether the output is correct or not.\nYour answer must be in JSON format:\n{\n \"reason\": Your explanation of your reasoning of the evaluation,\n \"value\": decision, must be one of the following: [\"low\", \"medium\", \"high\"]\n}\n",
        "userPromptDefault": "\nHere is the data:\n\nInput:\n'''\n$input_text\n'''\n\nOutput:\n'''\n$output_text\n'''\n\nAnswer as JSON:\n"
    },
    "scoreToLabel": {
        "LOW": "GREEN",
        "MEDIUM": "YELLOW",
        "HIGH": "RED"
    },
    "scoreToTitle": {
        "LOW": "No Hallucination Detected",
        "MEDIUM": "Hallucination Warning",
        "HIGH": "High Hallucination"
    }
}
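
The `$input_text`, `$contexts_text`, `$history_messages_text`, and `$output_text` placeholders in the prompts follow the same syntax as Python's `string.Template`. The sketch below shows how such a prompt could be filled in and how the JSON answer requested by the system prompt could be validated. It assumes that substitution style and is not a description of the platform's internals; `render_user_prompt` and `parse_evaluation` are hypothetical names.

```python
import json
from string import Template

# The default "userPrompt" value from the configuration above.
USER_PROMPT = (
    "\nHere is the data:\n\nInput:\n'''\n$input_text\n'''\n\n"
    "References:\n'''\n$contexts_text\n'''\n\n"
    "Conversation:\n'''\n$history_messages_text\n'''\n\n"
    "Output:\n'''\n$output_text\n'''\n\nAnswer as JSON:\n"
)

ALLOWED_VALUES = {"low", "medium", "high"}


def render_user_prompt(
    input_text: str, contexts_text: str, history_messages_text: str, output_text: str
) -> str:
    """Fill the four placeholders; raises KeyError if one is missing."""
    return Template(USER_PROMPT).substitute(
        input_text=input_text,
        contexts_text=contexts_text,
        history_messages_text=history_messages_text,
        output_text=output_text,
    )


def parse_evaluation(raw: str) -> dict[str, str]:
    """Parse the evaluator's JSON answer and check the "value" field."""
    data = json.loads(raw)
    value = str(data.get("value", "")).lower()
    if value not in ALLOWED_VALUES:
        raise ValueError(f"Unexpected hallucination value: {value!r}")
    return {"reason": str(data.get("reason", "")), "value": value}


prompt = render_user_prompt(
    "When are invoices due?",
    "Invoices are due within 30 days.",
    "",
    "Invoices are due within 30 days.",
)
evaluation = parse_evaluation('{"reason": "Fully supported.", "value": "low"}')
```

Validating the `value` field before mapping it through `scoreToLabel` guards against a model answer that does not follow the requested JSON format.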

Parameters

Field Name	Description	Type	Default Value	Required
`enabled`	Whether to enable hallucination evaluation.	boolean	`false`	Yes
`name`	Name of the metric to be calculated.	string	`hallucination`	Yes
`languageModel`	Language model to be used	Language Model Info	`AZURE_GPT_4o_2024_1120`	No
`additionalLlmOptions`	Additional parameters for the LLM	dict	`{}`	No
`customPrompts`	Dictionary with prompts.	dict	See Nested Parameters below	No
`scoreToLabel`	Mapping of hallucination scores to labels	dict	See Default Configuration above	No
`scoreToTitle`	Mapping of hallucination scores to descriptive titles	dict	See Default Configuration above	No

Nested Parameters

customPrompts

Field Name	Description	Type	Default Value
`systemPrompt`	System prompt for hallucination check	string	See Default Configuration above
`userPrompt`	User prompt for hallucination check	string	See Default Configuration above
`systemPromptDefault`	System prompt used when no contexts and no conversation history are available for hallucination evaluation	string	See Default Configuration above
`userPromptDefault`	User prompt used when no contexts and no conversation history are available for hallucination evaluation	string	See Default Configuration above
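
The choice between the regular and the default prompt pair can be sketched as follows. This assumes the default pair applies only when both the contexts and the conversation history are empty, which is one reading of the table above. `select_prompts` is a hypothetical helper, and the prompt strings in the usage example are shortened placeholders.

```python
def select_prompts(
    custom_prompts: dict[str, str], contexts: list[str], history: list[str]
) -> tuple[str, str]:
    """Return the (system prompt, user prompt) pair for the evaluation call."""
    if not contexts and not history:
        # Nothing to ground the output in: fall back to the default pair.
        return (
            custom_prompts["systemPromptDefault"],
            custom_prompts["userPromptDefault"],
        )
    return custom_prompts["systemPrompt"], custom_prompts["userPrompt"]


# Shortened placeholder prompts, for illustration only.
example_prompts = {
    "systemPrompt": "system (with references)",
    "userPrompt": "user (with references)",
    "systemPromptDefault": "system (no references)",
    "userPromptDefault": "user (no references)",
}

with_context = select_prompts(example_prompts, ["Some retrieved chunk."], [])
without_context = select_prompts(example_prompts, [], [])
```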

Conclusion

Hallucination evaluation is a critical component in the effective deployment of RAG-based systems. By categorizing responses into Low, Medium, and High hallucination levels, the system can provide users with valuable insights into the reliability of the generated content. This evaluation process not only enhances user confidence but also supports the responsible use of AI in contexts where accuracy is crucial.