Answer Correctness

Info

Answer Correctness measures the correctness and completeness of the generated answer.

Answer Correctness evaluates the accuracy of the generated answer when compared to the reference (ground truth) answer, making it ideal for tasks like question answering. It returns a score from 0 to 1, where higher scores indicate better alignment with the reference, with a focus on factual correctness. While our approach is inspired by Ragas1, it incorporates distinct methodologies.

Calculation

Factual correctness quantifies the factual overlap between the generated output and the reference (ground truth) answer. This is done using the following concepts:

  • TP (True Positive): Clauses or statements present in the generated output that are also directly supported by one or more clauses or statements in the reference.
  • FP (False Positive): Clauses or statements present in the generated output but not directly supported by any statement in the reference.
  • FN (False Negative): Clauses or statements present in the reference but not present in the generated output.

The formulas for precision \(P\), recall \(R\), and F1 score \(F1\) are as follows:

\[ P=\frac{|TP|}{|TP| + |FP|} \]
\[ R=\frac{|TP|}{|TP| + |FN|} \]
\[ F1=\frac{|TP|}{|TP| + 0.5 \cdot (|FP| + |FN|)} \]
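For intuition, this F1 expression is the harmonic mean of precision and recall, \( \frac{2 \cdot P \cdot R}{P + R} \), rewritten in terms of the counts. The snippet below is a minimal, standalone sketch (plain Python, not part of the Lynxius SDK) showing how the three scores follow from the clause-level counts:

def answer_correctness_scores(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall and F1 from clause-level TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1 written directly in terms of the counts; for tp > 0 this equals
    # the harmonic mean 2 * precision * recall / (precision + recall)
    f1 = tp / (tp + 0.5 * (fp + fn)) if (tp + fp + fn) else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# 2 TP, 1 FP and 1 FN (as in the example below) give 2/3 for all three scores
print(answer_correctness_scores(tp=2, fp=1, fn=1))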

Example

Tip

Please consult our full Swagger API documentation to run this evaluator via the API.

from lynxius.client import LynxiusClient
from lynxius.evals.answer_correctness import AnswerCorrectness

client = LynxiusClient()

# add tags for frontend filtering
label = "PR #111"
tags = ["GPT-4", "chat_pizza", "q_answering", "PROD", "Pizza-DB:v2"]
answer_correctness = AnswerCorrectness(label=label, tags=tags)

answer_correctness.add_trace(
    query="What is pizza quattro stagioni? Keep it short.",
    # reference from Wikipedia (https://github.com/lynxius/lynxius-docs/blob/main/docs/public/images/quattro_stagioni_wikipedia_reference.png)
    reference=(
        "Pizza quattro stagioni ('four seasons pizza') is a variety of pizza "
        "in Italian cuisine that is prepared in four sections with diverse "
        "ingredients, with each section representing one season of the year. "
        "Artichokes represent spring, tomatoes or basil represent summer, "
        "mushrooms represent autumn and the ham, prosciutto or olives represent "
        "winter."
    ),
    # output from OpenAI GPT-4 (https://github.com/lynxius/lynxius-docs/blob/main/docs/public/images/quattro_stagioni_gpt4_output.png)
    output=(
        "Pizza Quattro Stagioni is an Italian pizza that represents the four "
        "seasons through its toppings, divided into four sections. Each section "
        "features ingredients typical of a particular season, like artichokes "
        "for spring, peppers for summer, mushrooms for autumn, and olives or "
        "prosciutto for winter."
    )
)

client.evaluate(answer_correctness)

Click on the Eval Run link of your project to explore the output of your evaluation. The Result UI Screenshot tab below shows the result in the UI, while the Result Values tab provides an explanation.

[Screenshot: Answer Correctness Eval Run in the Lynxius UI]

Score: 0.66667
Interpretation: The output is only partially correct when compared to the reference.
Evaluator Output
{
    "TP": [
        "Pizza Quattro Stagioni is an Italian pizza that represents the four seasons through its toppings, divided into four sections.",
        "Each section features ingredients typical of a particular season, like artichokes for spring, mushrooms for autumn, and olives or prosciutto for winter."
    ],
    "FP": [
        "peppers for summer"
    ],
    "FN": [
        "tomatoes or basil represent summer"
    ]
}
Two True Positives, one False Positive, and one False Negative have been detected.
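Plugging these counts into the formulas from the Calculation section reproduces the reported score:

\[ P = \frac{2}{2 + 1} \approx 0.66667 \]
\[ R = \frac{2}{2 + 1} \approx 0.66667 \]
\[ F1 = \frac{2}{2 + 0.5 \cdot (1 + 1)} = \frac{2}{3} \approx 0.66667 \]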

Inputs & Outputs

Args
label A str that represents the current Eval Run. This is ideally the number of the pull request that ran the evaluator.
href A str representing a URL that gets associated with the label on the Lynxius platform. This ideally points to the pull request that ran the evaluator.
tags A list[str] of tags used to filter Eval Runs in the UI.
data An instance of QueryReferenceOutputTriplet.
Returns
uuid The UUID of this Eval Run.
precision A float in the range [0.0, 1.0] that indicates how many of the clauses or statements in the output are correct, i.e. supported by the reference. A score of 1.0 indicates that every clause or statement in the output is supported by the reference, while a score of 0.0 indicates that none of them are.
recall A float in the range [0.0, 1.0] that indicates how many of the clauses or statements in the reference are covered by the output. A score of 1.0 indicates that every clause or statement in the reference appears in the output, while a score of 0.0 indicates that none of them do.
f1 A float in the range [0.0, 1.0] that represents the overall factual accuracy of the output compared to the reference, combining precision and recall. A score of 1.0 indicates a perfect factual match with the reference, while a score of 0.0 indicates no factual overlap.

  1. ‘Answer Correctness’ (2024) Ragas. Available at: https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html (Accessed: May 24, 2024)