Answer Correctness
Info
Answer Correctness measures the correctness and completeness of the generated answer.
Answer Correctness evaluates the accuracy of the generated answer against the reference (ground truth) answer, making it ideal for tasks like question answering. It returns a score from 0 to 1, with higher scores indicating better alignment, focusing on factual correctness. While our approach is inspired by Ragas¹, it incorporates distinct methodologies.
Calculation
Factual correctness quantifies the factual overlap between the generated output and the reference (ground truth) answer. This is done using the concepts of:
- TP (True Positive): Clauses or statements present in the generated output that are also directly supported by one or more clauses or statements in the reference.
- FP (False Positive): Clauses or statements present in the generated output but not directly supported by any statement in the reference.
- FN (False Negative): Clauses or statements present in the reference but not present in the generated output.
The formulas for precision \(P\), recall \(R\), and F1 score \(F1\) are as follows:

\[
P = \frac{TP}{TP + FP}, \qquad
R = \frac{TP}{TP + FN}, \qquad
F1 = \frac{2 \cdot P \cdot R}{P + R}
\]
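As a minimal, standalone sketch of these formulas (not part of the Lynxius SDK), the three scores can be computed directly from the clause-level counts:

```python
def answer_correctness_scores(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Compute precision, recall, and F1 from clause-level TP/FP/FN counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1


# Counts from the worked example below: 2 TP, 1 FP, 1 FN
print(answer_correctness_scores(2, 1, 1))  # (0.666..., 0.666..., 0.666...)
```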
Example
Tip
Please consult our full Swagger API documentation to run this evaluator via APIs.
```python
from lynxius.client import LynxiusClient
from lynxius.evals.answer_correctness import AnswerCorrectness

client = LynxiusClient()

# Add tags for frontend filtering
label = "PR #111"
tags = ["GPT-4", "chat_pizza", "q_answering", "PROD", "Pizza-DB:v2"]

answer_correctness = AnswerCorrectness(label=label, tags=tags)

answer_correctness.add_trace(
    query="What is pizza quattro stagioni? Keep it short.",
    # Reference from Wikipedia (https://github.com/lynxius/lynxius-docs/blob/main/docs/public/images/quattro_stagioni_wikipedia_reference.png)
    reference=(
        "Pizza quattro stagioni ('four seasons pizza') is a variety of pizza "
        "in Italian cuisine that is prepared in four sections with diverse "
        "ingredients, with each section representing one season of the year. "
        "Artichokes represent spring, tomatoes or basil represent summer, "
        "mushrooms represent autumn and the ham, prosciutto or olives represent "
        "winter."
    ),
    # Output from OpenAI GPT-4 (https://github.com/lynxius/lynxius-docs/blob/main/docs/public/images/quattro_stagioni_gpt4_output.png)
    output=(
        "Pizza Quattro Stagioni is an Italian pizza that represents the four "
        "seasons through its toppings, divided into four sections. Each section "
        "features ingredients typical of a particular season, like artichokes "
        "for spring, peppers for summer, mushrooms for autumn, and olives or "
        "prosciutto for winter."
    )
)

client.evaluate(answer_correctness)
```
Click on the Eval Run link of your project to explore the output of your evaluation. The Result UI Screenshot tab below shows the result on the UI, while the Result Values tab provides an explanation.
Score | Value | Interpretation |
---|---|---|
Score | 0.66667 | The output is only partially correct when compared to the reference. |
Evaluator Output | | Two True Positives, one False Positive, and one False Negative have been detected. |
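These counts are consistent with the reported score: plugging \(TP = 2\), \(FP = 1\), and \(FN = 1\) into the formulas from the Calculation section gives

\[
P = \frac{2}{2 + 1} = \tfrac{2}{3}, \qquad
R = \frac{2}{2 + 1} = \tfrac{2}{3}, \qquad
F1 = \frac{2 \cdot \tfrac{2}{3} \cdot \tfrac{2}{3}}{\tfrac{2}{3} + \tfrac{2}{3}} = \tfrac{2}{3} \approx 0.66667
\]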
Inputs & Outputs
Args | |
---|---|
`label` | A `str` that represents the current Eval Run. This is ideally the number of the pull request that ran the evaluator. |
`href` | A `str` representing a URL that is associated with the `label` on the Lynxius platform. This ideally points to the pull request that ran the evaluator (see the sketch after these tables). |
`tags` | A `list[str]` of tags for filtering Eval Runs on the UI. |
`data` | An instance of `QueryReferenceOutputTriplet`. |
Returns | |
---|---|
`uuid` | The UUID of this Eval Run. |
`precision` | A `float` in the range [0.0, 1.0] indicating how many of the retrieved clauses or statements are correct and complete. A score of 1.0 means every clause or statement in the output is correct and complete with respect to the reference, while 0.0 means none of them are. |
`recall` | A `float` in the range [0.0, 1.0] indicating how many relevant clauses or statements are retrieved. A score of 1.0 means every relevant clause or statement in the reference appears in the output, while 0.0 means none of them do. |
`f1` | A `float` in the range [0.0, 1.0] representing the overall accuracy of the output compared to the reference. A score of 1.0 indicates perfect agreement with the reference, while 0.0 indicates no agreement. |
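As a hedged sketch of how `href` might be supplied together with `label` and `tags`, the snippet below assumes, based on the Args table above, that `href` is accepted as a constructor keyword argument; the pull request URL and the trace strings are purely illustrative.

```python
from lynxius.client import LynxiusClient
from lynxius.evals.answer_correctness import AnswerCorrectness

client = LynxiusClient()

# Assumption: `href` is passed alongside `label` and `tags`, linking the
# "PR #111" label to the pull request that ran the evaluator (hypothetical URL).
answer_correctness = AnswerCorrectness(
    label="PR #111",
    href="https://github.com/your-org/your-repo/pull/111",
    tags=["GPT-4", "q_answering", "PROD"],
)

answer_correctness.add_trace(
    query="What is pizza quattro stagioni? Keep it short.",
    reference="Pizza quattro stagioni is prepared in four sections ...",  # ground truth
    output="Pizza Quattro Stagioni represents the four seasons ...",      # model answer
)

client.evaluate(answer_correctness)
```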
1. ‘Answer Correctness’ (2024) Ragas. Available at: https://docs.ragas.io/en/stable/concepts/metrics/answer_correctness.html (Accessed: May 24, 2024). ↩