Quickstart Evaluation
Evaluate the quality of your LLM App with Lynxius in just one minute! Ship AI with confidence by following this quickstart guide.
Sign up for free here to create an account. Don't forget to validate your email!
Install the Lynxius Library
Install the Lynxius Python library locally or on your server.
pip install lynxius
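To confirm the installation, open a Python shell and import the package. This is just a smoke test; the client itself is imported from lynxius.client, as shown later in this guide.
import lynxius  # should import cleanly after `pip install lynxius`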
Create Your First Project and API Key
Create your first project and generate an API key for it to start your LLM evaluations. Remember to store your secret key somewhere safe: you won't be able to access its value again.
Use the newly generated secret key to configure your project API key locally or on your server.
Linux / macOS:
export LYNXIUS_API_KEY='your-api-key-here'
Windows (Command Prompt):
set LYNXIUS_API_KEY=your-api-key-here
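If you prefer to manage secrets in code rather than in your shell profile, you can also load the key into the environment from Python before creating the client. This is a minimal sketch that assumes you keep the key in a local .env file and have installed python-dotenv; it is optional and not required by Lynxius itself.
# optional: load LYNXIUS_API_KEY from a .env file (assumes `pip install python-dotenv`)
import os
from dotenv import load_dotenv

load_dotenv()  # copies key=value pairs from .env into os.environ
assert os.getenv("LYNXIUS_API_KEY"), "LYNXIUS_API_KEY is not set"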
Select Evaluation Setup
Remote vs Local Evaluations
The Lynxius remote evaluation setup is a fully managed service where the API keys for the models used in testing are provided and managed by the Lynxius team. Evaluation tasks are executed in the background, freeing you from delays and compute costs.
The Lynxius local evaluation setup is convenient if you prefer to use your OpenAI free credits and are comfortable managing the API keys for the models used in testing.
If you plan to run evaluations remotely, you can skip to Run Evals Remotely. If you prefer to run evaluations locally, make sure your OpenAI API key is set in your environment.
Linux / macOS:
export OPENAI_API_KEY='your-api-key-here' # only needed to run evals locally
Windows (Command Prompt):
REM only needed to run evals locally
set OPENAI_API_KEY=your-api-key-here
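Before kicking off a local run, it can help to fail fast if the key is missing. A small optional sanity check (not part of the Lynxius API):
import os

# only relevant for local evals: stop early if the OpenAI key is missing
if not os.getenv("OPENAI_API_KEY"):
    raise RuntimeError("Set OPENAI_API_KEY before running evals locally")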
Run Evaluations
We will use the Answer Correctness and Semantic Similarity evaluators from our library to evaluate the quality of the chat_pizza LLM App.
Run Evals Remotely
# import test dataset
# `chat_pizza` LLM App used OpenAI GPT-4 to produce its outputs
dataset = [
{
"query": "What is the first tomato-topped pizza? Keep it short.",
"reference": (
"Pizza marinara is supposedly the oldest tomato-topped pizza."
),
"output": (
"The first tomato-topped pizza is the Margherita, created in "
"1889 in Naples, Italy."
),
},
{
"query": "When did pizza arrive in the United States? Keep it short.",
"reference": (
"The first pizzeria in the U.S. was opened in New York City's "
"Little Italy in 1905."
),
"output": (
"Pizza arrived in the United States in late 19th century, brought "
"by Italian immigrants."
),
},
{
"query": "Which tomato sauce is used in neapolitan pizza? Keep it short.",
"reference": (
"The tomato sauce of Neapolitan pizza must be made with San Marzano "
"tomatoes or pomodorini del Piennolo del Vesuvio."
),
"output": (
"San Marzano tomatoes are traditionally used in Neapolitan pizza "
"sauce."
),
}
]
# start evaluating!
from lynxius.client import LynxiusClient
from lynxius.evals.answer_correctness import AnswerCorrectness
from lynxius.evals.semantic_similarity import SemanticSimilarity
client = LynxiusClient()
# add tags for frontend filtering
label = "PR #111"
tags = ["GPT-4", "chat_pizza", "q_answering", "PROD", "Pizza-DB:v2"]
answer_correctness = AnswerCorrectness(label=label, tags=tags)
semantic_similarity = SemanticSimilarity(label=label, tags=tags)
for entry in dataset:
answer_correctness.add_trace(
query=entry["query"],
reference=entry["reference"],
output=entry["output"],
context=[]
)
semantic_similarity.add_trace(
reference=entry["reference"],
output=entry["output"],
context=[]
)
client.evaluate(answer_correctness)
client.evaluate(semantic_similarity)
Run Evals Locally
The code is identical to the remote example above, including the dataset and the evaluators; the only difference is that the client is instantiated with run_local=True so the evaluators run on your machine using your own OpenAI API key.
client = LynxiusClient(run_local=True) # run evals locally
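As noted in the dataset comment, the output fields above were produced by the chat_pizza app using GPT-4. If you want to regenerate outputs for your own app, a minimal sketch with the OpenAI Python SDK might look like the following; the prompt and model choice are illustrative, not the actual chat_pizza implementation.
# illustrative only: regenerate `output` fields with a plain GPT-4 call
# (assumes `pip install openai` and OPENAI_API_KEY set in the environment)
from openai import OpenAI

openai_client = OpenAI()

def generate_output(query: str) -> str:
    response = openai_client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": query}],
    )
    return response.choices[0].message.content

for entry in dataset:
    entry["output"] = generate_output(entry["query"])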
View Results
Click on the Eval Run link of your project to explore the output of your evaluation. The Eval Runs page reports the average score of each metric over the entire test dataset; the tables below show the results as they appear in the UI and explain how to interpret them.
Evaluator | Aggregated Value | Interpretation |
---|---|---|
Semantic Similarity | 0.72 | On average, the output is semantically similar to the reference across the input dataset. |
Answer Correctness | 0.22 | On average, the output is minimally correct when compared to the reference across the input dataset. |
Click on the Answer Correctness entry on the Eval Runs page for more granular information on the score of each dataset row. The last row result is described below.
Score | Value | Interpretation |
---|---|---|
Score | 0.66667 | The output is only partially correct when compared to the reference. |
Evaluator Output | | One True Positive and one False Negative have been detected. |
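The 0.66667 value is consistent with an F1-style aggregation over the statements detected in the output; this formula is our reading of the number shown, not an official specification of the evaluator.
# assumed F1-style aggregation over detected statements (our interpretation)
tp, fp, fn = 1, 0, 1
score = 2 * tp / (2 * tp + fp + fn)
print(round(score, 5))  # 0.66667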
Click on the Semantic Similarity entry on the Eval Runs page for more granular information on the score of each dataset row. The last row result is described below.
Score | Value | Interpretation |
---|---|---|
Similarity | 0.82486 | The output is semantically similar to the reference, although it omits the second type of tomatoes allowed, "pomodorini del Piennolo del Vesuvio." |
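Semantic similarity scores of this kind are commonly computed as the cosine similarity between embeddings of the output and the reference. The sketch below illustrates the idea with OpenAI embeddings; the embedding model is an assumption, not necessarily what the Lynxius evaluator uses under the hood.
# illustrative only: cosine similarity between output and reference embeddings
# (the embedding model below is an assumption)
from openai import OpenAI

openai_client = OpenAI()

def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sum(x * x for x in a) ** 0.5
    norm_b = sum(x * x for x in b) ** 0.5
    return dot / (norm_a * norm_b)

reference = dataset[-1]["reference"]
output = dataset[-1]["output"]
embeddings = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input=[reference, output],
)
ref_vec, out_vec = (item.embedding for item in embeddings.data)
print(cosine_similarity(ref_vec, out_vec))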
Conclusion
The chat_pizza average Semantic Similarity over the input test dataset is 0.72; however, the average Answer Correctness is only 0.22. It's time to roll up our sleeves and improve the quality of the chat_pizza LLM App!