Evaluation Runs (Eval Runs)
What are Evaluation Runs?
Evaluation runs (or "Eval Runs") are the cornerstone of the Lynxius platform. They enable you to evaluate the quality of your LLM Apps using our comprehensive set of evaluators, helping you determine whether your development efforts are enhancing the quality of your LLM products.
With Bulk Evaluations you can run a set of evaluators over numerous data entries in parallel. This is an efficient way to evaluate multiple input/output pairs and edge cases. A Dataset is a collection of inputs and expected outputs of an LLM App that you can use to run bulk evaluations.
Tags can be used to filter evaluation runs by various criteria, while trends can be used to plot their aggregate behaviour.
Run an Eval Run
To run an Eval Run with the Lynxius Python library, call the `evaluate()` method of the `LynxiusClient` class.
Example 1 (Single Evaluation)
In this example, we run an evaluation on a single data point.
```python
from lynxius.client import LynxiusClient
from lynxius.evals.answer_correctness import AnswerCorrectness

client = LynxiusClient()

label = "PR #111"
tags = ["GPT-4", "chat_pizza", "Pizza-DB:v2"]

answer_correctness = AnswerCorrectness(label=label, tags=tags)
answer_correctness.add_trace(
    query="Query here",
    reference="Human label here",
    output="Your LLM App actual output here"
)

client.evaluate(answer_correctness)
```
Example 2 (Bulk Evaluation)
In this example, we run a bulk evaluation over a dataset downloaded from the Lynxius platform.
```python
from lynxius.client import LynxiusClient
from lynxius.evals.answer_correctness import AnswerCorrectness

client = LynxiusClient()

# Download the dataset from the Lynxius platform
dataset_details = client.get_dataset_details(
    dataset_id="d80e9f20-2590-4f95-ac7d-71b9d9c4bb20"
)

label = "PR #111"
tags = ["GPT-4", "chat_pizza", "Pizza-DB:v2"]

answer_correctness = AnswerCorrectness(label=label, tags=tags)
for entry in dataset_details.entries:
    # Placeholders: run your own LLM App on entry.query to produce the
    # actual output and, if applicable, the context it retrieved.
    actual_output = "Your LLM App actual output here"
    context = ["Context retrieved by your LLM App here"]

    answer_correctness.add_trace(
        query=entry.query,
        reference=entry.reference,
        output=actual_output,
        context=context
    )

client.evaluate(answer_correctness)
```
Inspect Eval Runs
All the Eval Runs for a project are visible after clicking the button for the desired project on the projects page. You can use the search bar to filter them by tags.
This page displays high-level metrics for each Eval Run like:
- Run Status, which indicates whether the Eval Run is finished, ongoing, or failed.
- Mean, which is useful for bulk evaluations and represents the average score over all entries in the dataset.
- p20, which is useful for bulk evaluations and represents the score below which 20% of the dataset entries fall. It highlights the worst-performing 20% of entries.
- p90, which is useful for bulk evaluations and represents the score below which 90% of the dataset entries fall. It highlights the performance of the majority of your data, as it excludes the top 10% of high-performing entries (see the sketch after this list for how these aggregates are computed).
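To make these aggregates concrete, here is a minimal sketch (not the platform's implementation) that computes Mean, p20, and p90 over a hypothetical list of per-entry evaluator scores using NumPy:

```python
# Illustration only: how Mean, p20, and p90 summarize per-entry scores.
# The `scores` list is hypothetical; on the platform these are the
# evaluator scores computed for each dataset entry in a bulk evaluation.
import numpy as np

scores = [0.92, 0.88, 0.35, 0.71, 0.99, 0.64, 0.80, 0.55, 0.90, 0.77]

mean = np.mean(scores)           # average score across all entries
p20 = np.percentile(scores, 20)  # 20% of entries score at or below this value
p90 = np.percentile(scores, 90)  # 90% of entries score at or below this value

print(f"Mean: {mean:.2f}, p20: {p20:.2f}, p90: {p90:.2f}")
```

A p20 that sits well below the Mean is a quick signal that a subset of entries is dragging quality down and is worth inspecting.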
Clicking the ID of a specific Eval Run opens a page with detailed results for every data entry in the bulk evaluation. Here, the values of each Dataset Entry are shown alongside the evaluators' scores for that entry.
Add Data Entry Edge Case to Datasets
If you spot an edge case among the data entries while inspecting a bulk evaluation's Eval Run, you can add it to a dataset of your choice immediately: simply click the button and select the relevant dataset. Once the specific input/output pair is in your dataset, it can be reviewed and annotated by a human.