Baseline Comparisons
While developing LLM Apps, it is important to compare your latest code changes against your system Baseline to understand whether they boost performance or introduce a regression.
- Integrating Lynxius into your CI/CD pipeline makes it straightforward to understand the quality of each code change and determine which should be accepted or rejected. This helps identify issues early, before they poison your code base.
- Integrating Lynxius into your production system makes it straightforward to understand the quality of your running production instance, spot production incidents early, and run Root Cause Analysis.
The Baseline is the state of your system over time, measured by specific metrics such as answer correctness, BERTScore, answer completeness, factuality, and context precision.
A team can monitor multiple Baselines depending on its development needs. The Main-Branch Baseline and the Production Baseline are the most commonly used.
Main-Branch Baseline
The Main-Branch Baseline monitors the quality of your main development branch and helps determine if a new Pull Request (PR) improves the performance of the system. This enables you to decide whether to merge it into the main branch.
To ensure consistent comparisons between the Main-Branch Baseline and a PR, it is important that Evaluators' scores are calculated using the same input dataset. The last thing you want to do is compare apples with oranges! Lynxius enables you to store your Dataset on the platform, allowing you to fetch it with a single command from your main branch or any other feature branch.
from lynxius.client import LynxiusClient
client = LynxiusClient(run_local=True)
# Download a dataset previously uploaded to the Lynxius Platform
dataset_details = client.get_dataset_details(dataset_id="YOUR_DATASET_UUID")
Production Baseline
The Production Baseline monitors the quality of your production system over real user data. It is useful in various scenarios, such as:
- Understanding if your application is experiencing issues in production;
- Detecting an unexpected surge of atypical user queries (edge cases) your application was not built to handle;
- Monitoring if the development efforts of your team (tracked by the Main-Branch Baseline) are improving the production system.
The Production Baseline is calculated in real time using your real users' data on the running production system. Both the Production Baseline and Traces are stored on the Lynxius platform, so developers and operations teams can quickly run root cause analysis to identify the cause of an incident or misbehaviour of your application.
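To make "spotting a production incident early" concrete, here is a minimal sketch of flagging a sudden drop in a series of Baseline scores. It does not use the Lynxius API: the function name `find_score_drops`, the `window` and `max_drop` parameters, and the sample scores are all assumptions made for illustration, and the rule (compare each point with the mean of the preceding points) is just one simple heuristic.

```python
from statistics import mean


def find_score_drops(scores, window=3, max_drop=0.2):
    """Return the indices of scores that fall more than `max_drop`
    below the mean of the `window` scores immediately before them.

    Hypothetical helper for illustration; not part of Lynxius.
    """
    drops = []
    for i in range(window, len(scores)):
        reference = mean(scores[i - window:i])
        if reference - scores[i] > max_drop:
            drops.append(i)
    return drops


# Made-up production scores, one per Eval Run, oldest first.
production_scores = [0.82, 0.80, 0.84, 0.81, 0.35, 0.79]
print(find_score_drops(production_scores))  # [4]
```

The flagged index points at the run worth inspecting first; a real setup would likely tune `window` and `max_drop` per metric.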
Interpreting Baseline Comparisons
Plotting Baselines together makes regressions and incidents easy to spot. The Trend plot displayed below reveals several key insights:
- The Production Baseline experienced a significant drop on June 17, 2024. Clicking on the faulty data point will take you directly to the Eval Run that caused the Production Incident for a quick inspection.
- The Main-Branch Baseline has been consistently above the Production Baseline for almost the entire month, as expected. However, there was an issue on July 3, 2024. It appears that a problematic PR was merged into the codebase. The current version of the main branch must not be deployed to production!
- The CI/CD scores suggest that a developer has already pushed a new PR that resolves the issue with the main branch. You should merge this latest PR into the main branch before deploying to production! The direct comparison between CI/CD and Main-Branch is even more evident on the CI/CD Project page.
Our chat_pizza application is designed to answer pizza-related questions. However, upon inspecting the faulty Eval Run, we noticed that users asked multiple medical questions that our chatbot was not able to answer.
The mean score value for the new PR is 0.89, which is greater than the latest value of the Main-Branch Baseline (0.37). It is clear that this PR can be merged into main.
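The merge decision described above can be sketched as a small CI check. This is an illustration under stated assumptions, not the Lynxius API: `should_merge` and `min_gain` are hypothetical names, the baseline value 0.37 comes from the text, and the per-sample PR scores are made-up values chosen to average roughly 0.89.

```python
from statistics import mean


def should_merge(pr_scores, baseline_score, min_gain=0.0):
    """Return True when the PR's mean score beats the baseline by
    more than `min_gain`.

    Hypothetical helper for illustration; not part of Lynxius.
    """
    if not pr_scores:
        raise ValueError("pr_scores must contain at least one score")
    return mean(pr_scores) > baseline_score + min_gain


# Made-up per-sample scores for the new PR (mean is about 0.89).
pr_scores = [0.91, 0.87, 0.89]
main_branch_baseline = 0.37  # latest Main-Branch Baseline value from the text

print(should_merge(pr_scores, main_branch_baseline))  # True
```

Setting `min_gain` above zero is one way to avoid merging on noise-level improvements; the right margin depends on the metric and dataset size.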
Structuring Baseline Comparison
To fully benefit from all the features of the Lynxius platform, we recommend creating a new baseline Project for each development stage you wish to monitor. Main-Branch Baseline and Production Baseline are the most commonly used, and we encourage creating separate projects for each.
An LLM App often comprises multiple tasks (e.g., information retrieval, text generation, summarization, etc.), and you likely want to evaluate each task independently using different metrics and test sets. To facilitate this, we introduce the concept of baseline_eval_run_label to distinguish the relevant tasks within a baseline Project.
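As a rough illustration of why a per-task label helps, the sketch below groups scores by label so each task gets its own trend value. It is plain Python and makes no claim about how Lynxius stores or exposes baseline_eval_run_label: the `(label, score)` tuples, the label strings, and the function `mean_score_by_label` are assumptions made for this example.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_label(records):
    """Group (label, score) pairs by label and return the mean score
    for each label.

    Hypothetical helper for illustration; not part of Lynxius.
    """
    grouped = defaultdict(list)
    for label, score in records:
        grouped[label].append(score)
    return {label: mean(scores) for label, scores in grouped.items()}


# Made-up Eval Run results, each tagged with a task label.
records = [
    ("retrieval", 0.5),
    ("generation", 0.25),
    ("retrieval", 1.0),
    ("generation", 0.75),
]
print(mean_score_by_label(records))
# {'retrieval': 0.75, 'generation': 0.5}
```

Keeping tasks separate this way avoids a strong retrieval score masking a weak generation score inside a single aggregate.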
The Quickstart Baselines guide offers a practical example of how to set up a Baseline.