BERTScore

BERTScore, introduced by Zhang, Tianyi, et al.1, evaluates text generation quality, making it well suited to tasks like text summarization and machine translation. It computes a similarity score between each token in the candidate sentence (the generated text) and each token in the reference (ground-truth) sentence. Instead of requiring exact matches, token similarity is computed on embeddings produced by a Text Embedding Model, so the context in which words appear is taken into account. This sets it apart from traditional metrics like BLEU or ROUGE, which rely on n-gram overlap and can be limited in capturing semantic meaning.

Calculation

The BERTScore computation involves the following steps:

  1. Embedding Extraction: Extract embeddings for each word (or sentence) token in the output and reference.
  2. Similarity Calculation: Compute the cosine similarity between each word (or sentence) token embedding in the output and each word (or sentence) token embedding in the reference.
  3. Matching Tokens: For each word (or sentence) token in the output, find the highest similarity score with any word (or sentence) token in the reference (precision). Similarly, for each word (or sentence) token in the reference, find the highest similarity score with any word (or sentence) token in the output (recall).

The formulas for precision \(P\), recall \(R\), and F1 score \(F1\) are as follows:

\[ P=\frac{1}{|c|}\sum_{x \in c}\max_{y \in r}sim(x,y) \]
\[ R=\frac{1}{|r|}\sum_{y \in r}\max_{x \in c}sim(x,y) \]
\[ F1=2\cdot \frac{P \cdot R}{P + R} \]

where:

  • \(c\) is the set of embeddings for the output word (or sentence) tokens.
  • \(r\) is the set of embeddings for the reference word (or sentence) tokens.
  • \(sim(x,y)\) is the cosine similarity between embeddings \(x\) and \(y\).
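
As a concrete illustration of these formulas, the minimal sketch below computes the greedy-matching precision, recall, and F1 over toy embedding vectors. It is not the Lynxius implementation: the vectors are made up for illustration, and a real run uses embeddings produced by a text embedding model.

import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def bert_score(candidate_embs, reference_embs):
    """Greedy-matching precision, recall, and F1 over token embeddings."""
    # Similarity matrix: rows are candidate tokens, columns are reference tokens.
    sims = np.array([[cosine_sim(x, y) for y in reference_embs]
                     for x in candidate_embs])
    precision = sims.max(axis=1).mean()  # best reference match for each candidate token
    recall = sims.max(axis=0).mean()     # best candidate match for each reference token
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Toy 3-dimensional "embeddings", just to exercise the formulas.
candidate = [np.array([0.9, 0.1, 0.0]), np.array([0.2, 0.8, 0.1])]
reference = [np.array([1.0, 0.0, 0.0]),
             np.array([0.1, 0.9, 0.2]),
             np.array([0.0, 0.2, 1.0])]

p, r, f1 = bert_score(candidate, reference)
print(f"P={p:.3f} R={r:.3f} F1={f1:.3f}")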

Example

Tip

Please consult our full Swagger API documentation to run this evaluator via the API.

from lynxius.client import LynxiusClient
from lynxius.evals.bert_score import BertScore

client = LynxiusClient()

# add tags for frontend filtering
label = "PR #111"
tags = ["GPT-4", "chat_pizza", "summarization", "PROD", "Pizza-DB:v2"]
bert_score = BertScore(
    label=label,
    tags=tags,
    level="word",
    presence_threshold=0.55
)

bert_score.add_trace(
    # reference from Wikipedia (https://github.com/lynxius/lynxius-docs/blob/main/docs/public/images/san_marzano_wikipedia_reference.png)
    reference=(
        "The tomato sauce of Neapolitan pizza must be made with San Marzano "
        "tomatoes or pomodorini del Piennolo del Vesuvio."
    ),
    # output from OpenAI GPT-4 (https://github.com/lynxius/lynxius-docs/blob/main/docs/public/images/san_marzano_gpt4_output.png)
    output=(
        "San Marzano tomatoes are traditionally used in Neapolitan pizza sauce."
    )
)

client.evaluate(bert_score)

Click on the Eval Run link of your project to explore the output of your evaluation. The Result UI Screenshot tab below shows the result in the UI, while the Result Values tab provides an explanation.

Screenshot: BERTScore Eval Run

| Score | Value | Interpretation |
| --- | --- | --- |
| Precision | 0.83029 | Many tokens in the output are similar to the reference text. |
| Recall | 0.69723 | A large share of the tokens in the reference are similar to the output text. |
| F1 Score | 0.75796 | Both precision and recall are good. |

Missing Tokens

The tomato sauce of Neapolitan pizza must be made with San Marzano tomatoes or pomodorini del Piennolo del Vesuvio

The words (or sentences) highlighted in red in the UI are present in the reference but missing in the output. Comparisons in this example are based on words, as determined by the input argument level.
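
If you need to compare entire sentences rather than individual words, the same trace can be evaluated with level="sentence". The sketch below reuses the label, tags, and client objects from the example above; the scores reported here apply to the word-level run only.

from lynxius.evals.bert_score import BertScore

# Same trace as above, compared sentence by sentence instead of word by word.
bert_score_sentences = BertScore(
    label=label,
    tags=tags,
    level="sentence",
    presence_threshold=0.55
)

bert_score_sentences.add_trace(
    reference=(
        "The tomato sauce of Neapolitan pizza must be made with San Marzano "
        "tomatoes or pomodorini del Piennolo del Vesuvio."
    ),
    output=(
        "San Marzano tomatoes are traditionally used in Neapolitan pizza sauce."
    )
)

client.evaluate(bert_score_sentences)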

Inputs & Outputs

Args

| Argument | Description |
| --- | --- |
| label | A str that represents the current Eval Run. This is ideally the number of the pull request that ran the evaluator. |
| href | A str representing a URL that is associated with the label on the Lynxius platform. This ideally points to the pull request that ran the evaluator. |
| tags | A list[str] of tags used to filter Eval Runs in the UI. |
| level | A str that can be either word or sentence. With word, words in the reference text with a similarity score below presence_threshold when compared to words in the candidate text are flagged; with sentence, the same check is applied to entire sentences. |
| presence_threshold | A float threshold. Token comparisons between reference and output with similarity scores below this value are returned among the missing_tokens. |
| data | An instance of ReferenceOutputPair. |

Returns

| Field | Description |
| --- | --- |
| uuid | The UUID of this Eval Run. |
| precision | A float in the range [0.0, 1.0] that measures how many of the word (or sentence) tokens in the output are similar to any word (or sentence) token in the reference. Comparisons are run on words or sentences depending on the input argument level. A score of 1.0 indicates that all tokens in the output are perfectly similar to the reference, while a score of 0.0 indicates that no tokens in the output are similar to the reference. |
| recall | A float in the range [0.0, 1.0] that measures how many of the word (or sentence) tokens in the reference are similar to any word (or sentence) token in the output. Comparisons are run on words or sentences depending on the input argument level. A score of 1.0 indicates that all tokens in the reference are perfectly similar to the output, while a score of 0.0 indicates that no tokens in the reference are similar to the output. |
| f1 | A float in the range [0.0, 1.0] that represents the overall similarity between the output and reference words (or sentences). Comparisons are run on words or sentences depending on the input argument level. A score of 1.0 indicates perfect overall similarity between output and reference, while a score of 0.0 indicates no overall similarity. |
| missing_tokens | A list[tuple[int, int]] of token spans representing the word (or sentence) tokens in the reference text that are missing from the output text. A token in the reference whose similarity score falls below presence_threshold when compared to a token in the output is classified as missing from the output. A token span represents the start and end positions of the token within the reference text. Comparisons are run on words or sentences depending on the input argument level. |
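
To illustrate how the returned spans map back onto the reference text, the snippet below slices a reference string with hypothetical missing_tokens values. The span values are made up, and the snippet assumes the spans are character offsets into the reference.

reference = (
    "The tomato sauce of Neapolitan pizza must be made with San Marzano "
    "tomatoes or pomodorini del Piennolo del Vesuvio."
)

# Hypothetical spans; a real Eval Run returns them in missing_tokens.
missing_tokens = [(0, 3), (4, 10), (79, 89)]

for start, end in missing_tokens:
    # Each span is interpreted here as a (start, end) character range in the reference.
    print(reference[start:end])  # prints "The", "tomato", "pomodorini"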

  1. Zhang, Tianyi, et al. "BERTScore: Evaluating Text Generation with BERT." arXiv preprint arXiv:1904.09675 (2019). Available at: https://arxiv.org/abs/1904.09675