Commit 99ebc45

Break up evaluators table (#46090)
1 parent 56236f7 commit 99ebc45

1 file changed: +39 -24 lines changed

docs/ai/conceptual/evaluation-libraries.md

Lines changed: 39 additions & 24 deletions
@@ -2,7 +2,7 @@
 title: The Microsoft.Extensions.AI.Evaluation libraries
 description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps.
 ms.topic: concept-article
-ms.date: 05/09/2025
+ms.date: 05/13/2025
 ---
 # The Microsoft.Extensions.AI.Evaluation libraries (Preview)

@@ -23,29 +23,44 @@ The libraries are designed to integrate smoothly with existing .NET apps, allowi
 
 ## Comprehensive evaluation metrics
 
-The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following table shows the built-in evaluators.
-
-| Metric | Description | Evaluator type |
-|--------------|--------------------------------------------------------|----------------|
-| Relevance | Evaluates how relevant a response is to a query | `RelevanceEvaluator` <!-- <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceEvaluator> --> |
-| Completeness | Evaluates how comprehensive and accurate a response is | `CompletenessEvaluator` <!-- <xref:Microsoft.Extensions.AI.Evaluation.Quality.CompletenessEvaluator> --> |
-| Retrieval | Evaluates performance in retrieving information for additional context | `RetrievalEvaluator` <!-- <xref:Microsoft.Extensions.AI.Evaluation.Quality.RetrievalEvaluator> --> |
-| Fluency | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability| <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> |
-| Coherence | Evaluates the logical and orderly presentation of ideas | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> |
-| Equivalence | Evaluates the similarity between the generated text and its ground truth with respect to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> |
-| Groundedness | Evaluates how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator><br />`GroundednessProEvaluator` |
-| Protected material | Evaluates response for the presence of protected material | `ProtectedMaterialEvaluator` |
-| Ungrounded human attributes | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes | `UngroundedAttributesEvaluator` |
-| Hate content | Evaluates a response for the presence of content that's hateful or unfair | `HateAndUnfairnessEvaluator`|
-| Self-harm content | Evaluates a response for the presence of content that indicates self harm | `SelfHarmEvaluator`|
-| Violent content | Evaluates a response for the presence of violent content | `ViolenceEvaluator`|
-| Sexual content | Evaluates a response for the presence of sexual content | `SexualEvaluator`|
-| Code vulnerability content | Evaluates a response for the presence of vulnerable code | `CodeVulnerabilityEvaluator` |
-| Indirect attack content | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering | `IndirectAttackEvaluator` |
-
-† In addition, the `ContentHarmEvaluator` provides single-shot evaluation for the four metrics supported by `HateAndUnfairnessEvaluator`, `SelfHarmEvaluator`, `ViolenceEvaluator`, and `SexualEvaluator`.
-
-You can also customize to add your own evaluations by implementing the <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface or extending the base classes such as <xref:Microsoft.Extensions.AI.Evaluation.Quality.ChatConversationEvaluator> and <xref:Microsoft.Extensions.AI.Evaluation.Quality.SingleNumericMetricEvaluator>.
+The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in [quality](#quality-evaluators) and [safety](#safety-evaluators) evaluators and the metrics they measure.
+
+You can also add your own custom evaluations by implementing the <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface.
+
+### Quality evaluators
+
+Quality evaluators measure response quality. They use an LLM to perform the evaluation.
+
+| Metric | Description | Evaluator type |
+|----------------|--------------------------------------------------------|----------------|
+| `Relevance` | Evaluates how relevant a response is to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceEvaluator> |
+| `Completeness` | Evaluates how comprehensive and accurate a response is | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CompletenessEvaluator> |
+| `Retrieval` | Evaluates performance in retrieving information for additional context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RetrievalEvaluator> |
+| `Fluency` | Evaluates grammatical accuracy, vocabulary range, sentence complexity, and overall readability | <xref:Microsoft.Extensions.AI.Evaluation.Quality.FluencyEvaluator> |
+| `Coherence` | Evaluates the logical and orderly presentation of ideas | <xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator> |
+| `Equivalence` | Evaluates the similarity between the generated text and its ground truth with respect to a query | <xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator> |
+| `Groundedness` | Evaluates how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator> |
+| `Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)` | Evaluates how relevant, truthful, and complete a response is | <xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator> † |
+
+† This evaluator is marked [experimental](../../fundamentals/syslib-diagnostics/experimental-overview.md).
+
+### Safety evaluators
+
+Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.
+
+| Metric | Description | Evaluator type |
+|--------------------|-----------------------------------------------------------------------|------------------------------|
+| `Groundedness Pro` | Uses a fine-tuned model hosted behind the Azure AI Foundry Evaluation service to evaluate how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Safety.GroundednessProEvaluator> |
+| `Protected Material` | Evaluates a response for the presence of protected material | <xref:Microsoft.Extensions.AI.Evaluation.Safety.ProtectedMaterialEvaluator> |
+| `Ungrounded Attributes` | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes | <xref:Microsoft.Extensions.AI.Evaluation.Safety.UngroundedAttributesEvaluator> |
+| `Hate And Unfairness` | Evaluates a response for the presence of content that's hateful or unfair | <xref:Microsoft.Extensions.AI.Evaluation.Safety.HateAndUnfairnessEvaluator> |
+| `Self Harm` | Evaluates a response for the presence of content that indicates self-harm | <xref:Microsoft.Extensions.AI.Evaluation.Safety.SelfHarmEvaluator> |
+| `Violence` | Evaluates a response for the presence of violent content | <xref:Microsoft.Extensions.AI.Evaluation.Safety.ViolenceEvaluator> |
+| `Sexual` | Evaluates a response for the presence of sexual content | <xref:Microsoft.Extensions.AI.Evaluation.Safety.SexualEvaluator> |
+| `Code Vulnerability` | Evaluates a response for the presence of vulnerable code | <xref:Microsoft.Extensions.AI.Evaluation.Safety.CodeVulnerabilityEvaluator> |
+| `Indirect Attack` | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering | <xref:Microsoft.Extensions.AI.Evaluation.Safety.IndirectAttackEvaluator> |
+
+† In addition, the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentHarmEvaluator> provides single-shot evaluation for the four metrics supported by `HateAndUnfairnessEvaluator`, `SelfHarmEvaluator`, `ViolenceEvaluator`, and `SexualEvaluator`.
 
 ## Cached responses
 
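The <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface referenced in the updated section is the extensibility point for custom evaluations. The following is a minimal sketch of a custom, non-LLM evaluator, assuming the preview `IEvaluator` shape (an `EvaluationMetricNames` property plus an `EvaluateAsync` method that returns an `EvaluationResult`); exact member signatures can differ between preview versions, and the word-count metric is purely illustrative.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// Illustrative only: a deterministic evaluator that reports the response's
// word count as a numeric metric, with no LLM involved.
public sealed class WordCountEvaluator : IEvaluator
{
    private const string MetricName = "Word Count";

    public IReadOnlyCollection<string> EvaluationMetricNames => [MetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        int wordCount = modelResponse.Text
            .Split(' ', StringSplitOptions.RemoveEmptyEntries)
            .Length;

        // Package the measurement as a metric and wrap it in an EvaluationResult.
        var metric = new NumericMetric(MetricName, wordCount);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```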

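As a usage sketch for the LLM-backed quality evaluators described above: the snippet below wraps an existing <xref:Microsoft.Extensions.AI.IChatClient> in a `ChatConfiguration` and runs `CoherenceEvaluator` over a single response. The caller supplies the chat client; the `CoherenceMetricName` constant and the `Get<NumericMetric>` accessor are assumed from the preview API and may differ in other versions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

public static class QualityEvaluationExample
{
    // Runs the LLM-based CoherenceEvaluator against one model response.
    // Here the same IChatClient produces the response and performs the
    // evaluation; a separate evaluation model works too.
    public static async Task EvaluateCoherenceAsync(IChatClient chatClient)
    {
        // Quality evaluators need an LLM, supplied through a ChatConfiguration.
        var chatConfiguration = new ChatConfiguration(chatClient);

        var messages = new List<ChatMessage>
        {
            new(ChatRole.User, "Explain how photosynthesis works.")
        };
        ChatResponse response = await chatClient.GetResponseAsync(messages);

        IEvaluator coherenceEvaluator = new CoherenceEvaluator();
        EvaluationResult result = await coherenceEvaluator.EvaluateAsync(
            messages, response, chatConfiguration);

        // Coherence is reported as a numeric metric.
        NumericMetric coherence =
            result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
        Console.WriteLine($"Coherence: {coherence.Value}");
    }
}
```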
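
And a corresponding sketch for the safety evaluators, which are backed by the Azure AI Foundry Evaluation service rather than a general-purpose LLM. It assumes a `ChatConfiguration` that has already been set up for that service (the Safety package's `ContentSafetyServiceConfiguration` is the usual source of one), and it assumes the preview `EvaluationResult.Metrics` and `NumericMetric.Value` members; names and constructors may differ across preview versions.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Safety;

public static class SafetyEvaluationExample
{
    // Evaluates one conversation turn for content harms. The supplied
    // ChatConfiguration must target the Azure AI Foundry Evaluation service,
    // not an ordinary chat model.
    public static async Task EvaluateHarmsAsync(
        ChatConfiguration safetyServiceChatConfiguration,
        IEnumerable<ChatMessage> messages,
        ChatResponse response)
    {
        // ContentHarmEvaluator covers the Hate And Unfairness, Self Harm,
        // Violence, and Sexual metrics in a single (single-shot) evaluation.
        IEvaluator harmEvaluator = new ContentHarmEvaluator();
        EvaluationResult result = await harmEvaluator.EvaluateAsync(
            messages, response, safetyServiceChatConfiguration);

        // Harm metrics are numeric severity scores; print each one.
        foreach (EvaluationMetric metric in result.Metrics.Values)
        {
            double? severity = (metric as NumericMetric)?.Value;
            Console.WriteLine($"{metric.Name}: {severity}");
        }
    }
}
```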