docs/ai/conceptual/evaluation-libraries.md
---
title: The Microsoft.Extensions.AI.Evaluation libraries
description: Learn about the Microsoft.Extensions.AI.Evaluation libraries, which simplify the process of evaluating the quality and accuracy of responses generated by AI models in .NET intelligent apps.
ms.topic: concept-article
ms.date: 05/13/2025
---
# The Microsoft.Extensions.AI.Evaluation libraries (Preview)
The libraries are designed to integrate smoothly with existing .NET apps, allowing …
## Comprehensive evaluation metrics
The evaluation libraries were built in collaboration with data science researchers from Microsoft and GitHub, and were tested on popular Microsoft Copilot experiences. The following sections show the built-in [quality](#quality-evaluators) and [safety](#safety-evaluators) evaluators and the metrics they measure.

You can also add your own custom evaluations by implementing the <xref:Microsoft.Extensions.AI.Evaluation.IEvaluator> interface.
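As a rough sketch of what a custom evaluator can look like (the interface shape shown here is an assumption based on the preview API and may differ from the current release; `WordCountEvaluator` and its metric name are hypothetical names for illustration):

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;

// A hypothetical deterministic evaluator that reports the word count of a
// response as a numeric metric. No LLM is needed for this kind of evaluation.
public class WordCountEvaluator : IEvaluator
{
    public const string WordCountMetricName = "Words";

    public IReadOnlyCollection<string> EvaluationMetricNames => [WordCountMetricName];

    public ValueTask<EvaluationResult> EvaluateAsync(
        IEnumerable<ChatMessage> messages,
        ChatResponse modelResponse,
        ChatConfiguration? chatConfiguration = null,
        IEnumerable<EvaluationContext>? additionalContext = null,
        CancellationToken cancellationToken = default)
    {
        // Count whitespace-separated tokens in the response text.
        int wordCount = (modelResponse.Text ?? string.Empty)
            .Split(' ', StringSplitOptions.RemoveEmptyEntries).Length;

        var metric = new NumericMetric(WordCountMetricName, wordCount);
        return new ValueTask<EvaluationResult>(new EvaluationResult(metric));
    }
}
```

Because this evaluator doesn't call a model, it can run in unit tests without any LLM configuration; check the `IEvaluator` API reference for the exact member signatures before relying on this shape.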
29
+
30
+
### Quality evaluators
31
+
32
+
Quality evaluators measure response quality. They use an LLM to perform the evaluation.
|`Relevance`| Evaluates how relevant a response is to a query |<xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceEvaluator>|
37
+
|`Completeness`| Evaluates how comprehensive and accurate a response is |<xref:Microsoft.Extensions.AI.Evaluation.Quality.CompletenessEvaluator>|
38
+
|`Retrieval`| Evaluates performance in retrieving information for additional context |<xref:Microsoft.Extensions.AI.Evaluation.Quality.RetrievalEvaluator>|
|`Coherence`| Evaluates the logical and orderly presentation of ideas |<xref:Microsoft.Extensions.AI.Evaluation.Quality.CoherenceEvaluator>|
41
+
|`Equivalence`| Evaluates the similarity between the generated text and its ground truth with respect to a query |<xref:Microsoft.Extensions.AI.Evaluation.Quality.EquivalenceEvaluator>|
42
+
|`Groundedness`| Evaluates how well a generated response aligns with the given context |<xref:Microsoft.Extensions.AI.Evaluation.Quality.GroundednessEvaluator>|
43
+
|`Relevance (RTC)`, `Truth (RTC)`, and `Completeness (RTC)`| Evaluates how relevant, truthful, and complete a response is |<xref:Microsoft.Extensions.AI.Evaluation.Quality.RelevanceTruthAndCompletenessEvaluator>† |
44
+
45
+
† This evaluator is marked [experimental](../../fundamentals/syslib-diagnostics/experimental-overview.md).
46
+
47
+
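A quality evaluator is typically run against a chat response by passing it a `ChatConfiguration` that wraps the `IChatClient` used for the LLM-based judgment. The following is a hedged sketch (member names are assumptions based on the preview API; verify against the current API reference before use):

```csharp
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Quality;

// Assumes an IChatClient is already available (for example, from Azure OpenAI).
IChatClient chatClient = GetChatClient(); // hypothetical helper
var chatConfiguration = new ChatConfiguration(chatClient);

var messages = new List<ChatMessage>
{
    new(ChatRole.User, "Describe the phases of the moon in one paragraph.")
};
ChatResponse response = await chatClient.GetResponseAsync(messages);

// Use an LLM-based quality evaluator to score the response.
IEvaluator evaluator = new CoherenceEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(
    messages, response, chatConfiguration);

NumericMetric coherence =
    result.Get<NumericMetric>(CoherenceEvaluator.CoherenceMetricName);
Console.WriteLine($"Coherence: {coherence.Value}");
```

Because the evaluation itself is performed by a model, scores can vary from run to run; the libraries are designed so that the same pattern works for any of the quality evaluators in the table above.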
### Safety evaluators

Safety evaluators check for the presence of harmful, inappropriate, or unsafe content in a response. They rely on the Azure AI Foundry Evaluation service, which uses a model that's fine-tuned to perform evaluations.

| Metric | Description | Evaluator type |
|--------|-------------|----------------|
| `Groundedness Pro` | Uses a fine-tuned model hosted behind the Azure AI Foundry Evaluation service to evaluate how well a generated response aligns with the given context | <xref:Microsoft.Extensions.AI.Evaluation.Safety.GroundednessProEvaluator> |
| `Protected Material` | Evaluates a response for the presence of protected material | <xref:Microsoft.Extensions.AI.Evaluation.Safety.ProtectedMaterialEvaluator> |
| `Ungrounded Attributes` | Evaluates a response for the presence of content that indicates ungrounded inference of human attributes | <xref:Microsoft.Extensions.AI.Evaluation.Safety.UngroundedAttributesEvaluator> |
| `Hate And Unfairness` | Evaluates a response for the presence of content that's hateful or unfair | <xref:Microsoft.Extensions.AI.Evaluation.Safety.HateAndUnfairnessEvaluator>† |
| `Self Harm` | Evaluates a response for the presence of content that indicates self-harm | <xref:Microsoft.Extensions.AI.Evaluation.Safety.SelfHarmEvaluator>† |
| `Violence` | Evaluates a response for the presence of violent content | <xref:Microsoft.Extensions.AI.Evaluation.Safety.ViolenceEvaluator>† |
| `Sexual` | Evaluates a response for the presence of sexual content | <xref:Microsoft.Extensions.AI.Evaluation.Safety.SexualEvaluator>† |
| `Code Vulnerability` | Evaluates a response for the presence of vulnerable code | <xref:Microsoft.Extensions.AI.Evaluation.Safety.CodeVulnerabilityEvaluator> |
| `Indirect Attack` | Evaluates a response for the presence of indirect attacks, such as manipulated content, intrusion, and information gathering | <xref:Microsoft.Extensions.AI.Evaluation.Safety.IndirectAttackEvaluator> |

† In addition, the <xref:Microsoft.Extensions.AI.Evaluation.Safety.ContentHarmEvaluator> provides single-shot evaluation for the four metrics supported by `HateAndUnfairnessEvaluator`, `SelfHarmEvaluator`, `ViolenceEvaluator`, and `SexualEvaluator`.
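Because safety evaluators call the Azure AI Foundry Evaluation service rather than a local `IChatClient`, they're configured through a service configuration object. The following is a loose sketch only — the type and member names (`ContentSafetyServiceConfiguration`, `ToChatConfiguration`) are assumptions based on the preview API, and the placeholder Azure identifiers must be replaced with real values:

```csharp
using Azure.Identity;
using Microsoft.Extensions.AI;
using Microsoft.Extensions.AI.Evaluation;
using Microsoft.Extensions.AI.Evaluation.Safety;

// Configure access to the Azure AI Foundry Evaluation service.
// The subscription, resource group, and project values are placeholders.
var serviceConfiguration = new ContentSafetyServiceConfiguration(
    credential: new DefaultAzureCredential(),
    subscriptionId: "<subscription-id>",
    resourceGroupName: "<resource-group>",
    projectName: "<project-name>");

// Safety evaluators consume the service configuration via a ChatConfiguration.
ChatConfiguration chatConfiguration = serviceConfiguration.ToChatConfiguration();

IEvaluator evaluator = new ViolenceEvaluator();
EvaluationResult result = await evaluator.EvaluateAsync(
    messages, response, chatConfiguration); // messages/response from a prior chat turn
```

Verify the configuration type and its members against the `Microsoft.Extensions.AI.Evaluation.Safety` API reference, since preview APIs change between releases.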