
The open source post-building layer for agents. Our traces + evals power agent post-training (RL, SFT), monitoring, and regression testing.

Judgment Logo
Enable self-learning agents with traces, evals, and environment data.

Docs • Judgment Cloud • Self-Host • Landing Page

Demo • Bug Reports • Changelog

We're hiring! Join us in our mission to enable self-learning agents by providing the data and signals needed for monitoring and post-training.

X LinkedIn Discord

Judgment Platform

Judgeval offers open-source tooling for tracing and evaluating autonomous, stateful agents. It provides runtime data from agent-environment interactions for continuous learning and self-improvement.

🎬 See Judgeval in Action

A multi-agent system with complete observability: (1) the system spawns agents to research topics on the internet; (2) with just 3 lines of code, Judgeval traces every input/output and environment response across all agent tool calls for debugging; (3) the agents complete their runs; (4) all interaction data is exported to enable further environment-specific learning and optimization.

Agent Demo
🤖 Agents Running
Trace Demo
📊 Real-time Tracing
Agent Completed Demo
✅ Agents Completed Running
Data Export Demo
📤 Exporting Agent Environment Data

🛠️ Installation

Get started with Judgeval by installing our SDK using pip:

pip install judgeval

Ensure you have your JUDGMENT_API_KEY and JUDGMENT_ORG_ID environment variables set to connect to the Judgment Platform.

export JUDGMENT_API_KEY=...
export JUDGMENT_ORG_ID=...

If you don't have keys, create an account on the platform!

🏁 Quickstarts

🛰️ Tracing

Create a file named agent.py with the following code:

from judgeval.tracer import Tracer, wrap
from openai import OpenAI

client = wrap(OpenAI())  # tracks all LLM calls
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="tool")
def format_question(question: str) -> str:
    # dummy tool
    return f"Question : {question}"

@judgment.observe(span_type="function")
def run_agent(prompt: str) -> str:
    task = format_question(prompt)
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=[{"role": "user", "content": task}]
    )
    return response.choices[0].message.content
    
run_agent("What is the capital of the United States?")

You'll see your trace exported to the Judgment Platform:

Judgment Platform Trace Example

Click here for a more detailed explanation.

✨ Features

🔍 Tracing

Automatic agent tracing integrated with common frameworks (LangGraph, OpenAI, Anthropic). Tracks inputs/outputs, agent tool calls, latency, cost, and custom metadata at every step.

Useful for:
• 🐛 Debugging agent runs
• 📋 Collecting agent environment data
• 🔬 Pinpointing performance bottlenecks

Tracing visualization
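
For illustration, here is a minimal sketch of the same wrap-and-observe pattern applied to an Anthropic client. The model name, prompt, and exact Anthropic support are assumptions based on the framework list above, not a verbatim example from the docs; adapt it to your setup.

from judgeval.tracer import Tracer, wrap
from anthropic import Anthropic

client = wrap(Anthropic())  # assumption: wrap() also supports the Anthropic SDK client
judgment = Tracer(project_name="my_project")

@judgment.observe(span_type="function")
def summarize(text: str) -> str:
    # Placeholder model and prompt, used only to show where the traced LLM call happens
    response = client.messages.create(
        model="claude-3-5-sonnet-latest",
        max_tokens=512,
        messages=[{"role": "user", "content": f"Summarize: {text}"}],
    )
    return response.content[0].text

summarize("Judgeval traces agent runs end to end.")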

🧪 Evals

Build custom evaluators on top of your agents. Judgeval supports LLM-as-a-judge, manual labeling, and code-based evaluators that connect with our metric-tracking infrastructure.

Useful for:
• ⚠️ Unit-testing
• 🔬 A/B testing
• 🛡️ Online guardrails

Evaluation metrics
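
As a rough sketch of what an eval can look like, based on the evaluation quickstart in the Judgment docs (class names, parameters, and the judge model may differ across judgeval versions):

from judgeval import JudgmentClient
from judgeval.data import Example
from judgeval.scorers import FaithfulnessScorer

client = JudgmentClient()

# Hypothetical test case: is the agent's answer faithful to its retrieved context?
example = Example(
    input="What if these shoes don't fit?",
    actual_output="We offer a 30-day full refund at no extra cost.",
    retrieval_context=["All customers are eligible for a 30-day full refund at no extra cost."],
)

client.run_evaluation(
    examples=[example],
    scorers=[FaithfulnessScorer(threshold=0.5)],  # LLM-as-a-judge scorer
    model="gpt-4.1",                              # judge model
)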

📡 Monitoring

Get Slack alerts for agent failures in production. Add custom hooks to address production regressions.

Useful for:
• 📉 Identifying degradation early
• 📈 Visualizing performance trends across agent versions and time

Monitoring Dashboard

📊 Datasets

Export traces and test cases to datasets for scaled analysis and optimization. Move datasets to/from Parquet, S3, etc.

Run evals on datasets as unit tests or to A/B test different agent configurations, enabling continuous learning from production interactions.

Useful for:
• 🗃️ Agent environment interaction data for optimization
• 🔄 Scaled analysis for A/B tests

Dataset management
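
As one illustration of the Parquet/S3 workflow, here is a sketch using pandas. The record fields are hypothetical; in practice they come from your exported traces, and the dataset APIs in the Judgment docs handle push/pull for you.

import pandas as pd

# Hypothetical exported trace records, for illustration only
records = [
    {"trace_id": "t1", "input": "What is the capital of the United States?",
     "output": "Washington, D.C.", "latency_ms": 812, "cost_usd": 0.0021},
    {"trace_id": "t2", "input": "Summarize today's AI news.",
     "output": "...", "latency_ms": 1450, "cost_usd": 0.0034},
]

df = pd.DataFrame(records)
df.to_parquet("agent_runs.parquet")  # local Parquet file for scaled analysis
# df.to_parquet("s3://my-bucket/agent_runs.parquet")  # or write straight to S3 (requires s3fs)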

🏢 Self-Hosting

Run Judgment on your own infrastructure: we provide comprehensive self-hosting capabilities that give you full control over the backend and data plane that Judgeval interfaces with.

Key Features

  • Deploy Judgment on your own AWS account
  • Store data in your own Supabase instance
  • Access Judgment through your own custom domain

Getting Started

  1. Check out our self-hosting documentation for detailed setup instructions and details on how to access your self-hosted instance
  2. Use the Judgment CLI to deploy your self-hosted environment
  3. After your self-hosted instance is set up, make sure the JUDGMENT_API_URL environment variable points to your self-hosted backend endpoint (see the example below)
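
For example (the URL below is a placeholder for your own deployment):

export JUDGMENT_API_URL=https://your-judgment-backend.example.com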

📚 Cookbooks

You can access our repo of cookbooks here.

Have your own? We're happy to feature it if you create a PR or message us on Discord.

💻 Development with Cursor

Building agents and LLM workflows in Cursor works best when your coding assistant has the proper context about Judgment integration. The Cursor rules file contains the key information needed for your assistant to implement Judgment features effectively.

Refer to the official documentation for access to the rules file and more information on integrating this rules file with your codebase.

⭐ Star Us on GitHub

If you find Judgeval useful, please consider giving us a star on GitHub! Your support helps us grow our community and continue improving the repository.

❤️ Contributors

There are many ways to contribute to Judgeval:

Contributors


Judgeval is created and maintained by Judgment Labs.