Skip to content
/ unitxt Public

πŸ¦„ Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking

License

Notifications You must be signed in to change notification settings

IBM/unitxt

Folders and files

NameName
Last commit message
Last commit date

Latest commit

Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 
Β 

Repository files navigation

Image Description

version license python tests Coverage Status Read the Docs downloads

πŸ¦„ Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking

Why Unitxt?

  • 🌐 Comprehensive: Evaluate text, tables, vision, speech, and code in one unified framework
  • πŸ’Ό Enterprise-Ready: Battle-tested components with extensive catalog of benchmarks
  • 🧠 Model Agnostic: Works with HuggingFace, OpenAI, WatsonX, and custom models
  • πŸ”’ Reproducible: Shareable, modular components ensure consistent results

Quick Links

Installation

pip install unitxt

Quick Start

Command Line Evaluation

# Simple evaluation
unitxt-evaluate \
    --tasks "card=cards.mmlu_pro.engineering" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct" \
    --limit 10

# Multi-task evaluation
unitxt-evaluate \
    --tasks "card=cards.text2sql.bird+card=cards.mmlu_pro.engineering" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
    --split test \
    --limit 10 \
    --output_path ./results/evaluate_cli \
    --log_samples \
    --apply_chat_template

# Benchmark evaluation
unitxt-evaluate \
    --tasks "benchmarks.tool_calling" \
    --model cross_provider \
    --model_args "model_name=llama-3-1-8b-instruct,max_tokens=256" \
    --split test \
    --limit 10 \
    --output_path ./results/evaluate_cli \
    --log_samples \
    --apply_chat_template

Loading as Dataset

Load thousands of datasets in chat API format, ready for any model:

from unitxt import load_dataset

dataset = load_dataset(
    card="cards.gpqa.diamond",
    split="test",
    format="formats.chat_api",
)

πŸ“Š Available on The Catalog

Tasks Datasets Prompts Benchmarks Metrics

πŸš€ Interactive Dashboard

Launch the graphical user interface to explore datasets and benchmarks:

pip install unitxt[ui]
unitxt-explore

Complete Python Example

Evaluate your own data with any model:

# Import required components
from unitxt import evaluate, create_dataset
from unitxt.blocks import Task, InputOutputTemplate
from unitxt.inference import HFAutoModelInferenceEngine

# Question-answer dataset
data = [
    {"question": "What is the capital of Texas?", "answer": "Austin"},
    {"question": "What is the color of the sky?", "answer": "Blue"},
]

# Define the task and evaluation metric
task = Task(
    input_fields={"question": str},
    reference_fields={"answer": str},
    prediction_type=str,
    metrics=["metrics.accuracy"],
)

# Create a template to format inputs and outputs
template = InputOutputTemplate(
    instruction="Answer the following question.",
    input_format="{question}",
    output_format="{answer}",
    postprocessors=["processors.lower_case"],
)

# Prepare the dataset
dataset = create_dataset(
    task=task,
    template=template,
    format="formats.chat_api",
    test_set=data,
    split="test",
)

# Set up the model (supports Hugging Face, WatsonX, OpenAI, etc.)
model = HFAutoModelInferenceEngine(
    model_name="Qwen/Qwen1.5-0.5B-Chat", max_new_tokens=32
)

# Generate predictions and evaluate
predictions = model(dataset)
results = evaluate(predictions=predictions, data=dataset)

# Print results
print("Global Results:\n", results.global_scores.summary)
print("Instance Results:\n", results.instance_scores.summary)

Contributing

Read the contributing guide for details on how to contribute to Unitxt.

Citation

If you use Unitxt in your research, please cite our paper:

@inproceedings{bandel-etal-2024-unitxt,
    title = "Unitxt: Flexible, Shareable and Reusable Data Preparation and Evaluation for Generative {AI}",
    author = "Bandel, Elron  and
      Perlitz, Yotam  and
      Venezian, Elad  and
      Friedman, Roni  and
      Arviv, Ofir  and
      Orbach, Matan  and
      Don-Yehiya, Shachar  and
      Sheinwald, Dafna  and
      Gera, Ariel  and
      Choshen, Leshem  and
      Shmueli-Scheuer, Michal  and
      Katz, Yoav",
    editor = "Chang, Kai-Wei  and
      Lee, Annie  and
      Rajani, Nazneen",
    booktitle = "Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 3: System Demonstrations)",
    month = jun,
    year = "2024",
    address = "Mexico City, Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.naacl-demo.21",
    pages = "207--215",
}

About

πŸ¦„ Unitxt is a Python library for enterprise-grade evaluation of AI performance, offering the world's largest catalog of tools and data for end-to-end AI benchmarking

Topics

Resources

License

Security policy

Stars

Watchers

Forks

Languages