    Repositories list

    • OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) across 100+ datasets.
      Python
      6365.8k32367 Updated Aug 1, 2025
    • Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
      Python
      4642.8k14315 Updated Aug 1, 2025
    • CompassVerifier: A Unified and Robust Verifier for Large Language Models
      Jupyter Notebook
      0800 Updated Jul 31, 2025
    • Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, including Windows, Linux, macOS, iOS, Android, and Web.
      Python
      26330 Updated Jul 28, 2025
    • Assessing Context-Aware Creative Intelligence in MLLMs
      JavaScript
      02100 Updated Jul 22, 2025
    • The All-in-one Judge Models introduced by OpenCompass
      510810 Updated Jul 15, 2025
    • SAGA

      Public
      0500 Updated Jul 11, 2025
    • RaML

      Public
      [Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
      Jupyter Notebook
      2600 Updated May 27, 2025
    • BotChat

      Public
      Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
      Jupyter Notebook
      615620 Updated May 22, 2025
    • Ada-LEval

      Public
      The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
      Python
      35400 Updated May 22, 2025
    • MathBench

      Public
      [ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
      110460 Updated May 22, 2025
    • MMBench

      Public
      Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
      1123690 Updated May 22, 2025
    • ProSA

      Public
      [EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
      Python
      22800 Updated May 22, 2025
    • ANAH

      Public
      [ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
      Python
      45100 Updated Apr 30, 2025
    • GTA

      Public
      [NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
      Python
      811310 Updated Mar 28, 2025
    • GPassK

      Public
      [ACL 2025] Are Your LLMs Capable of Stable Reasoning?
      Python
      22820 Updated Mar 18, 2025
    • 0000 Updated Feb 12, 2025
    • Jupyter Notebook
      611250 Updated Dec 16, 2024
    • [NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
      Python
      24200 Updated Nov 29, 2024
    • Python
      1200 Updated Sep 23, 2024
    • hinode

      Public
      A clean documentation and blog theme for your Hugo site based on Bootstrap 5
      HTML
      61000 Updated Sep 1, 2024
    • storage

      Public
      0000 Updated Aug 18, 2024
    • Demo data of CompassBench
      31030 Updated Aug 7, 2024
    • CIBench

      Public
      Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
      Python
      21300 Updated Jul 19, 2024
    • .github

      Public
      1000 Updated May 31, 2024
    • DevEval

      Public
      A Comprehensive Benchmark for Software Development.
      Python
      1111100 Updated May 30, 2024
    • CodeBench

      Public
      0200 Updated May 21, 2024
    • T-Eval

      Public
      [ACL2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
      Python
      16283392 Updated Apr 3, 2024
    • Code for the paper "Evaluating Large Language Models Trained on Code"
      Python
      401300 Updated Mar 14, 2024
    • 37330 Updated Mar 8, 2024