OpenCompass

35 repositories

opencompass
Public
OpenCompass is an LLM evaluation platform supporting a wide range of models (Llama3, Mistral, InternLM2, GPT-4, LLaMa2, Qwen, GLM, Claude, etc.) over 100+ datasets.
benchmark evaluation openai llm chatgpt large-language-model llama2 llama3
Python • Apache License 2.0 • 636 • 5.8k • 323 • 67 • Updated Aug 1, 2025
VLMEvalKit
Public
Open-source evaluation toolkit for large multi-modality models (LMMs), supporting 220+ LMMs and 80+ benchmarks
computer-vision evaluation pytorch gemini openai vqa vit gpt multi-modal clip
Python • Apache License 2.0 • 464 • 2.8k • 143 • 15 • Updated Aug 1, 2025
CompassVerifier
Public
CompassVerifier: A Unified and Robust Verifier for Large Language Models
Jupyter Notebook • 0 • 8 • 0 • 0 • Updated Jul 31, 2025
MMBench-GUI
Public
Official repo of "MMBench-GUI: Hierarchical Multi-Platform Evaluation Framework for GUI Agents". It can be used to evaluate a GUI agent in a hierarchical manner across multiple platforms, including Windows, Linux, macOS, iOS, Android, and Web.
benchmark-framework vision-language-model computer-use gui-agent
Python • 2 • 63 • 3 • 0 • Updated Jul 28, 2025
Creation-MMBench
Public
Assessing Context-Aware Creative Intelligence in MLLMs
JavaScript • 0 • 21 • 0 • 0 • Updated Jul 22, 2025
CompassJudger
Public
The All-in-one Judge Models introduced by OpenCompass
Apache License 2.0 • 5 • 108 • 1 • 0 • Updated Jul 15, 2025
SAGA
Public
0 • 5 • 0 • 0 • Updated Jul 11, 2025
RaML
Public
[Preprint 2025] Deciphering Trajectory-Aided LLM Reasoning: An Optimization Perspective
Jupyter Notebook • 2 • 6 • 0 • 0 • Updated May 27, 2025
BotChat
Public
Evaluating LLMs' multi-round chatting capability via assessing conversations generated by two LLM instances.
Jupyter Notebook • Apache License 2.0 • 6 • 156 • 2 • 0 • Updated May 22, 2025
Ada-LEval
Public
The official implementation of "Ada-LEval: Evaluating long-context LLMs with length-adaptable benchmarks"
gpt4 llm long-context
Python • 3 • 54 • 0 • 0 • Updated May 22, 2025
MathBench
Public
[ACL 2024 Findings] MathBench: A Comprehensive Multi-Level Difficulty Mathematics Evaluation Dataset
Apache License 2.0 • 1 • 104 • 6 • 0 • Updated May 22, 2025
MMBench
Public
Official Repo of "MMBench: Is Your Multi-modal Model an All-around Player?"
Apache License 2.0 • 11 • 236 • 9 • 0 • Updated May 22, 2025
ProSA
Public
[EMNLP 2024 Findings] ProSA: Assessing and Understanding the Prompt Sensitivity of LLMs
Python • Apache License 2.0 • 2 • 28 • 0 • 0 • Updated May 22, 2025
ANAH
Public
[ACL 2024] ANAH & [NeurIPS 2024] ANAH-v2 & [ICLR 2025] Mask-DPO
acl alignment gpt iclr neurips llms hallucination-detection hallucination-mitigation
Python • Apache License 2.0 • 4 • 51 • 0 • 0 • Updated Apr 30, 2025
GTA
Public
[NeurIPS 2024 D&B Track] GTA: A Benchmark for General Tool Agents
llm-agent llm-evaluation
Python • Apache License 2.0 • 8 • 113 • 1 • 0 • Updated Mar 28, 2025
GPassK
Public
[ACL 2025] Are Your LLMs Capable of Stable Reasoning?
large-language-model-evaluation reasoning-stability
Python • 2 • 28 • 2 • 0 • Updated Mar 18, 2025
oc_doc_website
Public
0 • 0 • 0 • 0 • Updated Feb 12, 2025
GAOKAO-Eval
Public
Jupyter Notebook • 6 • 112 • 5 • 0 • Updated Dec 16, 2024
CriticEval
Public
[NeurIPS 2024] A comprehensive benchmark for evaluating critique ability of LLMs
Python • Apache License 2.0 • 2 • 42 • 0 • 0 • Updated Nov 29, 2024
lagent-cibench
Public
Python • Apache License 2.0 • 1 • 2 • 0 • 0 • Updated Sep 23, 2024
hinode
Public
A clean documentation and blog theme for your Hugo site based on Bootstrap 5
HTML • MIT License • 61 • 0 • 0 • 0 • Updated Sep 1, 2024
storage
Public
Apache License 2.0 • 0 • 0 • 0 • 0 • Updated Aug 18, 2024
CompassBench
Public
Demo data of CompassBench
3 • 10 • 3 • 0 • Updated Aug 7, 2024
CIBench
Public
Official Repo of "CIBench: Evaluation of LLMs as Code Interpreter"
Python • Apache License 2.0 • 2 • 13 • 0 • 0 • Updated Jul 19, 2024
.github
Public
1 • 0 • 0 • 0 • Updated May 31, 2024
DevEval
Public
A Comprehensive Benchmark for Software Development.
Python • Apache License 2.0 • 11 • 111 • 0 • 0 • Updated May 30, 2024
CodeBench
Public
0 • 2 • 0 • 0 • Updated May 21, 2024
T-Eval
Public
[ACL 2024] T-Eval: Evaluating Tool Utilization Capability of Large Language Models Step by Step
Python • Apache License 2.0 • 16 • 283 • 39 • 2 • Updated Apr 3, 2024
human-eval
Public
Code for the paper "Evaluating Large Language Models Trained on Code"
Python • MIT License • 401 • 3 • 0 • 0 • Updated Mar 14, 2024
OpenFinData
Public
Apache License 2.0 • 3 • 73 • 3 • 0 • Updated Mar 8, 2024