Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng¹,², Yuying Ge¹✉, Teng Wang¹✉, Yixiao Ge¹, Jing Liao², Ying Shan¹
¹ARC Lab, Tencent PCG  ²City University of Hong Kong

Website | arXiv | HF Dataset: Video-Holmes

🔎 Introduction

Video-Holmes is a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs.

Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films (each 1 to 5 minutes long), spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments.

⭐ Key Aspects of Video-Holmes:

  • One-Click Evaluation: Videos, questions, and evaluation code are packaged on GitHub and Hugging Face.
  • High Reasoning Demand: Significant performance gap between reasoning models and non-reasoning models.
  • Reasoning Process Analysis: Clearly visualizes the reasons behind correct and incorrect model responses.

We hope that Video-Holmes can serve as a "Holmes test" for multimodal reasoning, motivating models to reason more like humans and highlighting the ongoing challenges in this field. Please visit our homepage for more details!


📅 News

  • [2025-05-29] 🔥 We released the training set of Video-Holmes, which consists of 233 videos and 1,551 questions.
  • [2025-05-28] 🔥 We released Video-Holmes and the corresponding evaluation code.

🚩 Plan

  • Release suspense short film annotations
  • Release benchmark construction codes
  • Release training data
  • Support evaluation via VLMEvalKit

🏆 Leaderboard

🏅 Best performance model: Gemini-2.5-Pro

🏅 Best thinking model based on Qwen2.5-VL-7B: Video-R1

➡️ Full leaderboard

Feel free to contact us at [email protected] to add your model to the leaderboard.

🚀 Quick Start

To download Video-Holmes, you can run the following commands:

git clone https://github.com/TencentARC/Video-Holmes.git
cd Video-Holmes
pip install huggingface_hub
python download.py --hf_token YOUR_HUGGINGFACE_ACCESS_TOKEN
unzip Benchmark/videos.zip -d Benchmark/
unzip Benchmark/annotations.zip -d Benchmark/
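
For reference, the download step is essentially a Hugging Face Hub snapshot download. Below is a minimal Python sketch, assuming the dataset repo id TencentARC/Video-Holmes; download.py is the authoritative script and may do more (e.g., select specific files):

from huggingface_hub import snapshot_download

# Sketch only: assumes the benchmark is hosted as the Hugging Face dataset
# "TencentARC/Video-Holmes"; defer to download.py for the actual logic.
snapshot_download(
    repo_id="TencentARC/Video-Holmes",
    repo_type="dataset",
    local_dir="Benchmark",
    token="YOUR_HUGGINGFACE_ACCESS_TOKEN",
)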

We provide all-in-one evaluation code for baseline models:

python evaluate.py --model_name YOUR_MODEL_NAME [--model_path YOUR_MODEL_PATH]

Supported Model List:

QwenVL            QwenVL-RL       InternVL          Gemini
Qwen2.5-VL-7B     VideoChat-R1    InternVL2.5-8B    gemini-2.0-flash
Qwen2.5-VL-32B    Video-R1        InternVL3-8B      gemini-2.0-pro-exp
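
For example, to evaluate Qwen2.5-VL-7B with local weights (the model path below is an illustrative Hugging Face id, not one prescribed by the repo):

python evaluate.py --model_name Qwen2.5-VL-7B --model_path Qwen/Qwen2.5-VL-7B-Instruct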

You can also customize your model by specifying the --model_path argument, or by implementing the following functions in evaluate.py: prepare_your_model (line 388) and generate_your_model (line 439).
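
As a reference, here is a minimal, self-contained sketch of what the two hooks might look like. The signatures are assumptions based on this README; check evaluate.py (lines 388 and 439) for the exact interfaces it expects:

class DummyModel:
    # Stand-in model that always answers "A"; replace with real inference.
    def answer(self, video_path: str, question: str) -> str:
        return "A"

def prepare_your_model(model_path):
    # Assumed hook: called once before evaluation to load weights/processors.
    return DummyModel()

def generate_your_model(model, video_path, question):
    # Assumed hook: called per question; returns the answer text for one video.
    return model.answer(video_path, question)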

🧐 Reasoning Process Analysis

You first need to apply for a DeepSeek API key; then you can run the following command to analyze the reasoning process of your models:

python evaluate_reasoning.py --model_name YOUR_MODEL_NAME --api_key YOUR_API_KEY

🪄 Generate Your Holmes-Test

To generate questions for your videos with annotations, you can run the following commands:

cd Pipeline
python generate_questions.py --api_key YOUR_API_KEY

Note: You can download each video from YouTube by its VIDEO_ID via https://www.youtube.com/watch?v=VIDEO_ID.
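
If you prefer scripting the download, one option (our suggestion, not part of this repo) is the third-party yt-dlp package (install with pip install yt-dlp):

from yt_dlp import YoutubeDL

video_id = "VIDEO_ID"  # replace with the ID from the annotations
# Saves the video as <VIDEO_ID>.<ext> in the current directory.
with YoutubeDL({"outtmpl": f"{video_id}.%(ext)s"}) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={video_id}"])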

🛠️ Construction Pipeline

We select 270 high-quality suspense short films for human annotation. Next, we design seven challenging tasks and employ DeepSeek to generate questions. Finally, we evaluate SOTA MLLMs and use DeepSeek to analyze their responses (optional).

🗝️ Question Types

Existing benchmarks primarily involve clue-given questions, where models depend on explicitly provided clues to derive answers. In contrast, Video-Holmes adopts an active-seeking paradigm, requiring models to actively locate and connect multiple relevant visual clues scattered across different video segments.

📕 License

  • Video-Holmes is released under the Apache-2.0 license for academic purposes only.
  • All videos in Video-Holmes are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos; the copyright remains with their original owners.
  • If any video in our dataset infringes upon your rights, please contact us for removal.

📜 Citation

If you find our work helpful, please consider giving us a star ⭐ and a citation 📝:

@article{cheng2025video,
  title={Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?},
  author={Cheng, Junhao and Ge, Yuying and Wang, Teng and Ge, Yixiao and Liao, Jing and Shan, Ying},
  journal={arXiv preprint arXiv:2505.21374},
  year={2025}
}

🤗 Acknowledgements

We referred to MovieDreamer and VCR-Bench when building our codebase and homepage. Thanks for their wonderful projects!
