Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Junhao Cheng¹,², Yuying Ge¹✉, Teng Wang¹✉, Yixiao Ge¹, Jing Liao², Ying Shan¹
¹ARC Lab, Tencent PCG  ²City University of Hong Kong

Website | arXiv | HF Dataset: Video-Holmes

🔎 Introduction

Video-Holmes is a benchmark designed to evaluate the complex video reasoning capabilities of MLLMs.

Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films (each 1 to 5 minutes long), spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments.

⭐ Key Aspects of Video-Holmes:

  • One-Click Evaluation: Videos, questions, and evaluation code are packaged on GitHub and Hugging Face.
  • High Reasoning Demand: Significant performance gap between reasoning models and non-reasoning models.
  • Reasoning Process Analysis: Clearly visualizes the reasons behind correct and incorrect model responses.

We hope that Video-Holmes can serve as a "Holmes test" for multimodal reasoning, motivating models to reason more like humans and highlighting the ongoing challenges in this field. Please visit our homepage for more details!


📅 News

  • [2025-05-29] 🔥 We released the training set of Video-Holmes, which consists of 233 videos and 1,551 questions.
  • [2025-05-28] 🔥 We released Video-Holmes and the corresponding evaluation code.

🚩 Plan

  • Release suspense short film annotations
  • Release benchmark construction codes
  • Release training data
  • Support evaluation via VLMEvalKit

🏆 Leaderboard

🏅 Best performance model: Gemini-2.5-Pro

🏅 Best thinking model based on Qwen2.5-VL-7B: Video-R1

➡️ Full leaderboard

Feel free to contact us at [email protected] to add your model to the leaderboard.

🚀 Quick Start

To download Video-Holmes, you can run the following commands:

git clone https://github.com/TencentARC/Video-Holmes.git
cd Video-Holmes
pip install huggingface_hub
python download.py --hf_token YOUR_HUGGINGFACE_ACCESS_TOKEN
unzip Benchmark/videos.zip -d Benchmark/
unzip Benchmark/annotations.zip -d Benchmark/
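
For reference, the download step is essentially a Hugging Face Hub snapshot download. Below is a minimal Python sketch, assuming the dataset repo id TencentARC/Video-Holmes; download.py is the authoritative script and may do more (e.g., select specific files):

from huggingface_hub import snapshot_download

# Sketch only: assumes the benchmark is hosted as the Hugging Face dataset
# "TencentARC/Video-Holmes"; defer to download.py for the actual logic.
snapshot_download(
    repo_id="TencentARC/Video-Holmes",
    repo_type="dataset",
    local_dir="Benchmark",
    token="YOUR_HUGGINGFACE_ACCESS_TOKEN",
)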

We provide all-in-one evaluation code for baseline models:

python evaluate.py --model_name YOUR_MODEL_NAME [--model_path YOUR_MODEL_PATH]

Supported Model List:

QwenVL            QwenVL-RL       InternVL          Gemini
Qwen2.5-VL-7B     VideoChat-R1    InternVL2.5-8B    gemini-2.0-flash
Qwen2.5-VL-32B    Video-R1        InternVL3-8B      gemini-2.0-pro-exp
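
For example, to evaluate Qwen2.5-VL-7B with local weights (the model path below is an illustrative Hugging Face id, not one prescribed by the repo):

python evaluate.py --model_name Qwen2.5-VL-7B --model_path Qwen/Qwen2.5-VL-7B-Instruct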

You can also customize your model by specifying the --model_path argument, or by implementing the following functions in evaluate.py: prepare_your_model (line 388) and generate_your_model (line 439).
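
As a reference, here is a minimal, self-contained sketch of what the two hooks might look like. The signatures are assumptions based on this README; check evaluate.py (lines 388 and 439) for the exact interfaces it expects:

class DummyModel:
    # Stand-in model that always answers "A"; replace with real inference.
    def answer(self, video_path: str, question: str) -> str:
        return "A"

def prepare_your_model(model_path):
    # Assumed hook: called once before evaluation to load weights/processors.
    return DummyModel()

def generate_your_model(model, video_path, question):
    # Assumed hook: called per question; returns the answer text for one video.
    return model.answer(video_path, question)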

🧐 Reasoning Process Analysis

You first need to apply for a DeepSeek API key; then you can run the following command to analyze the reasoning process of your models:

python evaluate_reasoning.py --model_name YOUR_MODEL_NAME --api_key YOUR_API_KEY

🪄 Generate Your Holmes-Test

To generate questions for your videos with annotations, you can run the following commands:

cd Pipeline
python generate_questions.py --api_key YOUR_API_KEY

Note: You can download each video from YouTube by its VIDEO_ID via https://www.youtube.com/watch?v=VIDEO_ID.
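
If you prefer scripting the download, one option (our suggestion, not part of this repo) is the third-party yt-dlp package (install with pip install yt-dlp):

from yt_dlp import YoutubeDL

video_id = "VIDEO_ID"  # replace with the ID from the annotations
# Saves the video as <VIDEO_ID>.<ext> in the current directory.
with YoutubeDL({"outtmpl": f"{video_id}.%(ext)s"}) as ydl:
    ydl.download([f"https://www.youtube.com/watch?v={video_id}"])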

🛠️ Construction Pipeline

We select 270 high-quality suspense short films for human annotation. Next, we design seven challenging tasks and employ DeepSeek to generate questions. Finally, we evaluate SOTA MLLMs and use DeepSeek to analyze their responses (optional).

🗝️ Question Types

Existing benchmarks primarily involve clue-given questions, where models depend on explicitly provided clues to derive answers. In contrast, Video-Holmes adopts an active-seeking paradigm, requiring models to actively locate and connect multiple relevant visual clues scattered across different video segments.

📕 License

  • Video-Holmes is released under the Apache-2.0 license for academic purposes only.
  • All videos in Video-Holmes are obtained from the Internet and are not the property of our institutions. Our institutions are not responsible for the content or the meaning of these videos; the copyright remains with their original owners.
  • If any video in our dataset infringes upon your rights, please contact us for removal.

📜 Citation

If you find our work helpful, please consider giving us a star ⭐ and a citation 📝:

@article{cheng2025video,
  title={Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?},
  author={Cheng, Junhao and Ge, Yuying and Wang, Teng and Ge, Yixiao and Liao, Jing and Shan, Ying},
  journal={arXiv preprint arXiv:2505.21374},
  year={2025}
}

🤗 Acknowledgements

We referred to MovieDreamer and VCR-Bench when building our codebase and homepage. Thanks for their wonderful projects!
