Aleksandar Jevtić* 1 Christoph Reich* 1,2,4,5 Felix Wimbauer1,4 Oliver Hahn2 Christian Rupprecht3 Stefan Roth2,5,6 Daniel Cremers1,4,5
1TU Munich 2TU Darmstadt 3University of Oxford 4MCML 5ELIZA 6hessian.AI *equal contribution

TL;DR: SceneDINO is unsupervised and infers 3D geometry and features from a single image in a feed-forward manner. Distilling and clustering SceneDINO's 3D feature field results in unsupervised semantic scene completion predictions. SceneDINO is trained using multi-view self-supervision.
Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.
09/07/2025: ArXiv preprint and code released. 🚀
Our Python environment is managed with Conda.
conda env create -f environment.yml
conda activate scenedino
We provide configuration files for the datasets SceneDINO is trained and evaluated on. Adjust these files as needed and, most importantly, insert the paths to your local copies of the data.
configs/dataset/kitti_360_sscbench.yaml
configs/dataset/cityscapes_seg.yaml
configs/dataset/bdd_seg.yaml
configs/dataset/realestate10k.yaml
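As a quick way to check which entries need your local data paths, here is a minimal sketch that loads and prints a dataset config with OmegaConf (assuming the configs are plain YAML; the key name data_path below is only a placeholder, use whatever path entry the file actually defines).
from omegaconf import OmegaConf

# Load a dataset config and print it to see which entries hold data paths.
cfg = OmegaConf.load("configs/dataset/kitti_360_sscbench.yaml")
print(OmegaConf.to_yaml(cfg))

# "data_path" is a placeholder key, not necessarily the real one -- edit the
# path entry the file actually defines, then save the config back to disk.
# cfg.data_path = "/path/to/KITTI-360"
# OmegaConf.save(cfg, "configs/dataset/kitti_360_sscbench.yaml")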
To download KITTI-360, create an account and follow the instructions on the official website. We require the perspective images, fisheye images, raw Velodyne scans, calibrations, and vehicle poses.
Our pre-trained checkpoints are stored in the CVG webshare. Download one of the checkpoints using the dedicated script. To replicate our results using ORB-SLAM3, we provide the obtained poses in datasets/kitti_360/orb_slam_poses.
# Download best model trained on KITTI-360 (SSCBench split)
python download_checkpoint.py ssc-kitti-360-dino
# Model trained with ORB-SLAM3 poses
python download_checkpoint.py ssc-kitti-360-dino-orb-slam
# Model using DINOv2 features
python download_checkpoint.py ssc-kitti-360-dinov2
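The downloaded checkpoints are regular PyTorch .pt files. A minimal sketch for inspecting one (the top-level keys are repository-specific, so this simply lists whatever is stored):
import torch

# Path to a checkpoint fetched by download_checkpoint.py.
ckpt_path = "<PATH-MODEL-CKPT>.pt"

# On recent PyTorch versions you may need torch.load(..., weights_only=False).
ckpt = torch.load(ckpt_path, map_location="cpu")
if isinstance(ckpt, dict):
    for key, value in ckpt.items():
        print(key, type(value))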
Table 1. SSCBench-KITTI-360 results. We compare SceneDINO to the STEGO + S4C baseline in unsupervised SSC using the mean intersection over union score (mIoU) in %.
Method | Checkpoint | mIoU (12.8m) | mIoU (25.6m) | mIoU (51.2m)
---|---|---|---|---
Baseline | - | 10.53 | 9.26 | 6.60
SceneDINO | ssc-kitti-360-dino | 10.76 | 10.01 | 8.00
SceneDINO (ORB-SLAM3 poses) | ssc-kitti-360-dino-orb-slam | 10.88 | 9.86 | 7.88
SceneDINO (DINOv2) | ssc-kitti-360-dinov2 | 13.76 | 11.78 | 9.08
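For reference, mIoU is the intersection over union averaged over classes. A generic NumPy sketch of the metric on toy voxel grids (not the SSCBench evaluation code used for the table above):
import numpy as np

def mean_iou(pred, gt, num_classes, ignore_index=255):
    # Average IoU over all classes with a non-empty union.
    valid = gt != ignore_index
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c)[valid].sum()
        union = np.logical_or(pred == c, gt == c)[valid].sum()
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))

# Toy example on random 32x32x8 voxel grids with 4 classes.
rng = np.random.default_rng(0)
pred = rng.integers(0, 4, size=(32, 32, 8))
gt = rng.integers(0, 4, size=(32, 32, 8))
print(f"mIoU: {mean_iou(pred, gt, num_classes=4):.3f}")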
This simple demo script loads a model and runs inference, producing 3D outputs as well as rendered 2D views. It can be used as a starting point for experimenting with SceneDINO feature fields.
python demo_script.py -h
# First image of the KITTI-360 test set
python demo_script.py --ckpt <PATH-MODEL-CKPT>
# Custom image
python demo_script.py --ckpt <PATH-MODEL-CKPT> --image <PATH-DEMO-IMAGE>
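A common way to look at high-dimensional feature maps (e.g., the rendered 2D features produced by the demo) is to project the channels to RGB via a PCA. Below is a self-contained sketch on a random array standing in for an H x W x C feature map; it is not part of the demo script.
import numpy as np

def pca_to_rgb(features):
    # Project an (H, W, C) feature map onto its first three principal
    # components and normalize the result to [0, 1] for display.
    h, w, c = features.shape
    flat = features.reshape(-1, c)
    flat = flat - flat.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(flat, full_matrices=False)
    proj = flat @ vt[:3].T
    proj = (proj - proj.min(axis=0)) / (np.ptp(proj, axis=0) + 1e-8)
    return proj.reshape(h, w, 3)

# Random stand-in for a rendered feature map.
rgb = pca_to_rgb(np.random.rand(96, 320, 64).astype(np.float32))
print(rgb.shape, rgb.min(), rgb.max())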
For unsupervised SSC, training is performed in two stages. We provide training configurations in configs/ for each of them.
SceneDINO
First, the 3D feature fields of SceneDINO are trained.
python train.py -cn train_scenedino_kitti_360
Unsupervised SSC
Based on a SceneDINO checkpoint, we train the unsupervised SSC head.
python train.py -cn train_semantic_kitti_360
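As described above, unsupervised semantics come from clustering the distilled 3D features into pseudo-classes. A generic scikit-learn sketch of this idea on random per-voxel features (illustrative only, not the SSC head trained by the command above):
import numpy as np
from sklearn.cluster import KMeans

# Random stand-in for distilled per-voxel features, shape (num_voxels, feature_dim).
rng = np.random.default_rng(0)
voxel_features = rng.normal(size=(10000, 64)).astype(np.float32)

# Cluster features into pseudo-semantic classes (19 clusters chosen arbitrarily here).
# For evaluation, cluster IDs are typically matched to ground-truth classes,
# e.g., with Hungarian matching.
kmeans = KMeans(n_clusters=19, random_state=0)
pseudo_labels = kmeans.fit_predict(voxel_features)
print(pseudo_labels.shape, np.unique(pseudo_labels).size)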
Logging
We use TensorBoard to keep track of losses, metrics, and qualitative results.
tensorboard --port 8000 --logdir out/
We further provide configurations to reproduce the evaluation results from the paper.
Unsupervised 2D Segmentation
# Unsupervised 2D Segmentation
python eval.py -cn evaluate_semantic_kitti_360
Unsupervised SSC
# Unsupervised SSC, adapted from S4C (https://github.com/ahayler/s4c)
python evaluate_model_sscbench.py -ssc <PATH-SSCBENCH> -vgt <PATH-SSCBENCH-LABELS> -cp <PATH-CHECKPOINT>.pt -f -m scenedino -p <RUN-NAME>
If you find our work useful, please consider citing our paper.
@inproceedings{Jevtic:2025:SceneDINO,
author = {Aleksandar Jevti{\'c} and
Christoph Reich and
Felix Wimbauer and
Oliver Hahn and
Christian Rupprecht and
Stefan Roth and
Daniel Cremers},
title = {Feed-Forward {SceneDINO} for Unsupervised Semantic Scene Completion},
booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
year = {2025},
}
This repository is based on the Behind The Scenes (BTS) code base.