
Feed-Forward SceneDINO for Unsupervised Semantic Scene Completion

Aleksandar Jevtić* (1), Christoph Reich* (1,2,4,5), Felix Wimbauer (1,4), Oliver Hahn (2), Christian Rupprecht (3), Stefan Roth (2,5,6), Daniel Cremers (1,4,5)

(1) TU Munich  (2) TU Darmstadt  (3) University of Oxford  (4) MCML  (5) ELIZA  (6) hessian.AI  (*) Equal contribution

ICCV 2025

Paper (PDF) | Project Page

TL;DR: SceneDINO is unsupervised and infers 3D geometry and features from a single image in a feed-forward manner. Distilling and clustering SceneDINO's 3D feature field results in unsupervised semantic scene completion predictions. SceneDINO is trained using multi-view self-supervision.

Abstract

Semantic scene completion (SSC) aims to infer both the 3D geometry and semantics of a scene from single images. In contrast to prior work on SSC that heavily relies on expensive ground-truth annotations, we approach SSC in an unsupervised setting. Our novel method, SceneDINO, adapts techniques from self-supervised representation learning and 2D unsupervised scene understanding to SSC. Our training exclusively utilizes multi-view consistency self-supervision without any form of semantic or geometric ground truth. Given a single input image, SceneDINO infers the 3D geometry and expressive 3D DINO features in a feed-forward manner. Through a novel 3D feature distillation approach, we obtain unsupervised 3D semantics. In both 3D and 2D unsupervised scene understanding, SceneDINO reaches state-of-the-art segmentation accuracy. Linear probing our 3D features matches the segmentation accuracy of a current supervised SSC approach. Additionally, we showcase the domain generalization and multi-view consistency of SceneDINO, taking the first steps towards a strong foundation for single image 3D scene understanding.

News

  • 09/07/2025: arXiv preprint and code released. 🚀

Setup (Installation & Datasets)

Python Environment

Our Python environment is managed with Conda.

conda env create -f environment.yml
conda activate scenedino

Datasets

We provide configuration files for the datasets SceneDINO is trained and evaluated on. Adjust these files to your setup; most importantly, insert the paths to your local copies of the data. A sketch of the kind of edit required follows the list below.

configs/dataset/kitti_360_sscbench.yaml
configs/dataset/cityscapes_seg.yaml
configs/dataset/bdd_seg.yaml
configs/dataset/realestate10k.yaml
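
As an illustration, a dataset config will typically contain a path entry that has to point at your local data root. The key name below is hypothetical and only sketches the kind of edit required; check the respective YAML file for its actual fields.

# Hypothetical sketch of a dataset config entry; the real key names may differ.
data_path: /path/to/KITTI-360   # root directory of your local dataset copy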

KITTI-360

To download KITTI-360, create an account and follow the instructions on the official website. We require the perspective images, fisheye images, raw velodyne scans, calibrations, and vehicle poses.
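
For orientation, the official download unpacks into roughly the layout below. This sketch is inferred from the components listed above; verify the exact structure against the KITTI-360 documentation.

KITTI-360/
├── calibration/    # camera and sensor calibration files
├── data_2d_raw/    # perspective and fisheye images
├── data_3d_raw/    # raw Velodyne scans
└── data_poses/     # vehicle poses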

Checkpoints

Our pre-trained checkpoints are stored on the CVG webshare. Download a checkpoint using the dedicated script below. To replicate our ORB-SLAM3 results, we provide the estimated poses in datasets/kitti_360/orb_slam_poses.

# Download models trained on KITTI-360 (SSCBench split)
python download_checkpoint.py ssc-kitti-360-dino
python download_checkpoint.py ssc-kitti-360-dino-orb-slam
python download_checkpoint.py ssc-kitti-360-dinov2
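
To sanity-check a download, you can inspect the checkpoint file with PyTorch. The file name below is a placeholder for whatever the download script saved on your machine, and the assumption that the checkpoint is a (nested) dict is not guaranteed by the repository.

import torch

# Load the downloaded checkpoint on the CPU. The file name is a placeholder
# for the file produced by download_checkpoint.py.
ckpt = torch.load("ssc-kitti-360-dino.pt", map_location="cpu")

# Checkpoints are typically nested dicts; listing the top-level keys shows
# what is stored (model weights, optimizer state, config, ...).
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))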

Table 1. SSCBench-KITTI-360 results. We compare SceneDINO to the STEGO + S4C baseline in unsupervised SSC using the mean intersection over union (mIoU) in % at three evaluation ranges.

Method                      | Checkpoint                  | mIoU 12.8m | mIoU 25.6m | mIoU 51.2m
Baseline (STEGO + S4C)      | -                           | 10.53      | 9.26       | 6.60
SceneDINO                   | ssc-kitti-360-dino          | 10.76      | 10.01      | 8.00
SceneDINO (ORB-SLAM3 poses) | ssc-kitti-360-dino-orb-slam | 10.88      | 9.86       | 7.88
SceneDINO (DINOv2)          | ssc-kitti-360-dinov2        | 13.76      | 11.78      | 9.08

Inference Demo Script

This demo script shows how to load a model and run inference, both in 3D and as rendered 2D views. It can serve as a starting point for experimenting with SceneDINO feature fields.

python demo_script.py -h

# First image of kitti-360 test set
python demo_script.py --ckpt <PATH-MODEL-CKPT>
# Custom image
python demo_script.py --ckpt <PATH-MODEL-CKPT> --image <PATH-DEMO-IMAGE>

Training

For unsupervised SSC, training is performed in two stages. We provide a training configuration in configs/ for each stage.

SceneDINO

First, the 3D feature fields of SceneDINO are trained.

python train.py -cn train_scenedino_kitti_360
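
The -cn flag indicates that train.py loads its configuration by name, in the style of Hydra, so individual settings can presumably be overridden on the command line. The override key below is a hypothetical example, not a documented option; consult the files in configs/ for the actual keys.

# Hypothetical Hydra-style override; the key name is an assumption, see configs/ for real options.
python train.py -cn train_scenedino_kitti_360 batch_size=4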

Unsupervised SSC

Based on a SceneDINO checkpoint, we train the unsupervised SSC head.

python train.py -cn train_semantic_kitti_360
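
Note that the stage-2 configuration has to reference the SceneDINO checkpoint produced in stage 1. Whether this path is set inside the config file or passed as a command-line override is an assumption here; the key below is purely illustrative.

# Purely illustrative: point stage-2 training at a stage-1 checkpoint (key name is hypothetical).
python train.py -cn train_semantic_kitti_360 checkpoint=<PATH-SCENEDINO-CKPT>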

Logging

We use TensorBoard to keep track of losses, metrics, and qualitative results.

tensorboard --port 8000 --logdir out/

Evaluation

We further provide configurations to reproduce the evaluation results from the paper.

Unsupervised 2D Segmentation

# Unsupervised 2D Segmentation
python eval.py -cn evaluate_semantic_kitti_360

Unsupervised SSC

# Unsupervised SSC, adapted from S4C (https://github.com/ahayler/s4c)
python evaluate_model_sscbench.py -ssc <PATH-SSCBENCH> -vgt <PATH-SSCBENCH-LABELS> -cp <PATH-CHECKPOINT>.pt -f -m scenedino -p <RUN-NAME>

Citation

If you find our work useful, please consider citing our paper.

@inproceedings{Jevtic:2025:SceneDINO,
    author    = {Aleksandar Jevti{\'c} and
                 Christoph Reich and
                 Felix Wimbauer and
                 Oliver Hahn and
                 Christian Rupprecht and
                 Stefan Roth and
                 Daniel Cremers},
    title     = {Feed-Forward {SceneDINO} for Unsupervised Semantic Scene Completion},
    booktitle = {IEEE/CVF International Conference on Computer Vision (ICCV)},
    year      = {2025},
}

Acknowledgements

This repository is based on the Behind The Scenes (BTS) code base.