Dong Bao, Jun Zhou, Gervase Tuxworth, Jue Zhang, Yongsheng Gao
Install the following packages.
- python >= 3.10
- pytorch >= 2.0
- faiss-gpu >= 1.7.4
- torchvision >= 0.15.2
- torchmetrics >= 1.4.0
- opencv >= 4.6.0
- pydensecrf = 1.0rc3
- scikit-learn >= 1.1.3
- scikit-image >= 0.21.0
- einops >= 0.3.2
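A quick way to confirm the environment is set up is to import the packages and print their versions. Below is a minimal sketch; the Python import names (e.g., faiss for faiss-gpu, cv2 for opencv) are the usual ones and assumed here.

```python
# Quick environment check: import the core dependencies and print their versions.
import sys
import torch, torchvision, torchmetrics, faiss, cv2, sklearn, skimage, einops

print("python      :", sys.version.split()[0])
print("pytorch     :", torch.__version__, "| cuda available:", torch.cuda.is_available())
print("torchvision :", torchvision.__version__)
print("torchmetrics:", torchmetrics.__version__)
print("faiss       :", faiss.__version__)
print("opencv      :", cv2.__version__)
print("scikit-learn:", sklearn.__version__)
print("scikit-image:", skimage.__version__)
print("einops      :", einops.__version__)
```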
Please download the data and organize it following the structures below for COCO-Stuff and Pascal VOC (a small layout-check sketch follows the trees).
COCO-Stuff dataset root
├── stuffthingmaps_trainval2017
│   ├── train2017
│   │   ├── *.png
│   │   └── ...
│   └── val2017
│       ├── *.png
│       └── ...
├── train2017
│   ├── *.jpg
│   └── ...
├── val2017
│   ├── *.jpg
│   └── ...
├── Coco164kFull_Stuff_Coarse.txt
├── Coco164kFull_Stuff_Coarse_7.txt
└── cocostuff10k.txt
Pascal VOC dataset root
├── SegmentationClass
│   ├── *.png
│   └── ...
├── SegmentationClassAug    # segmentation masks from the trainaug extension
│   ├── *.png
│   └── ...
├── JPEGImages
│   ├── *.jpg
│   └── ...
└── ImageSets
    └── Segmentation
        ├── train.txt
        ├── trainaug.txt
        └── val.txt
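The following minimal sketch checks that the expected folders and split files are in place; the two root paths are placeholders and should point at your local dataset roots.

```python
# Verify the dataset layouts described above. The root paths are placeholders.
from pathlib import Path

COCO_ROOT = Path("/path/to/cocostuff")   # placeholder
PVOC_ROOT = Path("/path/to/pascal_voc")  # placeholder

coco_expected = [
    "stuffthingmaps_trainval2017/train2017",
    "stuffthingmaps_trainval2017/val2017",
    "train2017",
    "val2017",
    "Coco164kFull_Stuff_Coarse.txt",
    "Coco164kFull_Stuff_Coarse_7.txt",
    "cocostuff10k.txt",
]
pvoc_expected = [
    "SegmentationClass",
    "SegmentationClassAug",
    "JPEGImages",
    "ImageSets/Segmentation/train.txt",
    "ImageSets/Segmentation/trainaug.txt",
    "ImageSets/Segmentation/val.txt",
]

for root, expected in [(COCO_ROOT, coco_expected), (PVOC_ROOT, pvoc_expected)]:
    for rel in expected:
        status = "ok" if (root / rel).exists() else "MISSING"
        print(f"[{status}] {root / rel}")
```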
We release the weights of the trained HCL models. The backbone of HCL is PM-ViT, which is kept frozen during model training. For PM-ViT-S/16, we load the DINO-pretrained ViT weights "dino_deitsmall16_pretrain.pth" (you can download them either from the link in the table below or from the DINO GitHub repo). For PM-ViT-S/8, we load the DINO-pretrained ViT weights "dino_deitsmall8_pretrain.pth". Seghead and linear classifier weights are also provided.
| Dataset | Backbone | Pretrained ViT | Seghead | Linear Classifier |
|---|---|---|---|---|
| PVOC | PM-ViT-S/16 | link | link | link |
| PVOC | PM-ViT-S/8 | link | link | link |
| COCO-Stuff | PM-ViT-S/16 | link | link | |
| COCO-Stuff | PM-ViT-S/8 | link | link | |
Create a folder "weights" in the root folder with the following structure:
weights
├── linear_classifier_weights
├── pretrain
└── seghead_weights
Then download the checkpoints: put the DINO-pretrained weights in the "pretrain" folder, the Seghead weights in the "seghead_weights" folder, and the linear classifier weights in the "linear_classifier_weights" folder.
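The downloaded state dicts can be inspected with plain PyTorch before running the scripts. A minimal sketch is below; the seghead and linear classifier file names are illustrative placeholders, so replace them with the actual files you downloaded.

```python
# Inspect the downloaded checkpoints. Seghead/linear file names are examples only.
import torch

ckpts = {
    "pretrain": "weights/pretrain/dino_deitsmall16_pretrain.pth",
    "seghead": "weights/seghead_weights/seghead_checkpoint.pth",          # illustrative name
    "linear": "weights/linear_classifier_weights/linear_checkpoint.pth",  # illustrative name
}

for name, path in ckpts.items():
    state = torch.load(path, map_location="cpu")
    # Some checkpoints wrap the weights in a dict, e.g. under "state_dict".
    if isinstance(state, dict) and "state_dict" in state:
        state = state["state_dict"]
    print(f"{name}: {len(state)} entries, e.g. {list(state)[:3]}")
```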
To train HCL, open main_hcl.py and adjust the corresponding hyperparameters, then run:
python main_hcl.py --epochs 10 --batch-size 64 --dist-url 'tcp://0.0.0.0:10001' --multiprocessing-distributed --world-size 1 --rank 0
To evaluate the linear classifier, open "linear_eval.py", select a configuration from "eval_config", and set "selected_config" accordingly, then run:
python linear_eval.py --batch-size 16 --gpu 0
To evaluate overclustering performance, open "overclustering_eval.py", select a configuration from "eval_config", and set "selected_config" accordingly, then run:
python overclustering_eval.py --batch-size 16 --gpu 0
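For reference, overclustering evaluation is commonly done by mapping each predicted cluster to the ground-truth class it overlaps most and then computing mIoU. The sketch below illustrates that generic protocol; it is not the code in "overclustering_eval.py", and all names are assumptions.

```python
# Generic many-to-one overclustering evaluation sketch (illustrative only).
import numpy as np

def overclustering_miou(pred, gt, n_clusters, n_classes):
    """pred, gt: flat integer label arrays of equal length."""
    conf = np.zeros((n_clusters, n_classes), dtype=np.int64)
    np.add.at(conf, (pred, gt), 1)      # cluster-vs-class co-occurrence counts
    mapping = conf.argmax(axis=1)       # each cluster -> its majority GT class
    remapped = mapping[pred]
    ious = []
    for cls in range(n_classes):
        inter = np.sum((remapped == cls) & (gt == cls))
        union = np.sum((remapped == cls) | (gt == cls))
        if union > 0:
            ious.append(inter / union)
    return float(np.mean(ious))
```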
We propose the Parallel Multi-level Vision Transformer (PM-ViT), a specially designed backbone that captures multi-level object granularities and aggregates hierarchical contextual information into unified object component tokens.
We propose Hierarchical Context Learning (HCL) of object components for unsupervised semantic segmentation (USS), which focuses on learning discriminative spatial token embeddings by enhancing semantic consistency through hierarchical context. At the core of HCL is PM-ViT, a specially designed backbone that integrates multi-level hierarchical contextual information into unified token representations. To uncover the intrinsic semantic structures of objects, we introduce Momentum-based Global Foreground-Background Clustering (MoGoClustering). Leveraging DINO's foreground extraction capability, MoGoClustering clusters foreground and background object components into coherent semantic groups; it initializes cluster centroids and iteratively refines them during optimization to achieve robust semantic grouping. Furthermore, coupled with a dense prediction loss, we design a Foreground-Background-Aware (FBA) contrastive loss based on MoGoClustering to ensure that the learned dense representations are compact and consistent across views.
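The full procedure is described in the paper. Purely as an illustration of the momentum-based centroid refinement idea, a rough sketch is given below; the cosine-similarity assignment, tensor shapes, and momentum value are assumptions, not the repository's implementation.

```python
# Rough illustration of momentum-based centroid refinement (NOT the repo's code).
import torch
import torch.nn.functional as F

def momentum_update_centroids(centroids, tokens, momentum=0.99):
    """Assign each token to its nearest centroid (cosine similarity) and
    update the matched centroids with an exponential moving average."""
    tokens = F.normalize(tokens, dim=-1)        # (N, D) token embeddings
    centroids = F.normalize(centroids, dim=-1)  # (K, D) cluster centroids
    assign = (tokens @ centroids.t()).argmax(dim=1)  # (N,) cluster IDs
    new_centroids = centroids.clone()
    for k in assign.unique():
        mean_k = tokens[assign == k].mean(dim=0)
        new_centroids[k] = momentum * centroids[k] + (1 - momentum) * mean_k
    return F.normalize(new_centroids, dim=-1), assign

# Example: 512 foreground tokens of dimension 384 clustered into 32 groups.
fg_tokens = torch.randn(512, 384)
fg_centroids = torch.randn(32, 384)
fg_centroids, cluster_ids = momentum_update_centroids(fg_centroids, fg_tokens)
```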
We evaluate HCL on the PVOC and COCO-Stuff datasets.
We compare the unsupervised foreground extraction results (in green) of HCL with DINO cls token attention maps (in red).
Object component representation visualization on the PVOC dataset using PM-ViT-S/16. The locations marked with a cross on the image are the query tokens; e.g., there is a cross on the bus wheel in the top-left image. Each query token is assigned a cluster ID from C_fg or C_bg; other tokens with the same cluster ID from other images are then visualized and presented to the right of the query images. Eight query tokens with different cluster IDs are included: 1) left 1: bus wheel; 2) left 2: car glass; 3) left 3: car wheel; 4) left 4: human upper face; 5) right 1: human mouth and jaw; 6) right 2: human hand; 7) right 3: cat ear; 8) right 4: dog nose and mouth.
If you find this project useful, please consider starring the repo or citing the paper. Cheers!
@article{bao2025hierarchical,
title={Hierarchical Context Learning of object components for unsupervised semantic segmentation},
author={Bao, Dong and Zhou, Jun and Tuxworth, Gervase and Zhang, Jue and Gao, Yongsheng},
journal={Pattern Recognition},
pages={111713},
year={2025},
publisher={Elsevier}
}