
Using Vision Transformers to Improve the Aggregation of Diffusion Features for Object Pose Estimation

In this project, I aimed to improve on the results of the paper "Object Pose Estimation via the Aggregation of Diffusion Features" (https://arxiv.org/abs/2403.18791), which implemented a template matching approach for object pose estimation. In their method, the authors compare features extracted from Stable Diffusion 1.5 for a query image and its candidate templates to identify the closest match, treating the matched template as the estimated object pose. They developed three separate aggregation networks that perform a weighted average over the diffusion model's features to produce a feature map that best represents the image or template. In my work, I integrated vision transformers into the architecture of the context-aware weight aggregator, with the aim of capturing global image details and improving the feature weighting process. An image of the modified architecture is shown below.


Architecture Diagram
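
To make the pipeline concrete, below is a minimal sketch (in PyTorch, not the code from this repository) of the core idea: per-patch aggregation weights over multi-layer diffusion features are predicted by a small transformer encoder, and the query is matched to the template whose aggregated features have the highest cosine similarity. All sizes (feature dimension, number of diffusion layers, number of patches) and module names are illustrative assumptions.

# Minimal sketch of ViT-weighted aggregation + template matching (illustrative only).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViTWeightAggregator(nn.Module):
    def __init__(self, feat_dim=256, n_layers=3, n_heads=4, depth=2):
        super().__init__()
        # Transformer encoder lets every patch attend to the whole image
        # before the per-layer aggregation weights are predicted.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.weight_head = nn.Linear(feat_dim, n_layers)

    def forward(self, layer_feats):
        # layer_feats: (batch, n_layers, n_patches, feat_dim) diffusion features
        context = self.encoder(layer_feats.mean(dim=1))        # (b, p, d)
        weights = self.weight_head(context).softmax(dim=-1)    # (b, p, n_layers)
        weights = weights.permute(0, 2, 1).unsqueeze(-1)       # (b, n_layers, p, 1)
        return (weights * layer_feats).sum(dim=1)              # weighted average, (b, p, d)

def match_templates(query_feat, template_feats):
    # Cosine similarity between the aggregated query feature and each template;
    # the pose of the best-scoring template is taken as the estimate.
    q = F.normalize(query_feat.flatten(1), dim=-1)             # (1, p*d)
    t = F.normalize(template_feats.flatten(1), dim=-1)         # (n_templates, p*d)
    return (t @ q.T).squeeze(-1).argmax()

if __name__ == "__main__":
    agg = ViTWeightAggregator()
    query = agg(torch.randn(1, 3, 196, 256))        # one query image
    templates = agg(torch.randn(42, 3, 196, 256))   # 42 candidate templates
    print("Closest template index:", match_templates(query, templates).item())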

I trained the model on data from LINEMOD and Occlusion LINEMOD for five epochs. Unfortunately, no improvement was observed with the modified architecture. The results showed that accuracy on seen and unseen LINEMOD objects and on seen Occlusion LINEMOD objects was comparable to the baseline, while performance dropped on unseen Occlusion LINEMOD objects, with high standard deviation across all splits of the data. The graphs below show how the accuracy changed after every epoch.


Accuracy Graphs

Installation

Click to expand

1. Clone this repo.

git clone https://github.com/RyanV27/diffusion-object-pose.git

2. Set up and activate the conda environment.

conda env create -f environment.yaml
conda activate diff-feats

Data Preparation

Click to expand

Final structure of the ./dataset folder:

./dataset
    ├── linemod 
        ├── models
        ├── opencv_pose
        ├── LINEMOD
        ├── occlusionLINEMOD
    ├── templates	
        ├── linemod
            ├── train
            ├── test
    ├── LINEMOD.json # query-template pairs for LINEMOD
    ├── occlusionLINEMOD.json # query-template pairs for Occlusion LINEMOD
    └── crop_image512 # pre-cropped images for LINEMOD

1. Download datasets:

Download the datasets from the following Google Drive links and unzip them in ./dataset. I use the same data as template-pose.

2. Process ground-truth poses

Convert the coordinate system to the BOP dataset format and save the GT poses of each object separately:

python -m data.process_gt_linemod

3. Render templates

To render templates:

python -m data.render_templates --dataset linemod --disable_output --num_workers 4

4. Crop images

Crop the images of LINEMOD, Occlusion LINEMOD, and their templates using the GT poses:

python -m data.crop_image_linemod

5. Compute neighbors with GT poses

python -m data.create_dataframe_linemod

Launch a training run

Click to expand
python train_linemod.py --config_path config_run/LM_Diffusion_$split_name.json
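
Here, $split_name is a placeholder for the name of a data split, as defined by the JSON config files in config_run/. As a purely hypothetical example (substitute an actual config file name from config_run/ in this repository):

python train_linemod.py --config_path config_run/LM_Diffusion_split1.json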
