In this project, I aimed to improve upon the results of the paper "Object Pose Estimation via the Aggregation of Diffusion Features" (https://arxiv.org/abs/2403.18791), which implements a template-matching approach to object pose estimation. In their method, the authors compare features extracted from Stable Diffusion 1.5 for a query image and its candidate templates to identify the closest match, whose pose is taken as the estimated object pose. They developed three separate aggregation networks that perform a weighted average over the diffusion model's features to produce a feature map that best represents the image or template. In my work, I integrated vision transformers into the architecture of the context-aware weight aggregator, aiming to capture global image context and improve the feature weighting. An image of the modified architecture is shown below.
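To complement the figure, here is a minimal PyTorch sketch of the idea: pooled per-layer diffusion features become tokens, a small transformer encoder lets each token attend to all others, and the resulting weights drive the aggregation. All names, dimensions, and design details here are hypothetical illustrations, not the actual implementation:

```python
import torch
import torch.nn as nn

class ViTWeightAggregator(nn.Module):
    """Hypothetical sketch of a context-aware weight aggregator that uses a
    transformer encoder to produce per-layer weights for diffusion features."""

    def __init__(self, feat_dim: int = 256, num_heads: int = 4):
        super().__init__()
        # Project each layer's pooled feature to a common token dimension.
        self.proj = nn.Linear(feat_dim, feat_dim)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True
        )
        # Self-attention lets every layer token see all others, injecting
        # global context into the weighting decision.
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        self.weight_head = nn.Linear(feat_dim, 1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, L, C, H, W) -- one C-dim map per diffusion layer.
        B, L, C, H, W = feats.shape
        tokens = feats.mean(dim=(-2, -1))            # (B, L, C) pooled tokens
        tokens = self.encoder(self.proj(tokens))     # context-aware tokens
        w = self.weight_head(tokens).softmax(dim=1)  # (B, L, 1) weights
        # Weighted average over the layer axis -> single fused feature map.
        return (feats * w[..., None, None]).sum(dim=1)  # (B, C, H, W)
```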
I trained the model on data from LINEMOD and Occlusion LINEMOD for five epochs. Unfortunately, the modified architecture showed no improvement: accuracy on seen and unseen LINEMOD objects and on seen Occlusion LINEMOD objects was comparable to the baseline, while performance dropped on unseen Occlusion LINEMOD, with a high standard deviation across all data splits. The graphs below show how the accuracy changed after each epoch.
To set up the repository and environment:
```
git clone https://github.com/RyanV27/diffusion-object-pose.git
conda env create -f environment.yaml
conda activate diff-feats
```
Organize the dataset directory as follows:
```
./dataset
├── linemod
│   ├── models
│   └── opencv_pose
├── LINEMOD
├── occlusionLINEMOD
├── templates
│   └── linemod
│       ├── train
│       └── test
├── LINEMOD.json            # query-template pairs for LINEMOD
├── occlusionLINEMOD.json   # query-template pairs for Occlusion-LINEMOD
└── crop_image512           # pre-cropped images for LINEMOD
```
Download the data with the following Google Drive links and unzip it in `./dataset`. I use the same data as template-pose.
Convert the coordinate system to the BOP dataset format and save the GT poses of each object separately:
```
python -m data.process_gt_linemod
```
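As a rough illustration of what this step produces (not the repo's actual code): BOP-format ground truth stores, per image, a row-major rotation `cam_R_m2c` and a translation `cam_t_m2c` in millimetres. A minimal sketch of saving poses per object, with hypothetical file names and a hypothetical `obj_id`, might look like this:

```python
import json
import numpy as np

def save_bop_gt(poses, out_path):
    """Save per-image GT poses in BOP's scene_gt.json layout.

    poses: dict mapping image id -> (R, t), with R a 3x3 rotation matrix
    and t a translation vector in millimetres (BOP convention).
    """
    scene_gt = {
        str(im_id): [{
            "cam_R_m2c": np.asarray(R).flatten().tolist(),  # row-major 3x3
            "cam_t_m2c": np.asarray(t).flatten().tolist(),  # millimetres
            "obj_id": 1,  # hypothetical: one object per file
        }]
        for im_id, (R, t) in poses.items()
    }
    with open(out_path, "w") as f:
        json.dump(scene_gt, f, indent=2)
```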
To render templates:
```
python -m data.render_templates --dataset linemod --disable_output --num_workers 4
```
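Templates are renders of the object from viewpoints distributed over a sphere around it. The sketch below shows one common way to sample such viewpoints (a Fibonacci sphere), purely to illustrate the idea; the script's actual sampling scheme and the view count used here are assumptions:

```python
import numpy as np

def fibonacci_sphere(n_views: int, radius: float = 1.0) -> np.ndarray:
    """Return n_views camera positions spread evenly over a sphere."""
    i = np.arange(n_views)
    phi = np.pi * (3.0 - np.sqrt(5.0))       # golden angle
    y = 1.0 - 2.0 * (i + 0.5) / n_views      # uniform heights in (-1, 1)
    r = np.sqrt(1.0 - y * y)
    x, z = r * np.cos(phi * i), r * np.sin(phi * i)
    return radius * np.stack([x, y, z], axis=1)  # (n_views, 3)

# Each position is then used to render the CAD model looking at the origin,
# producing one template image per viewpoint.
cam_positions = fibonacci_sphere(642)  # 642 is a hypothetical view count
```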
Crop the images of LINEMOD, Occlusion LINEMOD, and their templates using the GT poses, then create the dataframes:
```
python -m data.crop_image_linemod
python -m data.create_dataframe_linemod
```
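The cropping step localizes the object from its GT pose. A minimal, hypothetical sketch of the idea (not the repo's actual code) is to project the object's 3D origin into the image with the camera intrinsics and cut a square patch around it; the 512-pixel size matches the `crop_image512` folder above:

```python
import numpy as np

def crop_around_object(img, K, t, crop_size=512):
    """Crop a square patch centred on the projected object origin.

    K: 3x3 camera intrinsics; t: GT translation (model-to-camera), so the
    object origin in the camera frame is R @ 0 + t = t.
    """
    u, v, w = K @ np.asarray(t)
    u, v = int(u / w), int(v / w)   # pixel coordinates of the centre
    half = crop_size // 2
    # Clamp so the crop stays inside the image bounds.
    top = min(max(v - half, 0), img.shape[0] - crop_size)
    left = min(max(u - half, 0), img.shape[1] - crop_size)
    return img[top:top + crop_size, left:left + crop_size]
```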
To train the model on a given split:
```
python train_linemod.py --config_path config_run/LM_Diffusion_$split_name.json
```
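For example, assuming the splits keep template-pose's naming (split1, split2, split3), a concrete run would be:

```
python train_linemod.py --config_path config_run/LM_Diffusion_split1.json
```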