Fengxiang Wang1
Hongzhen Wang2,‡
Di Wang3
Zonghao Guo2
Zhenyu Zhong4
Long Lan1,‡
Wenjing Yang1,‡
Jing Zhang3,‡
1 National University of Defense Technology
2Tsinghua University
3Wuhan University
4Nankai University
📃 Paper | 🤗 OpticalRS-4M | 🤗 OpticalRS-13M | 🤗 Models
- Dataset: OpticalRS-13M is a large-scale remote sensing dataset comprising 13 million optical images. It is designed to fully leverage the representation learning capabilities of MIM methods in RS applications and is distinguished by its diverse scene details. We also offer a lighter version named OpticalRS-4M.
- SelectiveMAE: A novel and efficient MIM method tailored for remote sensing images. It incorporates a new PSTS module, which significantly accelerates convergence and enhances representation learning compared to the original MIM approach.
- Initial release of SelectiveMAE checkpoints.
- Pretraining code and configs for SelectiveMAE have been released.
- The OpticalRS-4M dataset has been released.
- The OpticalRS-13M dataset will be released.
- Code and configs for the downstream scene classification task.
- Code and configs for the downstream object detection and semantic segmentation tasks.
- 2025.08: The object detection and semantic segmentation code has been released.
- 2025.07: The classification code has been released.
- 2025.06: SelectiveMAE has been accepted by ICCV 2025.
- 2025.06: OpticalRS-13M has been released on 🤗HuggingFace.
- 2025.06: Models have been released on 🤗HuggingFace.
- 2025.06: OpticalRS-4M has been released on 🤗HuggingFace.
- 2025.06: The pretraining code of SelectiveMAE has been released.
- 2024.06: The paper has been released on arXiv.
- 2024.06: The training logs and checkpoints of SelectiveMAE have been released.
OpticalRS-4M
The dataset is available on 🤗HuggingFace via OpticalRS-4M. Unzip it with one of the following:
# if 7z is available
7z x OpticalRS-4M.zip
# if zip and unzip are available
zip -s 0 OpticalRS-4M.zip --out whole.zip
unzip whole.zip
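As a quick sanity check after extraction, you can count the unpacked image files; the folder name and file extensions below are assumptions, so adjust them to the actual archive layout.

# Count extracted image files (folder name and extensions are assumptions; adjust as needed)
find OpticalRS-4M -type f \( -iname '*.jpg' -o -iname '*.png' -o -iname '*.tif' \) | wc -l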
OpticalRS-4M offers a significantly larger and more diverse image set than previous datasets. To evaluate its effectiveness, we pre-train a ViT-Base model using the vanilla MAE method and compare against pre-training on the MillionAID dataset, keeping the total number of training samples seen equal: 800 epochs on MillionAID's 1 million images versus 200 epochs on the 4 million images of OpticalRS-4M, i.e., roughly 800 million image views in either case.
Scene classification (AID, RESISC-45), object detection (DIOR, DIOR-R), and semantic segmentation (LoveDA, SpaceNetv1) results are reported below.

Dataset | Pretrained model | Number of Images | Epochs | AID OA (TR=20%/50%) | RESISC-45 OA (TR=20%/50%) | DIOR mAP50 | DIOR-R mAP50 | LoveDA mIoU | SpaceNetv1 mF1
---|---|---|---|---|---|---|---|---|---
MillionAID | Weights | 1 million | 800 | 94.92/97.38 | 89.20/93.60 | 71.80 | 62.33 | 51.24 | 79.24 |
OpticalRS-4M | Weights | 2 million | 400 | 96.64/98.10 | 91.80/94.31 | 73.90 | 65.95 | 52.86 | 79.37 |
OpticalRS-4M | Weights | 3 million | 267 | 96.67/98.18 | 92.24/94.41 | 75.40 | 67.07 | 52.39 | 79.37 |
OpticalRS-4M | Weights | 4 million | 200 | 96.10/98.03 | 92.38/94.30 | 74.70 | 66.26 | 52.75 | 79.23 |
OpticalRS-4M | Weights | 4 million | 800 | 96.88/98.22 | 92.44/94.43 | 75.40 | 67.35 | 52.80 | 79.41 |
OpticalRS-13M
The dataset is available on 🤗HuggingFace via OpticalRS-13M. Follow the same unzip steps as for OpticalRS-4M.
Please install the pretraining dependencies listed in `SelectiveMAE/requirements.txt`:
# Optionally create a conda environment
conda create -n selectivemae python=3.10 -y
conda activate selectivemae
# Install dependencies
pip install -r requirements.txt
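Before launching distributed pre-training, it can help to confirm that the CUDA build of PyTorch sees your GPUs; the following optional check prints the PyTorch version and the number of visible devices.

# Optional sanity check before launching torchrun
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"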
To pre-train ViT-Base, run the following on 8 GPUs:
torchrun --nproc_per_node=8 --nnodes 1 --master_port 16666 main_pretrain.py \
    --batch_size 256 --selectivemae --dataset opticalrs-4m --dataset_path 'your_dataset_path' \
    --model mae_vit_base_patch16 --output_dir output --norm_pix_loss \
    --blr 1.5e-4 --weight_decay 0.05 --num_workers 12 --decoder_depth 12 \
    --mask_ratio 0.85 --kept_mask_ratio 0.25 --epochs 800 --warmup_epochs 30
First, download the corresponding dataset, set `--dataset` to `opticalrs-4m` or `opticalrs-13m`, and update `--dataset_path` accordingly. To train ViT-Small or ViT-Large, set `--model mae_vit_small_patch16` or `--model mae_vit_large_patch16`. If the batch size does not fit into GPU memory, use `--accum_iter` to perform gradient accumulation. FlashAttention 2 should be installed with `pip install flash-attn --no-build-isolation`.
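For example, if a per-GPU batch of 256 does not fit, you can halve `--batch_size` and set `--accum_iter 2` so the effective batch size (batch size × accumulation steps × GPUs) stays the same; the values below are only an illustration, with all other arguments unchanged from the command above.

# Illustrative only: keep the effective batch size at 128 x 2 x 8 = 2048 via gradient accumulation
torchrun --nproc_per_node=8 --nnodes 1 --master_port 16666 main_pretrain.py \
    --batch_size 128 --accum_iter 2 --selectivemae --dataset opticalrs-4m --dataset_path 'your_dataset_path' \
    --model mae_vit_base_patch16 --output_dir output --norm_pix_loss \
    --blr 1.5e-4 --weight_decay 0.05 --num_workers 12 --decoder_depth 12 \
    --mask_ratio 0.85 --kept_mask_ratio 0.25 --epochs 800 --warmup_epochs 30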
Model | Publication | Backbone | AID OA (TR=20%/50%) | RESISC-45 OA (TR=20%/50%) | DIOR mAP50 | DIOR-R mAP50 | LoveDA mIoU | SpaceNetv1 mF1
---|---|---|---|---|---|---|---|---
SeCo | ICCV'21 | ResNet-50 | 93.47/95.99 | 89.64/92.91 | - | - | 43.63 | 77.09 |
GASSL | ICCV'21 | ResNet-50 | 93.55/95.92 | 90.86/93.06 | 67.40 | 65.65 | 48.76 | 78.51 |
TOV | JSTARS'23 | ResNet-50 | 95.16/97.09 | 90.97/93.79 | 70.16 | 66.33 | 49.70 | - |
CACo | CVPR'23 | ResNet-50 | 90.88/95.05 | 88.28/91.94 | 66.91 | 64.10 | 48.89 | 77.94 |
SatMAE | NIPS'22 | ViT-L | 95.02/96.94 | 91.72/94.10 | 70.89 | 65.66 | - | 78.07 |
ScaleMAE | ICCV'23 | ViT-L | 96.44/97.58 | 92.63/95.04 | 73.81 | 66.47 | - | - |
SSL4EO | GRSM'23 | ViT-S | 91.06/94.74 | 87.60/91.27 | 64.82 | 61.23 | - | - |
RingMo | TGRS'22 | Swin-B | 96.90/98.34 | 94.25/95.67 | 75.90 | - | - | - |
SatLas | ICCV'23 | Swin-B | 94.96/97.38 | 92.16/94.70 | 74.10 | 67.59 | - | - |
GFM | ICCV'23 | Swin-B | 95.47/97.09 | 92.73/94.64 | 72.84 | 67.67 | - | - |
RVSA | TGRS'23 | ViT-B+RVSA | 97.03/98.50 | 93.93/95.69 | 75.80 | 68.06 | 51.95 | - |
SelectiveMAE(OpticalRS-4M) | Baidu & HuggingFace | ViT-B | 96.90/98.12 | 93.35/94.58 | 75.70 | 67.78 | 53.05 | 79.50 |
SelectiveMAE(OpticalRS-4M) | Baidu & HuggingFace | ViT-L | 97.25/98.48 | 94.57/95.77 | 77.80 | 70.31 | 54.31 | 79.46 |
SelectiveMAE(OpticalRS-13M) | Baidu & HuggingFace | ViT-B | 97.10/98.28 | 93.70/95.48 | 75.80 | 67.69 | 52.68 | 79.44 |
SelectiveMAE(OpticalRS-13M) | Baidu & HuggingFace | ViT-L | 97.49/98.52 | 94.73/96.36 | 78.70 | 71.75 | 53.92 | 79.48 |
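To fine-tune from one of the released checkpoints, you can fetch it from the HuggingFace Hub; the repository id below is a placeholder, so substitute the actual model repo linked in the table above.

# Download a released checkpoint (repo id is a placeholder; use the HuggingFace links above)
pip install -U huggingface_hub
huggingface-cli download <hf_repo_id> --local-dir ./checkpoints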
We evaluate the pretrained SelectiveMAE weights on the AID and NWPU-RESISC45 datasets under different training ratios (TR). Configuration files for both datasets are provided under the `Classification` folder; use the files in the `Classification/AID` and `Classification/NWPU-RESISC45` subfolders to run experiments with the corresponding TR settings.
The classification experiments can be completed on a single GPU; run the following:
# Note: main_finetune.py is an assumed entry point; replace it with the repo's classification training script if it differs
torchrun --nproc_per_node=1 --nnodes 1 --master_port 1888 main_finetune.py \
    --dataset 'aid' --model 'vit_base_patch16' --postfix 'sota' \
    --batch_size 1024 --epochs 100 --warmup_epochs 5 \
    --blr 1e-3 --weight_decay 0.05 --split 19 --tag 0 --exp_num 1 \
    --data_path 'your_dataset_path' \
    --finetune 'your_checkpoint.pth'
The `--split` argument can be changed to match your setting, e.g., 19, 28, or 55.
Training on DIOR using Faster R-CNN with a SelectiveMAE-pretrained ViT-L backbone:
srun -J mmdet -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/train.py configs/wfx/vit-l-frcn-800-proposed-dior.py \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/dior/vit-l-frcn-800-proposed-dior \
--launcher="slurm"
Then run testing and generate detection results:
srun -J mmdet -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/test.py configs/wfx/vit-l-frcn-800-proposed-dior.py \
/diwang/work_dir/wfx_iccv/finetune/dior/vit-l-frcn-800-proposed-dior/epoch_12.pth \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/dior/vit-l-frcn-800-proposed-dior/predict \
--show-dir=/diwang/work_dir/wfx_iccv/finetune/dior/vit-l-frcn-800-proposed-dior/predict/show \
--launcher="slurm" --cfg-options val_cfg=None val_dataloader=None val_evaluator=None
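If you are not running on a Slurm cluster, MMDetection's standard distributed launch scripts should also work; the commands below are a sketch that assumes the stock `tools/dist_train.sh` and `tools/dist_test.sh` scripts are kept in this codebase.

# Non-Slurm alternative (assumes the stock MMDetection dist scripts; adjust GPU count and paths)
bash tools/dist_train.sh configs/wfx/vit-l-frcn-800-proposed-dior.py 4 --work-dir work_dirs/vit-l-frcn-800-proposed-dior
bash tools/dist_test.sh configs/wfx/vit-l-frcn-800-proposed-dior.py work_dirs/vit-l-frcn-800-proposed-dior/epoch_12.pth 4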
(Using MMRotate 1.0.0rc1) Training on DIOR-R using Oriented R-CNN with a SelectiveMAE-pretrained ViT-L backbone:
srun -J mmrot -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/train.py configs/wfx/vit-l-orcn-800-proposed-diorr.py \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/diorr/vit-l-orcn-800-proposed-diorr \
--launcher="slurm"
(Using MMRotate 1.0.0rc1) Testing on DIOR-R for evaluation and visualization of detection maps:
srun -J mmrot -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/test.py configs/wfx/vit-l-orcn-800-proposed-diorr.py \
/diwang/work_dir/wfx_iccv/finetune/diorr/vit-l-orcn-800-proposed-diorr/epoch_12.pth \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/diorr/vit-l-orcn-800-proposed-diorr/predict \
--show-dir=/diwang/work_dir/wfx_iccv/finetune/diorr/vit-l-orcn-800-proposed-diorr/predict/show \
--launcher="slurm" --cfg-options val_cfg=None val_dataloader=None val_evaluator=None
Training on SpaceNetv1 using UperNet with a SelectiveMAE-pretrained ViT-L backbone:
srun -J mmseg -p gpu --gres=dcu:4 --ntasks=8 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/train.py configs/wfx/vit-l-upernet-384-proposed-spacenetv1.py \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/spacenetv1/vit-l-upernet-384-proposed-spacenetv1 \
--launcher="slurm" --cfg-options 'find_unused_parameters'=True
Testing on SpaceNetv1 for accuracy evaluation and generating prediction maps:
srun -J mmseg -p gpu --gres=dcu:4 --ntasks=8 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/test.py configs/wfx/vit-l-upernet-384-proposed-spacenetv1.py \
/diwang/work_dir/wfx_iccv/finetune/spacenetv1/vit-l-upernet-384-proposed-spacenetv1/iter_80000.pth \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/spacenetv1/vit-l-upernet-384-proposed-spacenetv1/predict \
--show-dir=/diwang/work_dir/wfx_iccv/finetune/spacenetv1/vit-l-upernet-384-proposed-spacenetv1/predict/show \
--launcher="slurm" --cfg-options val_cfg=None val_dataloader=None val_evaluator=None
Online evaluation: testing on LoveDA to generate prediction maps and results for submission to the online evaluation server:
srun -J mmseg -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/test.py configs/wfx/vit-l-upernet-512-proposed-loveda.py \
/diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/iter_80000.pth \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/predict \
--out=/diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/predict/submit \
--show-dir=/diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/predict/show \
--launcher="slurm" --cfg-options val_cfg=None val_dataloader=None val_evaluator=None
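The LoveDA benchmark is scored on an online server, so the predictions written to the `--out` directory above need to be packaged before uploading; the archive name below is arbitrary, and the exact file layout expected by the server should be checked on the challenge page.

# Package the dumped predictions for upload (archive name is arbitrary; verify the required layout on the LoveDA challenge page)
cd /diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/predict/submit
zip -r ../loveda_submission.zip .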
If you find SelectiveMAE helpful, please consider citing:
@article{selectivemae,
title={Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling},
author={Fengxiang Wang and Hongzhen Wang and Di Wang and Zonghao Guo and Zhenyu Zhong and Long Lan and Wenjing Yang and Jing Zhang},
year={2025},
journal={arXiv preprint arXiv:2406.11933},
}
This work is released under the Apache License 2.0. Some components of this codebase may be covered by other licenses; please check LICENSE.md carefully, especially if you intend to use our code for commercial purposes.