Fengxiang Wang1
Hongzhen Wang2,‡
Di Wang3
Zonghao Guo2
Zhenyu Zhong4
Long Lan1,‡
Wenjing Yang1,‡
Jing Zhang3,‡
1 National University of Defense Technology
2Tsinghua University
3Wuhan University
4Nankai University
📃 Paper | 🤗 OpticalRS-4M | 🤗 OpticalRS-13M | 🤗 Models
- Dataset: OpticalRS-13M is a large-scale remote sensing dataset comprising 13 million optical images. It is designed to fully leverage the representation learning capabilities of MIM methods in RS applications and is distinguished by its diverse scene details. We also offer a lighter version named OpticalRS-4M.
- SelectiveMAE: A novel and efficient MIM method tailored for remote sensing images. It incorporates a new PSTS module, which significantly accelerates convergence and enhances representation learning compared to the original MIM approach.
- Initial release of SelectiveMAE checkpoints.
- Pretraining code and configs for SelectiveMAE have been released.
- The OpticalRS-4M dataset has been released.
- The OpticalRS-13M dataset will be released.
- Code and configs for the downstream scene classification task.
- Code and configs for the downstream object detection and semantic segmentation tasks.
- 2025.08: The object detection and semantic segmentation code has been released.
- 2025.07: The classification code has been released.
- 2025.06: SelectiveMAE has been accepted by ICCV 2025.
- 2025.06: OpticalRS-13M has been released on 🤗HuggingFace.
- 2025.06: Models have been released on 🤗HuggingFace.
- 2025.06: OpticalRS-4M has been released on 🤗HuggingFace.
- 2025.06: The pretraining code of SelectiveMAE has been released.
- 2024.06: The paper has been released on arXiv.
- 2024.06: The training logs and checkpoints of SelectiveMAE have been released.
OpticalRS-4M
The dataset is available on 🤗HuggingFace via OpticalRS-4M. Unzip it with one of the following:
# if 7z is available
7z x OpticalRS-4M.zip
# if zip and unzip are available
zip -s 0 OpticalRS-4M.zip --out whole.zip
unzip whole.zip
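As a quick sanity check after extraction, you can count the unpacked image files; the folder name and file extensions below are assumptions, so adjust them to the actual archive layout.

# Count extracted image files (folder name and extensions are assumptions; adjust as needed)
find OpticalRS-4M -type f \( -iname '*.jpg' -o -iname '*.png' -o -iname '*.tif' \) | wc -l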
OpticalRS-4M offers a significantly larger and more diverse image set than previous datasets. To evaluate its effectiveness, we pre-train a ViT-Base model using the vanilla MAE method and compare against pre-training on the MillionAID dataset, keeping the total number of training samples seen equal: 800 epochs on MillionAID's 1 million images versus 200 epochs on the 4 million images of OpticalRS-4M, i.e., roughly 800 million image views in either case.
Scene classification (AID, RESISC-45), object detection (DIOR, DIOR-R), and semantic segmentation (LoveDA, SpaceNetv1) results are reported below.

Dataset | Pretrained model | Number of Images | Epochs | AID OA (TR=20%/50%) | RESISC-45 OA (TR=20%/50%) | DIOR mAP50 | DIOR-R mAP50 | LoveDA mIoU | SpaceNetv1 mF1
---|---|---|---|---|---|---|---|---|---
MillionAID | Weights | 1 million | 800 | 94.92/97.38 | 89.20/93.60 | 71.80 | 62.33 | 51.24 | 79.24 |
OpticalRS-4M | Weights | 2 million | 400 | 96.64/98.10 | 91.80/94.31 | 73.90 | 65.95 | 52.86 | 79.37 |
OpticalRS-4M | Weights | 3 million | 267 | 96.67/98.18 | 92.24/94.41 | 75.40 | 67.07 | 52.39 | 79.37 |
OpticalRS-4M | Weights | 4 million | 200 | 96.10/98.03 | 92.38/94.30 | 74.70 | 66.26 | 52.75 | 79.23 |
OpticalRS-4M | Weights | 4 million | 800 | 96.88/98.22 | 92.44/94.43 | 75.40 | 67.35 | 52.80 | 79.41 |
OpticalRS-13M
The dataset is available on 🤗HuggingFace via OpticalRS-13M. Follow the same unzip steps as for OpticalRS-4M.
Please install the pretraining dependencies listed in `SelectiveMAE/requirements.txt`:
# Optionally create a conda environment
conda create -n selectivemae python=3.10 -y
conda activate selectivemae
# Install dependencies
pip install -r requirements.txt
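Before launching distributed pre-training, it can help to confirm that the CUDA build of PyTorch sees your GPUs; the following optional check prints the PyTorch version and the number of visible devices.

# Optional sanity check before launching torchrun
python -c "import torch; print(torch.__version__, torch.cuda.is_available(), torch.cuda.device_count())"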
To pre-train ViT-Base, run the following on 8 GPUs:
torchrun --nproc_per_node=8 --nnodes 1 --master_port 16666 main_pretrain.py \
    --batch_size 256 --selectivemae --dataset opticalrs-4m --dataset_path 'your_dataset_path' \
    --model mae_vit_base_patch16 --output_dir output --norm_pix_loss \
    --blr 1.5e-4 --weight_decay 0.05 --num_workers 12 --decoder_depth 12 \
    --mask_ratio 0.85 --kept_mask_ratio 0.25 --epochs 800 --warmup_epochs 30
First, download the corresponding dataset, set `--dataset` to `opticalrs-4m` or `opticalrs-13m`, and update `--dataset_path` accordingly. To train ViT-Small or ViT-Large, set `--model mae_vit_small_patch16` or `--model mae_vit_large_patch16`. If the batch size does not fit into GPU memory, use `--accum_iter` to perform gradient accumulation. FlashAttention 2 should be installed with `pip install flash-attn --no-build-isolation`.
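For example, if a per-GPU batch of 256 does not fit, you can halve `--batch_size` and set `--accum_iter 2` so the effective batch size (batch size × accumulation steps × GPUs) stays the same; the values below are only an illustration, with all other arguments unchanged from the command above.

# Illustrative only: keep the effective batch size at 128 x 2 x 8 = 2048 via gradient accumulation
torchrun --nproc_per_node=8 --nnodes 1 --master_port 16666 main_pretrain.py \
    --batch_size 128 --accum_iter 2 --selectivemae --dataset opticalrs-4m --dataset_path 'your_dataset_path' \
    --model mae_vit_base_patch16 --output_dir output --norm_pix_loss \
    --blr 1.5e-4 --weight_decay 0.05 --num_workers 12 --decoder_depth 12 \
    --mask_ratio 0.85 --kept_mask_ratio 0.25 --epochs 800 --warmup_epochs 30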
Model | Publication | Backbone | AID OA (TR=20%/50%) | RESISC-45 OA (TR=20%/50%) | DIOR mAP50 | DIOR-R mAP50 | LoveDA mIoU | SpaceNetv1 mF1
---|---|---|---|---|---|---|---|---
SeCo | ICCV'21 | ResNet-50 | 93.47/95.99 | 89.64/92.91 | - | - | 43.63 | 77.09 |
GASSL | ICCV'21 | ResNet-50 | 93.55/95.92 | 90.86/93.06 | 67.40 | 65.65 | 48.76 | 78.51 |
TOV | JSTARS'23 | ResNet-50 | 95.16/97.09 | 90.97/93.79 | 70.16 | 66.33 | 49.70 | - |
CACo | CVPR'23 | ResNet-50 | 90.88/95.05 | 88.28/91.94 | 66.91 | 64.10 | 48.89 | 77.94 |
SatMAE | NIPS'22 | ViT-L | 95.02/96.94 | 91.72/94.10 | 70.89 | 65.66 | - | 78.07 |
ScaleMAE | ICCV'23 | ViT-L | 96.44/97.58 | 92.63/95.04 | 73.81 | 66.47 | - | - |
SSL4EO | GRSM'23 | ViT-S | 91.06/94.74 | 87.60/91.27 | 64.82 | 61.23 | - | - |
RingMo | TGRS'22 | Swin-B | 96.90/98.34 | 94.25/95.67 | 75.90 | - | - | - |
SatLas | ICCV'23 | Swin-B | 94.96/97.38 | 92.16/94.70 | 74.10 | 67.59 | - | - |
GFM | ICCV'23 | Swin-B | 95.47/97.09 | 92.73/94.64 | 72.84 | 67.67 | - | - |
RVSA | TGRS'23 | ViT-B+RVSA | 97.03/98.50 | 93.93/95.69 | 75.80 | 68.06 | 51.95 | - |
SelectiveMAE(OpticalRS-4M) | Baidu & HuggingFace | ViT-B | 96.90/98.12 | 93.35/94.58 | 75.70 | 67.78 | 53.05 | 79.50 |
SelectiveMAE(OpticalRS-4M) | Baidu & HuggingFace | ViT-L | 97.25/98.48 | 94.57/95.77 | 77.80 | 70.31 | 54.31 | 79.46 |
SelectiveMAE(OpticalRS-13M) | Baidu & HuggingFace | ViT-B | 97.10/98.28 | 93.70/95.48 | 75.80 | 67.69 | 52.68 | 79.44 |
SelectiveMAE(OpticalRS-13M) | Baidu & HuggingFace | ViT-L | 97.49/98.52 | 94.73/96.36 | 78.70 | 71.75 | 53.92 | 79.48 |
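To fine-tune from one of the released checkpoints, you can fetch it from the HuggingFace Hub; the repository id below is a placeholder, so substitute the actual model repo linked in the table above.

# Download a released checkpoint (repo id is a placeholder; use the HuggingFace links above)
pip install -U huggingface_hub
huggingface-cli download <hf_repo_id> --local-dir ./checkpoints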
We evaluate the pretrained SelectiveMAE weights on the AID and NWPU-RESISC45 datasets under different training ratios (TR). Configuration files for both datasets are provided under the `Classification` folder; use the files in the `Classification/AID` and `Classification/NWPU-RESISC45` subfolders to run experiments with the corresponding TR settings.
The classification experiments can be completed on a single GPU; run the following:
# Note: main_finetune.py is an assumed entry point; replace it with the repo's classification training script if it differs
torchrun --nproc_per_node=1 --nnodes 1 --master_port 1888 main_finetune.py \
    --dataset 'aid' --model 'vit_base_patch16' --postfix 'sota' \
    --batch_size 1024 --epochs 100 --warmup_epochs 5 \
    --blr 1e-3 --weight_decay 0.05 --split 19 --tag 0 --exp_num 1 \
    --data_path 'your_dataset_path' \
    --finetune 'your_checkpoint.pth'
The `--split` argument can be changed to match your setting, e.g., 19, 28, or 55.
Training on DIOR using Faster R-CNN with a SelectiveMAE-pretrained ViT-L backbone:
srun -J mmdet -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/train.py configs/wfx/vit-l-frcn-800-proposed-dior.py \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/dior/vit-l-frcn-800-proposed-dior \
--launcher="slurm"
Then run testing and generate detection results:
srun -J mmdet -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/test.py configs/wfx/vit-l-frcn-800-proposed-dior.py \
/diwang/work_dir/wfx_iccv/finetune/dior/vit-l-frcn-800-proposed-dior/epoch_12.pth \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/dior/vit-l-frcn-800-proposed-dior/predict \
--show-dir=/diwang/work_dir/wfx_iccv/finetune/dior/vit-l-frcn-800-proposed-dior/predict/show \
--launcher="slurm" --cfg-options val_cfg=None val_dataloader=None val_evaluator=None
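If you are not running on a Slurm cluster, MMDetection's standard distributed launch scripts should also work; the commands below are a sketch that assumes the stock `tools/dist_train.sh` and `tools/dist_test.sh` scripts are kept in this codebase.

# Non-Slurm alternative (assumes the stock MMDetection dist scripts; adjust GPU count and paths)
bash tools/dist_train.sh configs/wfx/vit-l-frcn-800-proposed-dior.py 4 --work-dir work_dirs/vit-l-frcn-800-proposed-dior
bash tools/dist_test.sh configs/wfx/vit-l-frcn-800-proposed-dior.py work_dirs/vit-l-frcn-800-proposed-dior/epoch_12.pth 4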
(Using MMRotate 1.0.0rc1) Training on DIOR-R using Oriented R-CNN with a SelectiveMAE-pretrained ViT-L backbone:
srun -J mmrot -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/train.py configs/wfx/vit-l-orcn-800-proposed-diorr.py \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/diorr/vit-l-orcn-800-proposed-diorr \
--launcher="slurm"
(Using MMRotate 1.0.0rc1) Testing on DIOR-R for evaluation and visualization of detection maps:
srun -J mmrot -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/test.py configs/wfx/vit-l-orcn-800-proposed-diorr.py \
/diwang/work_dir/wfx_iccv/finetune/diorr/vit-l-orcn-800-proposed-diorr/epoch_12.pth \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/diorr/vit-l-orcn-800-proposed-diorr/predict \
--show-dir=/diwang/work_dir/wfx_iccv/finetune/diorr/vit-l-orcn-800-proposed-diorr/predict/show \
--launcher="slurm" --cfg-options val_cfg=None val_dataloader=None val_evaluator=None
Training on SpaceNetv1 using UperNet with a SelectiveMAE-pretrained ViT-L backbone:
srun -J mmseg -p gpu --gres=dcu:4 --ntasks=8 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/train.py configs/wfx/vit-l-upernet-384-proposed-spacenetv1.py \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/spacenetv1/vit-l-upernet-384-proposed-spacenetv1 \
--launcher="slurm" --cfg-options 'find_unused_parameters'=True
Testing on SpaceNetv1 for accuracy evaluation and generating prediction maps:
srun -J mmseg -p gpu --gres=dcu:4 --ntasks=8 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/test.py configs/wfx/vit-l-upernet-384-proposed-spacenetv1.py \
/diwang/work_dir/wfx_iccv/finetune/spacenetv1/vit-l-upernet-384-proposed-spacenetv1/iter_80000.pth \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/spacenetv1/vit-l-upernet-384-proposed-spacenetv1/predict \
--show-dir=/diwang/work_dir/wfx_iccv/finetune/spacenetv1/vit-l-upernet-384-proposed-spacenetv1/predict/show \
--launcher="slurm" --cfg-options val_cfg=None val_dataloader=None val_evaluator=None
Online evaluation: testing on LoveDA to generate prediction maps and results for submission to the online evaluation server:
srun -J mmseg -p gpu --gres=dcu:4 --ntasks=4 --ntasks-per-node=4 --cpus-per-task=8 --kill-on-bad-exit=1 \
python -u tools/test.py configs/wfx/vit-l-upernet-512-proposed-loveda.py \
/diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/iter_80000.pth \
--work-dir=/diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/predict \
--out=/diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/predict/submit \
--show-dir=/diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/predict/show \
--launcher="slurm" --cfg-options val_cfg=None val_dataloader=None val_evaluator=None
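The LoveDA benchmark is scored on an online server, so the predictions written to the `--out` directory above need to be packaged before uploading; the archive name below is arbitrary, and the exact file layout expected by the server should be checked on the challenge page.

# Package the dumped predictions for upload (archive name is arbitrary; verify the required layout on the LoveDA challenge page)
cd /diwang/work_dir/wfx_iccv/finetune/loveda/vit-l-upernet-512-proposed-loveda/predict/submit
zip -r ../loveda_submission.zip .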
If you find SelectiveMAE helpful, please consider citing:
@article{selectivemae,
title={Harnessing Massive Satellite Imagery with Efficient Masked Image Modeling},
author={Fengxiang Wang and Hongzhen Wang and Di Wang and Zonghao Guo and Zhenyu Zhong and Long Lan and Wenjing Yang and Jing Zhang},
year={2025},
journal={arXiv preprint arXiv:2406.11933},
}
This work is released under the Apache License 2.0. Some components of this codebase may be covered by other licenses; please check LICENSE.md carefully, especially if you intend to use our code for commercial purposes.