Why: We need memory-efficient vision transformers (both vanilla ViT and SWIN v2) for LAION projects. These models are also generic enough to be useful in future projects.
- a simple-but-working version of ViT can be found here: https://raw.githubusercontent.com/learning-at-home/clip_hivemind/clip_demo/clip.py
- we need to make it compatible with the Hugging Face API (e.g. feature_extractor, masks, etc.); see the usage sketch after this list
- reference ViT: https://github.com/huggingface/transformers/blob/198c335d219a5eb4d3f124fdd1ce1a9cd9f78a9b/src/transformers/models/vit/modeling_vit.py
- reference SWIN: https://github.com/huggingface/transformers/blob/198c335d219a5eb4d3f124fdd1ce1a9cd9f78a9b/src/transformers/models/swin/modeling_swin.py
- it would be great to also support the model variant used for SimMIM pretraining
- add a test that these models can be instantiated, run forward and backward passes, and that all parameters receive gradients (a sketch of such a test is given after this list)
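A minimal sketch of what "compatible with the Hugging Face API" could look like from the caller's side. It uses the existing `ViTFeatureExtractor` / `ViTModel` from `transformers` (with the stock `google/vit-base-patch16-224-in21k` checkpoint from their docs) purely as a stand-in; the new memory-efficient classes are assumed to accept the same inputs and return the same output dataclasses.

```python
import requests
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

# Any RGB image works; this is the standard example image from the transformers docs.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The memory-efficient model is expected to plug into the same pipeline:
# feature_extractor -> pixel_values -> model(**inputs) -> ModelOutput.
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 768])
```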
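A rough sketch of the requested test, assuming pytest and a tiny config so it runs on CPU in seconds. It is written against the reference `ViTModel` because the memory-efficient classes do not exist yet; the same assertions should hold for them (and for the SWIN variant) once implemented.

```python
import torch
from transformers import ViTConfig, ViTModel  # placeholder: swap in the memory-efficient classes


def test_forward_backward_and_gradients():
    # Tiny config so the test is fast on CPU.
    config = ViTConfig(
        hidden_size=32,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=64,
        image_size=32,
        patch_size=8,
    )
    # add_pooling_layer=False so that every remaining parameter participates
    # in the loss below and therefore must receive a gradient.
    model = ViTModel(config, add_pooling_layer=False)

    pixel_values = torch.randn(2, 3, config.image_size, config.image_size)
    outputs = model(pixel_values=pixel_values)

    # Backward pass through a scalar reduction of the hidden states.
    outputs.last_hidden_state.sum().backward()

    # Every trainable parameter must have received a gradient.
    for name, param in model.named_parameters():
        assert param.grad is not None, f"no gradient for {name}"
```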