Why: We need memory-efficient vision transformers (both vanilla ViT and SWIN v2) for LAION projects. These models are also generic enough to be useful in future projects.
- a simple-but-working version of ViT can be found here: https://raw.githubusercontent.com/learning-at-home/clip_hivemind/clip_demo/clip.py
- we need to make it compatible with the Hugging Face API (e.g. feature_extractor, masks, etc.); see the usage sketch after this list
- reference ViT: https://github.com/huggingface/transformers/blob/198c335d219a5eb4d3f124fdd1ce1a9cd9f78a9b/src/transformers/models/vit/modeling_vit.py
- reference SWIN: https://github.com/huggingface/transformers/blob/198c335d219a5eb4d3f124fdd1ce1a9cd9f78a9b/src/transformers/models/swin/modeling_swin.py
- it would be great to also support the model variant used for SimMIM pretraining
- add a test that these models can be instantiated, run forward and backward passes, and that all parameters receive gradients (a sketch of such a test is given after this list)
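A minimal sketch of what "compatible with the Hugging Face API" could look like from the caller's side. It uses the existing `ViTFeatureExtractor` / `ViTModel` from `transformers` (with the stock `google/vit-base-patch16-224-in21k` checkpoint from their docs) purely as a stand-in; the new memory-efficient classes are assumed to accept the same inputs and return the same output dataclasses.

```python
import requests
import torch
from PIL import Image
from transformers import ViTFeatureExtractor, ViTModel

# Any RGB image works; this is the standard example image from the transformers docs.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The memory-efficient model is expected to plug into the same pipeline:
# feature_extractor -> pixel_values -> model(**inputs) -> ModelOutput.
feature_extractor = ViTFeatureExtractor.from_pretrained("google/vit-base-patch16-224-in21k")
model = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

inputs = feature_extractor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # torch.Size([1, 197, 768])
```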
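A rough sketch of the requested test, assuming pytest and a tiny config so it runs on CPU in seconds. It is written against the reference `ViTModel` because the memory-efficient classes do not exist yet; the same assertions should hold for them (and for the SWIN variant) once implemented.

```python
import torch
from transformers import ViTConfig, ViTModel  # placeholder: swap in the memory-efficient classes


def test_forward_backward_and_gradients():
    # Tiny config so the test is fast on CPU.
    config = ViTConfig(
        hidden_size=32,
        num_hidden_layers=2,
        num_attention_heads=2,
        intermediate_size=64,
        image_size=32,
        patch_size=8,
    )
    # add_pooling_layer=False so that every remaining parameter participates
    # in the loss below and therefore must receive a gradient.
    model = ViTModel(config, add_pooling_layer=False)

    pixel_values = torch.randn(2, 3, config.image_size, config.image_size)
    outputs = model(pixel_values=pixel_values)

    # Backward pass through a scalar reduction of the hidden states.
    outputs.last_hidden_state.sum().backward()

    # Every trainable parameter must have received a gradient.
    for name, param in model.named_parameters():
        assert param.grad is not None, f"no gradient for {name}"
```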