Description
🚀 The feature
- Add the ViT architecture from the paper "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale" (a minimal sketch of the core blocks is included below).
- Add the DeiT architecture from the paper "Training data-efficient image transformers & distillation through attention".
@fmassa @datumbox @mannatsingh @kazhang
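For concreteness, here is a minimal sketch of the core ViT building blocks (patch embedding via a strided conv, class token, learned positional embedding, Transformer encoder). The class name and hyperparameters below are illustrative only, not a proposal for the final torchvision API:

```python
import torch
import torch.nn as nn


class SimpleViT(nn.Module):
    # Illustrative sketch of ViT-B/16-style defaults; not the proposed torchvision interface.
    def __init__(self, image_size=224, patch_size=16, hidden_dim=768,
                 num_layers=12, num_heads=12, mlp_dim=3072, num_classes=1000):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        # Patch embedding: split the image into patch_size x patch_size patches via a strided conv.
        self.patch_embed = nn.Conv2d(3, hidden_dim, kernel_size=patch_size, stride=patch_size)
        self.class_token = nn.Parameter(torch.zeros(1, 1, hidden_dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, hidden_dim))
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=hidden_dim, nhead=num_heads, dim_feedforward=mlp_dim,
            activation="gelu", batch_first=True, norm_first=True)
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, x):
        # (B, 3, H, W) -> (B, hidden_dim, H/16, W/16) -> (B, num_patches, hidden_dim)
        x = self.patch_embed(x).flatten(2).transpose(1, 2)
        cls = self.class_token.expand(x.shape[0], -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed
        x = self.encoder(x)
        # Classify from the class token.
        return self.head(x[:, 0])


model = SimpleViT()
logits = model(torch.randn(1, 3, 224, 224))  # -> (1, 1000)
```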
Motivation, pitch
Vision Transformer models should be available in the torchvision repo because they are strong, widely used image classification models :)
I'm currently working on this.
Additional context
We can also consider adding some techniques from follow-up papers ^^
For example, adding a convolutional stem to ViT; see "Early Convolutions Help Transformers See Better" for details (a rough sketch follows below).
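A hedged sketch of the conv-stem idea, assuming the stride-two 3x3 stack followed by a 1x1 projection described in that paper; the channel widths here are placeholders, not the values used in the paper:

```python
import torch
import torch.nn as nn


def conv_stem(hidden_dim=768):
    # Replace the single stride-16 patchify conv with a small stack of
    # stride-2 3x3 convs that reaches the same 16x downsampling,
    # then project to the token dimension with a 1x1 conv.
    layers = []
    in_ch = 3
    for out_ch in (64, 128, 256, 512):  # four stride-2 convs -> 16x downsample; widths are illustrative
        layers += [nn.Conv2d(in_ch, out_ch, 3, stride=2, padding=1, bias=False),
                   nn.BatchNorm2d(out_ch),
                   nn.ReLU(inplace=True)]
        in_ch = out_ch
    layers.append(nn.Conv2d(in_ch, hidden_dim, kernel_size=1))
    return nn.Sequential(*layers)


stem = conv_stem()
feats = stem(torch.randn(1, 3, 224, 224))  # -> (1, 768, 14, 14), same token grid as 16x16 patchify
```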
References:
- https://github.com/google-research/vision_transformer
- https://github.com/facebookresearch/deit
- https://github.com/facebookresearch/ClassyVision/blob/main/classy_vision/models/vision_transformer.py
cc @datumbox