Adding Vision Transformer to torchvision/models #4593

Closed
@yiwen-song

Description

🚀 The feature

  1. Adding ViT architecture from this paper: "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale"
  2. Adding DeiT architecture from this paper: "Training data-efficient image transformers & distillation through attention"
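
As a rough illustration of the "16x16 words" idea in the ViT paper title, the sketch below computes how many patch tokens a square image yields and how large each flattened patch is. The 224x224 RGB input and the helper name `patch_sequence_shape` are assumptions made for illustration; they are not part of torchvision.

```python
from typing import Tuple


def patch_sequence_shape(image_size: int, patch_size: int, channels: int = 3) -> Tuple[int, int]:
    """Return (num_patches, flattened_patch_dim) for a square image
    split into non-overlapping square patches.

    Hypothetical helper for illustration only.
    """
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    num_patches = patches_per_side ** 2                # sequence length before any class token
    patch_dim = channels * patch_size * patch_size     # values per flattened patch
    return num_patches, patch_dim


# Assumed input: a 224x224 RGB image with 16x16 patches.
num_patches, patch_dim = patch_sequence_shape(224, 16)
print(num_patches, patch_dim)  # 196 768
```

Each of the 196 patches would then be linearly projected to the model's hidden size before entering the transformer encoder.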

@fmassa @datumbox @mannatsingh @kazhang

Motivation, pitch

Vision Transformer models should exist in the torchvision repo because they are good models :)

I'm currently working on this project.

Additional context

We can also consider adding some techniques from the following papers ^^
For example, adding a Conv stem for ViT; see details in "Early Convolutions Help Transformers See Better".
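
To give a feel for why a Conv stem can stand in for the patchify step, here is a minimal sketch that compares the token grid produced by a single stride-16 patchify convolution with the grid produced by a stack of small stride-2 convolutions. The 3x3 kernel, padding of 1, four-layer depth, and all function names are assumptions for illustration; the exact stem design in the paper may differ.

```python
def conv_out_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output-size formula for a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def patchify_grid(size: int = 224, patch: int = 16) -> int:
    """Grid side length when patches are cut by one conv with kernel == stride == patch."""
    return conv_out_size(size, kernel=patch, stride=patch, padding=0)


def conv_stem_grid(size: int = 224, num_stride2_layers: int = 4) -> int:
    """Grid side length after a stack of 3x3, stride-2, padding-1 convs (assumed design)."""
    for _ in range(num_stride2_layers):
        size = conv_out_size(size, kernel=3, stride=2, padding=1)
    return size


# Both stems reduce a 224x224 input to the same 14x14 grid of tokens.
print(patchify_grid(), conv_stem_grid())  # 14 14
```

Because the two grids match, such a stem could in principle be swapped in without changing the sequence length seen by the transformer blocks.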

References:
https://github.com/google-research/vision_transformer
https://github.com/facebookresearch/deit
https://github.com/facebookresearch/ClassyVision/blob/main/classy_vision/models/vision_transformer.py

cc @datumbox
