add prototype imagenet dataset #4640

Merged (15 commits) on Oct 21, 2021
Conversation

Collaborator

@pmeier pmeier commented Oct 18, 2021

Comment on lines +39 to +53
if config.split == "train":
    images = HttpResource(
        "ILSVRC2012_img_train.tar",
        sha256="b08200a27a8e34218a0e58fde36b0fe8f73bc377f4acea2d91602057c3ca45bb",
    )
else:  # config.split == "val"
    images = HttpResource(
        "ILSVRC2012_img_val.tar",
        sha256="c7e06a6c0baccf06d8dbeb6577d71efff84673a5dbdd50633ab44f8ea0456ae0",
    )

devkit = HttpResource(
    "ILSVRC2012_devkit_t12.tar.gz",
    sha256="b59243268c0d266621fd587d2018f69e906fb22875aca0e295b48cafaa927953",
)
Collaborator Author

Although these files are not publicly accessible anymore, we can get away with it for now, since our download functionality is a no-op. I'll refactor this to include manual download instructions as soon as the torchdata download API is stable-ish.
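
Until manual download instructions exist, one way to work with archives fetched by hand is to check them against the SHA-256 values listed in the diff. The following is a minimal sketch using only the standard library. The helper names `sha256_of` and `verify_archive` are hypothetical and are not part of torchvision or torchdata.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as file:
        # Chunked reads keep memory use flat even for very large archives.
        for chunk in iter(lambda: file.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_archive(path: Path, expected_sha256: str) -> None:
    """Raise if the file is missing or its checksum does not match.

    Hypothetical helper for manually downloaded files.
    """
    if not path.is_file():
        raise FileNotFoundError(
            f"{path} not found; download it manually and place it there."
        )
    actual = sha256_of(path)
    if actual != expected_sha256:
        raise RuntimeError(
            f"Checksum mismatch for {path.name}: "
            f"expected {expected_sha256}, got {actual}"
        )
```

For example, assuming the devkit archive was placed in a local directory of your choosing, `verify_archive(root / "ILSVRC2012_devkit_t12.tar.gz", "b59243268c0d266621fd587d2018f69e906fb22875aca0e295b48cafaa927953")` would pass silently on a matching file and raise otherwise.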

Member

@fmassa fmassa left a comment

Thanks!

path, buffer = image_data

category = self.categories[label]
label = torch.tensor(label)
Member

Nit for the future: we might want to revisit whether to store numbers as a 0-d tensor or as a raw number. Relying on the raw Python number is generally much faster and smaller, which might be a good thing if we want to minimize transfer / storage.

Collaborator Author

IIRC, we wanted to return the labels as custom tensors, which hold the category string besides the numerical label. With that in mind, I've already wrapped the labels in torch.tensor. Other numbers are left as is.

Comment on lines +50 to +56
@property
def category_to_wnid(self) -> Dict[str, str]:
    return self.info.extra.category_to_wnid

@property
def wnid_to_category(self) -> Dict[str, str]:
    return self.info.extra.wnid_to_category
Collaborator Author

We need to cache self.info, because otherwise we parse the categories file in every single step. I'll send a follow-up PR, because this also affects all the other datasets.
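
The caching idea in this comment can be illustrated with `functools.cached_property`, which computes a value once per instance and reuses it afterwards. This is only a sketch of the general technique, not the follow-up PR's actual change. `ToyDataset`, `_parse_categories`, and the two-entry mapping are hypothetical stand-ins.

```python
from functools import cached_property
from typing import Dict


class ToyDataset:
    """Hypothetical stand-in for a dataset class; not torchvision's code."""

    def __init__(self) -> None:
        # Counts how often the (simulated) categories file gets parsed.
        self.parse_calls = 0

    def _parse_categories(self) -> Dict[str, str]:
        # Stands in for reading and parsing a categories file from disk.
        self.parse_calls += 1
        return {"n01440764": "tench", "n01443537": "goldfish"}

    @cached_property
    def wnid_to_category(self) -> Dict[str, str]:
        # Evaluated on first access only; later accesses reuse the result.
        return self._parse_categories()

    @property
    def category_to_wnid(self) -> Dict[str, str]:
        # Derived from the cached mapping, so no additional parsing happens.
        return {category: wnid for wnid, category in self.wnid_to_category.items()}
```

With this layout, looking up a mapping in every step of a data pipeline triggers the parse only once per dataset instance. Note that `cached_property` requires Python 3.8 or newer.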

Comment on lines 77 to 78
self.extra = FrozenBunch(extra or dict())

Collaborator Author

This is added to enable each dataset to provide more static information beyond what the default DatasetInfo holds. By default this will be an empty namespace.
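
To make the idea of an immutable `extra` namespace concrete, here is a rough sketch of a read-only mapping with attribute access. It is not the `FrozenBunch` implementation from the PR, whose details are not shown here. The class name `FrozenNamespace` and the sample entry are assumptions made for illustration.

```python
from collections.abc import Mapping
from typing import Any, Dict, Iterator, Optional


class FrozenNamespace(Mapping):
    """Hypothetical read-only mapping whose keys are also attributes."""

    def __init__(self, data: Optional[Dict[str, Any]] = None) -> None:
        # Bypass our own __setattr__ to store a private copy of the data.
        object.__setattr__(self, "_data", dict(data or {}))

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __iter__(self) -> Iterator[str]:
        return iter(self._data)

    def __len__(self) -> int:
        return len(self._data)

    def __getattr__(self, name: str) -> Any:
        # Only called when normal lookup fails; go through __dict__ to
        # avoid recursing if _data has not been set yet.
        data = self.__dict__.get("_data", {})
        try:
            return data[name]
        except KeyError as error:
            raise AttributeError(name) from error

    def __setattr__(self, name: str, value: Any) -> None:
        raise AttributeError("FrozenNamespace is immutable")


# Mirrors the `extra or dict()` pattern from the diff: None becomes empty.
empty = FrozenNamespace(None)
extra = FrozenNamespace({"wnid_to_category": {"n01440764": "tench"}})
```

Here `extra.wnid_to_category` and `extra["wnid_to_category"]` return the same object, while `extra.foo = 1` raises `AttributeError`. The stored values themselves are not deep-frozen in this sketch.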

@pmeier pmeier requested a review from datumbox October 19, 2021 07:00
Collaborator Author

pmeier commented Oct 20, 2021

Prototype test failures are real and will be fixed by #4668.

@pmeier pmeier merged commit 58f313b into pytorch:main Oct 21, 2021
@pmeier pmeier deleted the datasets/imagenet branch October 21, 2021 10:15
facebook-github-bot pushed a commit that referenced this pull request Oct 26, 2021
Summary:
* add prototype imagenet dataset
* add missing checksums
* fix mypy
* add human readable categories
* cleanup
* sort categories ascending based on wnid
* remove accidentally added file
* cleanup category file generation
* fix mypy

Reviewed By: NicolasHug

Differential Revision: D31916331

fbshipit-source-id: 38a598f951923342e488f0188f40c74d5b13108c
cyyever pushed a commit to cyyever/vision that referenced this pull request Nov 16, 2021