Enable custom samplers for imbalanced datasets #8093

Open
@PierreQuinton

Description

🚀 The feature

For each classification dataset whose classes are balanced (MNIST, CIFAR-N, etc.), it would be very useful to provide a standard imbalanced version of the dataset. For a dataset with $n$ classes, define the imbalance factor $a\in [0,1]$. The proportion of class $i$ (with classes indexed $i = 0, \dots, n-1$) is then typically proportional to $a^{i/(n-1)}$, normalized so that the proportions sum to $1$. With this indexing, the rarest class is $a$ times as frequent as the most common one. For $a=1$ the distribution is uniform, and the smaller the imbalance factor, the more imbalanced the dataset.
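To make the formula concrete, here is a minimal sketch of how the class proportions could be computed. The function name `imbalance_proportions` is hypothetical (it is not part of torchvision), and the sketch assumes classes are indexed from $0$ to $n-1$ and that $n \geq 2$.

```python
def imbalance_proportions(num_classes: int, a: float) -> list[float]:
    """Return class proportions p_i proportional to a ** (i / (n - 1)).

    Hypothetical helper for illustration; assumes classes 0..n-1.
    """
    if num_classes < 2:
        raise ValueError("need at least two classes")
    if not 0.0 <= a <= 1.0:
        raise ValueError("imbalance factor a must lie in [0, 1]")
    # Unnormalized weights: 1 for class 0, decaying to a for class n-1.
    raw = [a ** (i / (num_classes - 1)) for i in range(num_classes)]
    total = sum(raw)
    # Normalize so that the proportions sum to 1.
    return [w / total for w in raw]


if __name__ == "__main__":
    for a in (1.0, 0.1, 0.01):
        props = imbalance_proportions(10, a)
        print(a, [round(p, 3) for p in props])
```

With $a=1$ every class gets the same proportion, and for smaller $a$ the last class shrinks to $a$ times the share of the first one.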

I am not sure whether torchvision should provide the datasets themselves or provide a data loader that imbalances an existing dataset.
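As a rough illustration of the second option, the sketch below selects a subset of sample indices from a balanced label list so that the class sizes follow the $a^{i/(n-1)}$ profile. The function name `imbalanced_indices` and its parameters are hypothetical. It assumes $a \in (0, 1]$, at least two classes, sortable labels, and that class 0 keeps as many samples as the smallest class in the original data. Keeping at least one sample per class is a design choice of this sketch, not something the proposal specifies.

```python
import random
from collections import defaultdict


def imbalanced_indices(labels, a, seed=0):
    """Pick indices so class sizes decay like a ** (rank / (n - 1)).

    Hypothetical helper for illustration; `labels` is a sequence of
    class labels, one per sample.
    """
    if not 0.0 < a <= 1.0:
        raise ValueError("this sketch assumes a in (0, 1]")
    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    classes = sorted(by_class)
    n = len(classes)
    if n < 2:
        raise ValueError("need at least two classes")
    # The largest class in the result is capped by the smallest original class.
    base = min(len(by_class[c]) for c in classes)
    rng = random.Random(seed)
    keep = []
    for rank, c in enumerate(classes):
        # Target size for this class; at least one sample is kept.
        target = max(1, int(base * a ** (rank / (n - 1))))
        keep.extend(rng.sample(by_class[c], target))
    return sorted(keep)


if __name__ == "__main__":
    labels = [0] * 100 + [1] * 100 + [2] * 100
    subset = imbalanced_indices(labels, a=0.25)
    print(len(subset))
```

The resulting index list could then be passed to something like `torch.utils.data.Subset(dataset, indices)` to obtain an imbalanced training set while leaving the balanced test set untouched.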

Motivation, pitch

Many papers are published on the problem of training on an imbalanced dataset and testing on a balanced dataset, for instance see this. As far as I know, there is no systematic way of generating such datasets for people using PyTorch. Here are a few very similar implementations that are not fully satisfying:

Such datasets seem to exist for TensorFlow; for instance, section 3 of the README of this repo provides links to download tfrecord datasets.

I feel it would be a very nice feature for torchvision to either contain such datasets or make it easy to craft them.

Alternatives

No response

Additional context

No response

cc @pmeier
