Description
We should add unit tests for datasets. This will make the codebase more reliable, and I expect that adding new datasets to torchvision will become a more straightforward procedure for reviewers, as they will be able to rely on CI for much of the code checking.
We need to define how to structure the tests for the datasets so that we avoid downloading the full datasets, which would probably make the CI flaky due to network issues.
Here are a few alternatives:
1 - Add dummy test files for each dataset
Every dataset that has a `download` option should use a (newly introduced) function that performs the downloading and the extraction. The dummy test files should have the same folder structure as the original files after extraction. This way, any postprocessing of the extracted files can also be covered by the testing suite.
The dummy files would be fake images (e.g., of 2x3 pixels), which enables testing both the `__init__` and the `__getitem__` of the dataset in a straightforward way; these are the only functions that we guarantee should be implemented for datasets.
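To make this concrete, here is a minimal sketch of what such a test could look like. It uses `ImageFolder` only because it needs no download, and it generates the fake 2x3 images into a temporary directory just to keep the example self-contained; in the actual proposal the fake files would be committed to the repo and a real downloadable dataset would be exercised.

```python
import os
import tempfile
import unittest

from PIL import Image

import torchvision


class DummyFilesDatasetTest(unittest.TestCase):
    def test_init_and_getitem(self):
        # In the actual proposal the fake files would be committed to the
        # repo; here they are generated into a temp dir only to keep the
        # sketch self-contained.
        with tempfile.TemporaryDirectory() as root:
            for cls in ("cat", "dog"):
                os.makedirs(os.path.join(root, cls))
                for i in range(2):
                    # tiny 2x3-pixel fake image, as suggested above
                    Image.new("RGB", (2, 3)).save(
                        os.path.join(root, cls, "img_{}.png".format(i))
                    )

            dataset = torchvision.datasets.ImageFolder(root)

            # __init__ worked and indexed all the fake samples
            self.assertEqual(len(dataset), 4)
            # __getitem__ returns (PIL image, class index)
            img, target = dataset[0]
            self.assertEqual(img.size, (2, 3))
            self.assertIn(target, (0, 1))


if __name__ == "__main__":
    unittest.main()
```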
Drawback
While this is a very flexible alternative, adding a new dataset becomes an extra burden, as the contributor now also needs to create a set of fake data that has to be committed into the repo.
2 - Add an option to load images / data from a buffer
There is a common pattern in torchvision datasets, sketched after the list below:
1 - download / extract the original dataset
2 - get a mapping of image_id -> image_path / class_id / etc
3 - load image (either from memory or from the image_path) and annotations
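For illustration, this pattern roughly corresponds to the skeleton below; `MyDataset` and its helper methods are placeholders for this sketch, not existing torchvision code.

```python
from PIL import Image
from torch.utils.data import Dataset


class MyDataset(Dataset):
    """Skeleton of the common dataset pattern (names are placeholders)."""

    def __init__(self, root, download=False):
        self.root = root
        if download:
            # step 1: download / extract the original dataset
            self._download_and_extract()
        # step 2: build a mapping idx -> (image_path, class_id)
        self.samples = self._make_index()

    def __getitem__(self, idx):
        # step 3: load the image and its annotations
        path, class_id = self.samples[idx]
        img = Image.open(path).convert("RGB")
        return img, class_id

    def __len__(self):
        return len(self.samples)

    def _download_and_extract(self):
        ...  # dataset-specific

    def _make_index(self):
        ...  # dataset-specific; returns a list of (path, class_id)
```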
The `download` / `extract` parts should be factored out into a common function that handles zip / tar / etc.
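A rough sketch of what such a shared helper could look like (the name `download_and_extract` and its signature are hypothetical):

```python
import os
import tarfile
import urllib.request
import zipfile


def download_and_extract(url, download_root, extract_root=None, filename=None):
    """Hypothetical shared helper: fetch an archive and unpack it.

    Centralizing this makes it possible to test it once and to mock it
    out in the dataset tests.
    """
    if extract_root is None:
        extract_root = download_root
    if filename is None:
        filename = os.path.basename(url)

    os.makedirs(download_root, exist_ok=True)
    archive = os.path.join(download_root, filename)
    if not os.path.exists(archive):
        urllib.request.urlretrieve(url, archive)

    if zipfile.is_zipfile(archive):
        with zipfile.ZipFile(archive) as f:
            f.extractall(extract_root)
    elif tarfile.is_tarfile(archive):
        with tarfile.open(archive) as f:
            f.extractall(extract_root)
    else:
        raise ValueError("Unsupported archive format: {}".format(archive))
```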
If we add an extra method to the datasets, say `get_paths_for_idx`, that gives the mapping `image_id -> image_path / class_id / etc`, we could then patch this method during testing so that instead of returning an `image_path`, it returns a file-like object, which can be read by PIL, `scipy.io.loadmat`, `xml.etree.ElementTree` and other Python libs without problems.
This will require that we specify somewhere the type of values that `get_paths_for_idx` returns, so that we know how to generate random data that fits what the dataset expects.
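As a sketch of the testing side, with `SomeDataset` and `get_paths_for_idx` being the hypothetical pieces described above, a test could patch the method so that it hands back in-memory buffers instead of paths:

```python
import io
import unittest
from unittest import mock

from PIL import Image


class SomeDataset:
    """Hypothetical dataset following the proposal: __getitem__ only talks
    to get_paths_for_idx, so tests can patch that single method."""

    def __init__(self, root):
        self.root = root

    def get_paths_for_idx(self, idx):
        # Normally returns (image_path, class_id); patched in tests.
        raise NotImplementedError

    def __getitem__(self, idx):
        path_or_buffer, class_id = self.get_paths_for_idx(idx)
        img = Image.open(path_or_buffer).convert("RGB")
        return img, class_id


def fake_image_buffer():
    # In-memory PNG that PIL reads just like a file on disk
    buf = io.BytesIO()
    Image.new("RGB", (2, 3)).save(buf, format="PNG")
    buf.seek(0)
    return buf


class BufferBackedDatasetTest(unittest.TestCase):
    def test_getitem_with_patched_paths(self):
        with mock.patch.object(
            SomeDataset, "get_paths_for_idx",
            return_value=(fake_image_buffer(), 0),
        ):
            dataset = SomeDataset(root="unused")
            img, target = dataset[0]
            self.assertEqual(img.size, (2, 3))
            self.assertEqual(target, 0)


if __name__ == "__main__":
    unittest.main()
```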
Drawback
We might end up adding extra complexity when defining a dataset, and we might not even test all possible codepaths.
3 - Factor out dataset-specific functions and test only those
This would be indirect testing, where we would split the dataset into standalone functions and test each function with some dummy inputs.
For example, for the helper functions in `torchvision/datasets/mnist.py` (lines 311 to 328 at commit `d4a126b`), we could generate a dummy object on the fly that exercises those functions, without testing all of MNIST itself.
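For instance, a test along these lines could write a tiny fake IDX-format file and feed it to the MNIST helper. This is only a sketch, assuming a helper such as `read_image_file(path)` that parses the raw file into an `(N, rows, cols)` uint8 tensor; the exact names and signatures at that commit may differ.

```python
import os
import struct
import tempfile
import unittest

import torch
from torchvision.datasets.mnist import read_image_file


class MNISTHelpersTest(unittest.TestCase):
    def test_read_image_file(self):
        # Fake IDX file: magic 2051, 2 images of 4x5 pixels, big-endian
        # header followed by the raw uint8 pixel data.
        header = struct.pack(">iiii", 2051, 2, 4, 5)
        pixels = bytes(range(2 * 4 * 5))
        with tempfile.TemporaryDirectory() as root:
            path = os.path.join(root, "t10k-images-idx3-ubyte")
            with open(path, "wb") as f:
                f.write(header + pixels)

            data = read_image_file(path)

            self.assertEqual(data.dtype, torch.uint8)
            self.assertEqual(tuple(data.shape), (2, 4, 5))


if __name__ == "__main__":
    unittest.main()
```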
Drawback
We wouldn't have end-to-end tests, but would only be testing parts of the functionality.
Wrapping up
There are other potential ways of implementing unit tests for the datasets that I may be missing in this list.
One thing that is pretty clear to me is that we should provide a tested function that performs the `download` + `extract` for zip / tar / etc., and use it everywhere in the codebase.
I'm currently inclined towards approach number 1, even though it will involve adding a number of (dummy and very small) files to the repo. It's the only one that tests end-to-end all the functionality required by the dataset API, and is thus the most robust option IMO.
Thoughts?