
Commit 641dd66

Add datasets user guide (Azure#37)
1 parent 2929610 commit 641dd66

File tree: 3 files changed (+227, -2 lines)

doc/sphinx/conf.py

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@
     # while the learn.microsoft.com version appears to not host this file.
     'azure-core': ('https://azuresdkdocs.z19.web.core.windows.net/python/azure-core/latest/', None),
     'azure-identity': ('https://azuresdkdocs.z19.web.core.windows.net/python/azure-identity/latest/', None),
+    'pillow': ('https://pillow.readthedocs.io/en/stable/', None),
 }

 autodoc_typehints = 'both'

doc/sphinx/index.rst

Lines changed: 3 additions & 2 deletions
@@ -4,9 +4,10 @@ Azure Storage for PyTorch Documentation
 Azure Storage for PyTorch (``azstoragetorch``) is a library that provides
 seamless, performance-optimized integrations between `Azure Storage`_ and `PyTorch`_.
 Use this library to easily access and store data in Azure Storage while using PyTorch. The
-library currently supports:
+library currently offers:

-* :ref:`Saving and loading PyTorch models (i.e., checkpointing) to and from Azure Blob Storage <checkpoint-guide>`
+* :ref:`File-like object for saving and loading PyTorch models (i.e., checkpointing) with Azure Blob Storage <checkpoint-guide>`
+* :ref:`PyTorch datasets for loading data samples from Azure Blob Storage <datasets-guide>`

 Visit the :ref:`Getting Started <getting-started>` page for more information on how to start using
 Azure Storage for PyTorch.

doc/sphinx/user-guide.rst

Lines changed: 223 additions & 0 deletions
@@ -93,10 +93,233 @@ specify the URL to the blob storing the model weights and use read mode (i.e., `
    model.load_state_dict(torch.load(f))


.. _datasets-guide:

PyTorch Datasets
----------------

PyTorch offers the `Dataset and DataLoader primitives <PyTorch dataset tutorial_>`_ for
loading data samples. ``azstoragetorch`` provides implementations for both types
of PyTorch datasets, `map-style and iterable-style datasets <PyTorch dataset types_>`_,
to load data samples from Azure Blob Storage:

* :py:class:`azstoragetorch.datasets.BlobDataset` - `Map-style dataset <PyTorch dataset map-style_>`_.
  Use this class for random access to data samples. The class eagerly lists samples in
  the dataset on instantiation.

* :py:class:`azstoragetorch.datasets.IterableBlobDataset` - `Iterable-style dataset <PyTorch dataset iterable-style_>`_.
  Use this class when working with large datasets that may not fit in memory. The class
  lazily lists samples as the dataset is iterated over. See the sketch after this list
  for the difference in access patterns.

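To get a feel for the difference in access patterns, the minimal sketch below indexes
into a map-style dataset and iterates over an iterable-style one. It assumes the
standard map-style protocol (``__getitem__``/``__len__``); the ``from_container_url()``
class method it uses is covered in the sections below::

    from azstoragetorch.datasets import BlobDataset, IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # Map-style: random access by index after an eager listing of the container
    map_dataset = BlobDataset.from_container_url(CONTAINER_URL)
    first_sample = map_dataset[0]   # Access any sample directly
    num_samples = len(map_dataset)  # Assumes the standard ``__len__`` protocol

    # Iterable-style: samples are listed lazily while iterating
    iterable_dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)
    for sample in iterable_dataset:
        print(sample["url"])  # Default sample format is a dict with "url" and "data" keys
        break
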
Data samples returned from both datasets map one-to-one to blobs in Azure Blob Storage.
Both classes can be provided directly to a PyTorch :py:class:`~torch.utils.data.DataLoader`
(read more :ref:`here <datasets-guide-with-dataloader>`). When instantiating these dataset
classes, use one of their class methods:

* ``from_container_url()`` - Instantiate a dataset by listing blobs in an Azure Storage container.
* ``from_blob_urls()`` - Instantiate a dataset from provided blob URLs.

Instantiating a dataset directly using ``__init__()`` is **not** supported. Read the
sections below to learn how to use these class methods to create datasets.


Create Dataset from Azure Storage Container
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To create an ``azstoragetorch`` dataset by listing blobs in a single Azure Storage container,
use the dataset class's corresponding ``from_container_url()`` method:

* :py:meth:`azstoragetorch.datasets.BlobDataset.from_container_url()` for the map-style dataset
* :py:meth:`azstoragetorch.datasets.IterableBlobDataset.from_container_url()` for the iterable-style dataset

The methods accept the URL of the Azure Storage container to list blobs from. Listing
is performed using the `List Blobs API <List Blobs API_>`_. For example::

    from azstoragetorch.datasets import BlobDataset, IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # Create a map-style dataset by listing blobs in the container specified by CONTAINER_URL.
    map_dataset = BlobDataset.from_container_url(CONTAINER_URL)

    # Create an iterable-style dataset by listing blobs in the container specified by CONTAINER_URL.
    iterable_dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)

The above example lists all blobs in the container. To only include blobs whose names start with
a specific prefix, provide the ``prefix`` keyword argument::

    from azstoragetorch.datasets import BlobDataset, IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # Create a map-style dataset only including blobs whose name starts with the prefix "images/"
    map_dataset = BlobDataset.from_container_url(CONTAINER_URL, prefix="images/")

    # Create an iterable-style dataset only including blobs whose name starts with the prefix "images/"
    iterable_dataset = IterableBlobDataset.from_container_url(CONTAINER_URL, prefix="images/")


Create Dataset from List of Blobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To create an ``azstoragetorch`` dataset from a pre-defined list of blobs, use the dataset class's
corresponding ``from_blob_urls()`` method:

* :py:meth:`azstoragetorch.datasets.BlobDataset.from_blob_urls()` for the map-style dataset
* :py:meth:`azstoragetorch.datasets.IterableBlobDataset.from_blob_urls()` for the iterable-style dataset

The methods accept a list of blob URLs to create the dataset from. For example::

    from azstoragetorch.datasets import BlobDataset, IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # List of blob URLs to create dataset from. Update with your own blob names.
    blob_urls = [
        f"{CONTAINER_URL}/<blob-name-1>",
        f"{CONTAINER_URL}/<blob-name-2>",
        f"{CONTAINER_URL}/<blob-name-3>",
    ]

    # Create a map-style dataset from the list of blob URLs
    map_dataset = BlobDataset.from_blob_urls(blob_urls)

    # Create an iterable-style dataset from the list of blob URLs
    iterable_dataset = IterableBlobDataset.from_blob_urls(blob_urls)


Transforming Dataset Output
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, dataset samples are dictionaries, each representing a blob
in the dataset, with the keys:

* ``url``: The full endpoint URL of the blob.
* ``data``: The content of the blob as :py:class:`bytes`.

For example, accessing a dataset sample::

    print(map_dataset[0])

prints a dictionary of the following format::

    {
        "url": "https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>",
        "data": b"<blob-content>"
    }

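Because ``data`` holds the raw blob bytes, the default format can be consumed without
any transform. As a brief sketch (the local filename handling here is illustrative,
not part of the library), samples can be written directly to local files::

    import os.path
    import urllib.parse

    from azstoragetorch.datasets import IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)
    for sample in dataset:
        # Derive a local filename from the last segment of the blob URL
        filename = os.path.basename(urllib.parse.urlparse(sample["url"]).path)
        with open(filename, "wb") as f:
            f.write(sample["data"])
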
To override the output format, provide a ``transform`` callable to either ``from_blob_urls``
or ``from_container_url`` when creating the dataset. The ``transform`` callable accepts a
single positional argument of type :py:class:`azstoragetorch.datasets.Blob` representing
a blob in the dataset. This :py:class:`~azstoragetorch.datasets.Blob` object can be used to
retrieve properties and content of the blob as part of the ``transform`` callable.

Emulating the `PyTorch transform tutorial <PyTorch transform tutorial_>`_, the example below shows
how to transform a :py:class:`~azstoragetorch.datasets.Blob` object to a :py:class:`torch.Tensor` of
a :py:mod:`PIL.Image`::

    from azstoragetorch.datasets import BlobDataset, Blob
    import PIL.Image  # Install separately: ``pip install pillow``
    import torch
    import torchvision.transforms  # Install separately: ``pip install torchvision``

    # Update URL with your own Azure Storage account, container, and blob containing an image
    IMAGE_BLOB_URL = "https://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-image-name>"

    # Define transform to convert blob to a tuple of (image_name, image_tensor)
    def to_img_name_and_tensor(blob: Blob) -> tuple[str, torch.Tensor]:
        # Use blob reader to retrieve blob contents and then transform to an image tensor.
        with blob.reader() as f:
            image = PIL.Image.open(f)
            image_tensor = torchvision.transforms.ToTensor()(image)
        return blob.blob_name, image_tensor

    # Provide transform when creating the dataset
    dataset = BlobDataset.from_blob_urls(
        IMAGE_BLOB_URL,
        transform=to_img_name_and_tensor,
    )

    print(dataset[0])  # Prints tuple of (image_name, image_tensor) for blob in dataset

The output should include the blob name and a :py:class:`~torch.Tensor` of the image::

    ("<blob-image-name>", tensor([...]))

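The ``transform`` callable described above can be provided to the iterable-style class
methods as well. A brief sketch reusing the function from the previous example::

    iterable_dataset = IterableBlobDataset.from_blob_urls(
        IMAGE_BLOB_URL,
        transform=to_img_name_and_tensor,
    )

    for name, tensor in iterable_dataset:
        print(name, tensor.shape)  # Prints blob name and shape of each image tensor
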
.. _datasets-guide-with-dataloader:

Using Dataset with PyTorch DataLoader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once instantiated, ``azstoragetorch`` datasets can be provided directly to a PyTorch
:py:class:`~torch.utils.data.DataLoader` for loading samples::

    from azstoragetorch.datasets import BlobDataset
    from torch.utils.data import DataLoader

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    dataset = BlobDataset.from_container_url(CONTAINER_URL)

    # Create a DataLoader to load data samples from the dataset in batches of 32
    dataloader = DataLoader(dataset, batch_size=32)

    for batch in dataloader:
        print(batch["url"])  # Prints blob URLs for each batch of 32 samples

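Standard :py:class:`~torch.utils.data.DataLoader` options can be combined with these
datasets as usual. For example, shuffling requires random access and therefore pairs
with the map-style :py:class:`~azstoragetorch.datasets.BlobDataset`; a brief sketch
reusing the dataset created above::

    # Shuffle samples each epoch; ``shuffle=True`` requires a map-style dataset
    shuffled_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
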
Iterable-style Datasets with Multiple Workers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When using an :py:class:`~azstoragetorch.datasets.IterableBlobDataset` with a
:py:class:`~torch.utils.data.DataLoader` that has multiple workers (i.e., ``num_workers > 1``), the
:py:class:`~azstoragetorch.datasets.IterableBlobDataset` automatically shards the returned
data samples across workers to prevent the :py:class:`~torch.utils.data.DataLoader` from
returning duplicate samples from its workers::

    from azstoragetorch.datasets import IterableBlobDataset
    from torch.utils.data import DataLoader

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)

    # Iterate over the dataset to get the number of samples in it
    num_samples_from_dataset = len([blob["url"] for blob in dataset])

    # Create a DataLoader to load data samples from the dataset in batches of 32 using 4 workers
    dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

    # Iterate over the DataLoader to get the number of samples returned from it
    num_samples_from_dataloader = 0
    for batch in dataloader:
        num_samples_from_dataloader += len(batch["url"])

    # The number of samples returned from the dataset should be equal to the number of samples
    # returned from the DataLoader. If the dataset did not handle sharding, the number of samples
    # returned from the DataLoader would be ``num_workers`` times (i.e., four times) the number
    # of samples in the dataset.
    assert num_samples_from_dataset == num_samples_from_dataloader

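For background, sharding of this kind is typically implemented with PyTorch's
:py:func:`torch.utils.data.get_worker_info`. The sketch below is a generic
illustration of that technique, not the library's exact implementation::

    import torch.utils.data

    class ShardedIterable(torch.utils.data.IterableDataset):
        """Generic iterable dataset that shards its items across DataLoader workers."""

        def __init__(self, items):
            self.items = items

        def __iter__(self):
            worker_info = torch.utils.data.get_worker_info()
            if worker_info is None:
                # Single-process data loading: yield every item
                yield from self.items
            else:
                # Multi-worker loading: each worker yields every num_workers-th item,
                # offset by its worker ID, so no item is yielded twice
                yield from self.items[worker_info.id :: worker_info.num_workers]
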
.. _Azure subscription: https://azure.microsoft.com/free/
.. _Azure storage account: https://learn.microsoft.com/azure/storage/common/storage-account-overview
.. _pip: https://pypi.org/project/pip/
.. _Microsoft Entra ID tokens: https://learn.microsoft.com/azure/storage/blobs/authorize-access-azure-active-directory
.. _DefaultAzureCredential guide: https://learn.microsoft.com/azure/developer/python/sdk/authentication/credential-chains?tabs=dac#defaultazurecredential-overview
.. _SAS: https://learn.microsoft.com/azure/storage/common/storage-sas-overview
.. _PyTorch checkpoint tutorial: https://pytorch.org/tutorials/beginner/saving_loading_models.html
.. _PyTorch dataset tutorial: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#datasets-dataloaders
.. _PyTorch dataset types: https://pytorch.org/docs/stable/data.html#dataset-types
.. _PyTorch dataset map-style: https://pytorch.org/docs/stable/data.html#map-style-datasets
.. _PyTorch dataset iterable-style: https://pytorch.org/docs/stable/data.html#iterable-style-datasets
.. _List Blobs API: https://learn.microsoft.com/rest/api/storageservices/list-blobs?tabs=microsoft-entra-id
.. _PyTorch transform tutorial: https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html
