@@ -93,10 +93,233 @@ specify the URL to the blob storing the model weights and use read mode (i.e., `
model.load_state_dict(torch.load(f))
+ .. _datasets-guide:
+
+ PyTorch Datasets
+ ----------------
+
+ PyTorch offers the `Dataset and DataLoader primitives <PyTorch dataset tutorial_>`_ for
+ loading data samples. ``azstoragetorch`` provides implementations for both types
+ of PyTorch datasets, `map-style and iterable-style datasets <PyTorch dataset types_>`_,
+ to load data samples from Azure Blob Storage:
+
+ * :py:class:`azstoragetorch.datasets.BlobDataset` - `Map-style dataset <PyTorch dataset map-style_>`_.
+   Use this class for random access to data samples. The class eagerly lists samples in
+   the dataset on instantiation.
+
+ * :py:class:`azstoragetorch.datasets.IterableBlobDataset` - `Iterable-style dataset <PyTorch dataset iterable-style_>`_.
+   Use this class when working with large datasets that may not fit in memory. The class
+   lazily lists samples as the dataset is iterated over.
+
+ Data samples returned from both datasets map one-to-one to blobs in Azure Blob Storage.
+ Both classes can be provided directly to a PyTorch :py:class:`~torch.utils.data.DataLoader`
+ (read more :ref:`here <datasets-guide-with-dataloader>`). When instantiating these dataset
+ classes, use one of their class methods:
+
+ * ``from_container_url()`` - Instantiate a dataset by listing blobs in an Azure Storage container.
+ * ``from_blob_urls()`` - Instantiate a dataset from a provided list of blob URLs.
+
+ Instantiation directly using ``__init__()`` is **not** supported. Read the sections below on
+ how to use these class methods to create datasets.
+
+
+ Create Dataset from Azure Storage Container
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ To create an ``azstoragetorch`` dataset by listing blobs in a single Azure Storage container,
+ use the dataset class's corresponding ``from_container_url()`` method:
+
+ * :py:meth:`azstoragetorch.datasets.BlobDataset.from_container_url()` for the map-style dataset
+ * :py:meth:`azstoragetorch.datasets.IterableBlobDataset.from_container_url()` for the iterable-style dataset
+
+ The methods accept the URL of the Azure Storage container to list blobs from. Listing
+ is performed using the `List Blobs API <List Blobs API_>`_. For example::
+
+     from azstoragetorch.datasets import BlobDataset, IterableBlobDataset
+
+     # Update URL with your own Azure Storage account and container name
+     CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"
+
+     # Create a map-style dataset by listing blobs in the container specified by CONTAINER_URL
+     map_dataset = BlobDataset.from_container_url(CONTAINER_URL)
+
+     # Create an iterable-style dataset by listing blobs in the container specified by CONTAINER_URL
+     iterable_dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)
+
+ The above examples list all blobs in the container. To include only blobs whose name starts with
+ a specific prefix, provide the ``prefix`` keyword argument::
+
+     from azstoragetorch.datasets import BlobDataset, IterableBlobDataset
+
+     # Update URL with your own Azure Storage account and container name
+     CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"
+
+     # Create a map-style dataset including only blobs whose name starts with the prefix "images/"
+     map_dataset = BlobDataset.from_container_url(CONTAINER_URL, prefix="images/")
+
+     # Create an iterable-style dataset including only blobs whose name starts with the prefix "images/"
+     iterable_dataset = IterableBlobDataset.from_container_url(CONTAINER_URL, prefix="images/")
+
+
+ Create Dataset from List of Blobs
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ To create an ``azstoragetorch`` dataset from a pre-defined list of blobs, use the dataset class's
+ corresponding ``from_blob_urls()`` method:
+
+ * :py:meth:`azstoragetorch.datasets.BlobDataset.from_blob_urls()` for the map-style dataset
+ * :py:meth:`azstoragetorch.datasets.IterableBlobDataset.from_blob_urls()` for the iterable-style dataset
+
+ The methods accept a list of blob URLs to create the dataset from. For example::
+
+     from azstoragetorch.datasets import BlobDataset, IterableBlobDataset
+
+     # Update URL with your own Azure Storage account and container name
+     CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"
+
+     # List of blob URLs to create dataset from. Update with your own blob names.
+     blob_urls = [
+         f"{CONTAINER_URL}/<blob-name-1>",
+         f"{CONTAINER_URL}/<blob-name-2>",
+         f"{CONTAINER_URL}/<blob-name-3>",
+     ]
+
+     # Create a map-style dataset from the list of blob URLs
+     map_dataset = BlobDataset.from_blob_urls(blob_urls)
+
+     # Create an iterable-style dataset from the list of blob URLs
+     iterable_dataset = IterableBlobDataset.from_blob_urls(blob_urls)
+
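+ Once created, samples can be read from either dataset type. Below is a minimal sketch,
+ assuming the default output format described in the next section (the ``len()`` call relies
+ on map-style datasets supporting :py:func:`len`, as PyTorch map-style datasets conventionally do)::
+
+     # Map-style dataset: random access by index
+     print(len(map_dataset))      # Number of blobs in the dataset
+     first_sample = map_dataset[0]
+     print(first_sample["url"])   # URL of the first blob
+
+     # Iterable-style dataset: samples are listed and loaded lazily during iteration
+     for sample in iterable_dataset:
+         print(sample["url"])
+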
+
+ Transforming Dataset Output
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ The default output format of a dataset sample is a dictionary representing a blob
+ in the dataset. Each dictionary has the keys:
+
+ * ``url``: The full endpoint URL of the blob.
+ * ``data``: The content of the blob as :py:class:`bytes`.
+
+ For example, when accessing a dataset sample::
+
+     print(map_dataset[0])
+
+ It will have the following format::
+
+     {
+         "url": "https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>",
+         "data": b"<blob-content>"
+     }
+
+ To override the output format, provide a ``transform`` callable to either ``from_blob_urls``
+ or ``from_container_url`` when creating the dataset. The ``transform`` callable accepts a
+ single positional argument of type :py:class:`azstoragetorch.datasets.Blob` representing
+ a blob in the dataset. This :py:class:`~azstoragetorch.datasets.Blob` object can be used to
+ retrieve properties and content of the blob as part of the ``transform`` callable.
+
+ Emulating the `PyTorch transform tutorial <PyTorch transform tutorial_>`_, the example below shows
+ how to transform a :py:class:`~azstoragetorch.datasets.Blob` object to a :py:class:`torch.Tensor` of
+ a :py:mod:`PIL.Image`::
+
+     from azstoragetorch.datasets import BlobDataset, Blob
+     import PIL.Image  # Install separately: ``pip install pillow``
+     import torch
+     import torchvision.transforms  # Install separately: ``pip install torchvision``
+
+     # Update URL with your own Azure Storage account, container, and blob containing an image
+     IMAGE_BLOB_URL = "https://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-image-name>"
+
+     # Define transform to convert blob to a tuple of (image_name, image_tensor)
+     def to_img_name_and_tensor(blob: Blob) -> tuple[str, torch.Tensor]:
+         # Use blob reader to retrieve blob contents and then transform to an image tensor
+         with blob.reader() as f:
+             image = PIL.Image.open(f)
+             image_tensor = torchvision.transforms.ToTensor()(image)
+         return blob.blob_name, image_tensor
+
+     # Provide transform to the dataset class method
+     dataset = BlobDataset.from_blob_urls(
+         IMAGE_BLOB_URL,
+         transform=to_img_name_and_tensor,
+     )
+
+     print(dataset[0])  # Prints tuple of (image_name, image_tensor) for blob in dataset
+
+ The output should include the blob name and a :py:class:`~torch.Tensor` of the image::
+
+     ("<blob-image-name>", tensor([...]))
+
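+ Transforms work the same way for both dataset types. Below is a minimal sketch, assuming a
+ container of UTF-8 text blobs (the ``prefix`` value ``"docs/"`` is illustrative), that decodes
+ each blob to a string with an :py:class:`~azstoragetorch.datasets.IterableBlobDataset`::
+
+     from azstoragetorch.datasets import IterableBlobDataset, Blob
+
+     # Update URL with your own Azure Storage account and container name
+     CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"
+
+     # Define transform to convert blob to a tuple of (blob_name, text)
+     def to_text(blob: Blob) -> tuple[str, str]:
+         with blob.reader() as f:
+             return blob.blob_name, f.read().decode("utf-8")
+
+     dataset = IterableBlobDataset.from_container_url(
+         CONTAINER_URL,
+         prefix="docs/",  # Illustrative prefix; update or omit for your container
+         transform=to_text,
+     )
+
+     for blob_name, text in dataset:
+         print(blob_name, text[:80])  # Print each blob's name and its first 80 characters
+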
+
+ .. _datasets-guide-with-dataloader:
+
+ Using Dataset with PyTorch DataLoader
+ ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+ Once instantiated, ``azstoragetorch`` datasets can be provided directly to a PyTorch
+ :py:class:`~torch.utils.data.DataLoader` for loading samples::
+
+     from azstoragetorch.datasets import BlobDataset
+     from torch.utils.data import DataLoader
+
+     # Update URL with your own Azure Storage account and container name
+     CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"
+
+     dataset = BlobDataset.from_container_url(CONTAINER_URL)
+
+     # Create a DataLoader to load data samples from the dataset in batches of 32
+     dataloader = DataLoader(dataset, batch_size=32)
+
+     for batch in dataloader:
+         print(batch["url"])  # Prints blob URLs for each 32-sample batch
+
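+ Standard :py:class:`~torch.utils.data.DataLoader` features apply as usual. Below is a minimal
+ sketch continuing the example above, assuming the ``to_img_name_and_tensor`` transform from the
+ previous section and a container of same-sized images (so default collation can stack the tensors)::
+
+     # Shuffling via the DataLoader requires a map-style dataset such as BlobDataset;
+     # iterable-style datasets do not support sampler-based shuffling.
+     dataset = BlobDataset.from_container_url(CONTAINER_URL, transform=to_img_name_and_tensor)
+     dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
+
+     for names, image_tensors in dataloader:
+         print(names, image_tensors.shape)  # Batch of blob names and stacked image tensors
+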
+
+ Iterable-style Datasets with Multiple Workers
+ ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+ When using an :py:class:`~azstoragetorch.datasets.IterableBlobDataset` with a
+ :py:class:`~torch.utils.data.DataLoader` that has multiple workers (i.e., ``num_workers > 1``),
+ the :py:class:`~azstoragetorch.datasets.IterableBlobDataset` automatically shards the data
+ samples across workers so that the :py:class:`~torch.utils.data.DataLoader` does not return
+ duplicate samples from its workers::
+
+     from azstoragetorch.datasets import IterableBlobDataset
+     from torch.utils.data import DataLoader
+
+     # Update URL with your own Azure Storage account and container name
+     CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"
+
+     dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)
+
+     # Iterate over the dataset to get the number of samples in it
+     num_samples_from_dataset = len([blob["url"] for blob in dataset])
+
+     # Create a DataLoader to load data samples from the dataset in batches of 32 using 4 workers
+     dataloader = DataLoader(dataset, batch_size=32, num_workers=4)
+
+     # Iterate over the DataLoader to get the number of samples returned from it
+     num_samples_from_dataloader = 0
+     for batch in dataloader:
+         num_samples_from_dataloader += len(batch["url"])
+
+     # The number of samples returned from the dataset should be equal to the number of samples
+     # returned from the DataLoader. If the dataset did not handle sharding, the number of samples
+     # returned from the DataLoader would be ``num_workers`` times (i.e., four times) the number
+     # of samples in the dataset.
+     assert num_samples_from_dataset == num_samples_from_dataloader
+
+
.. _Azure subscription: https://azure.microsoft.com/free/
.. _Azure storage account: https://learn.microsoft.com/azure/storage/common/storage-account-overview
.. _pip: https://pypi.org/project/pip/
.. _Microsoft Entra ID tokens: https://learn.microsoft.com/azure/storage/blobs/authorize-access-azure-active-directory
.. _DefaultAzureCredential guide: https://learn.microsoft.com/azure/developer/python/sdk/authentication/credential-chains?tabs=dac#defaultazurecredential-overview
.. _SAS: https://learn.microsoft.com/azure/storage/common/storage-sas-overview
.. _PyTorch checkpoint tutorial: https://pytorch.org/tutorials/beginner/saving_loading_models.html
+ .. _PyTorch dataset tutorial: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#datasets-dataloaders
+ .. _PyTorch dataset types: https://pytorch.org/docs/stable/data.html#dataset-types
+ .. _PyTorch dataset map-style: https://pytorch.org/docs/stable/data.html#map-style-datasets
+ .. _PyTorch dataset iterable-style: https://pytorch.org/docs/stable/data.html#iterable-style-datasets
+ .. _List Blobs API: https://learn.microsoft.com/rest/api/storageservices/list-blobs?tabs=microsoft-entra-id
+ .. _PyTorch transform tutorial: https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html