
Commit 641dd66

Add datasets user guide (Azure#37)
1 parent 2929610 commit 641dd66

File tree: 3 files changed (+227, -2 lines)

doc/sphinx/conf.py

Lines changed: 1 addition & 0 deletions
@@ -38,6 +38,7 @@
     # while the learn.microsoft.com version appears to not host this file.
     'azure-core': ('https://azuresdkdocs.z19.web.core.windows.net/python/azure-core/latest/', None),
     'azure-identity': ('https://azuresdkdocs.z19.web.core.windows.net/python/azure-identity/latest/', None),
+    'pillow': ('https://pillow.readthedocs.io/en/stable/', None),
 }

 autodoc_typehints = 'both'

doc/sphinx/index.rst

Lines changed: 3 additions & 2 deletions
@@ -4,9 +4,10 @@ Azure Storage for PyTorch Documentation
 Azure Storage for PyTorch (``azstoragetorch``) is a library that provides
 seamless, performance-optimized integrations between `Azure Storage`_ and `PyTorch`_.
 Use this library to easily access and store data in Azure Storage while using PyTorch. The
-library currently supports:
+library currently offers:

-* :ref:`Saving and loading PyTorch models (i.e., checkpointing) to and from Azure Blob Storage <checkpoint-guide>`
+* :ref:`File-like object for saving and loading PyTorch models (i.e., checkpointing) with Azure Blob Storage <checkpoint-guide>`
+* :ref:`PyTorch datasets for loading data samples from Azure Blob Storage <datasets-guide>`

 Visit the :ref:`Getting Started <getting-started>` page for more information on how to start using
 Azure Storage for PyTorch.

doc/sphinx/user-guide.rst

Lines changed: 223 additions & 0 deletions
@@ -93,10 +93,233 @@ specify the URL to the blob storing the model weights and use read mode (i.e., `
    model.load_state_dict(torch.load(f))


.. _datasets-guide:

PyTorch Datasets
----------------

PyTorch offers the `Dataset and DataLoader primitives <PyTorch dataset tutorial_>`_ for
loading data samples. ``azstoragetorch`` provides implementations for both types
of PyTorch datasets, `map-style and iterable-style datasets <PyTorch dataset types_>`_,
to load data samples from Azure Blob Storage:

* :py:class:`azstoragetorch.datasets.BlobDataset` - `Map-style dataset <PyTorch dataset map-style_>`_.
  Use this class for random access to data samples. The class eagerly lists samples in
  the dataset on instantiation.

* :py:class:`azstoragetorch.datasets.IterableBlobDataset` - `Iterable-style dataset <PyTorch dataset iterable-style_>`_.
  Use this class when working with large datasets that may not fit in memory. The class
  lazily lists samples as the dataset is iterated over. See the sketch after this list
  for the difference in access patterns.

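To get a feel for the difference in access patterns, the minimal sketch below indexes
into a map-style dataset and iterates over an iterable-style one. It assumes the
standard map-style protocol (``__getitem__``/``__len__``); the ``from_container_url()``
class method it uses is covered in the sections below::

    from azstoragetorch.datasets import BlobDataset, IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # Map-style: random access by index after an eager listing of the container
    map_dataset = BlobDataset.from_container_url(CONTAINER_URL)
    first_sample = map_dataset[0]   # Access any sample directly
    num_samples = len(map_dataset)  # Assumes the standard ``__len__`` protocol

    # Iterable-style: samples are listed lazily while iterating
    iterable_dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)
    for sample in iterable_dataset:
        print(sample["url"])  # Default sample format is a dict with "url" and "data" keys
        break
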
Data samples returned from both datasets map one-to-one to blobs in Azure Blob Storage.
Both classes can be provided directly to a PyTorch :py:class:`~torch.utils.data.DataLoader`
(read more :ref:`here <datasets-guide-with-dataloader>`). When instantiating these dataset
classes, use one of their class methods:

* ``from_container_url()`` - Instantiate a dataset by listing blobs in an Azure Storage container.
* ``from_blob_urls()`` - Instantiate a dataset from provided blob URLs.

Instantiating a dataset directly using ``__init__()`` is **not** supported. Read the
sections below to learn how to use these class methods to create datasets.


Create Dataset from Azure Storage Container
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To create an ``azstoragetorch`` dataset by listing blobs in a single Azure Storage container,
use the dataset class's corresponding ``from_container_url()`` method:

* :py:meth:`azstoragetorch.datasets.BlobDataset.from_container_url()` for the map-style dataset
* :py:meth:`azstoragetorch.datasets.IterableBlobDataset.from_container_url()` for the iterable-style dataset

The methods accept the URL of the Azure Storage container to list blobs from. Listing
is performed using the `List Blobs API <List Blobs API_>`_. For example::

    from azstoragetorch.datasets import BlobDataset, IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # Create a map-style dataset by listing blobs in the container specified by CONTAINER_URL.
    map_dataset = BlobDataset.from_container_url(CONTAINER_URL)

    # Create an iterable-style dataset by listing blobs in the container specified by CONTAINER_URL.
    iterable_dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)

The above example lists all blobs in the container. To only include blobs whose names start with
a specific prefix, provide the ``prefix`` keyword argument::

    from azstoragetorch.datasets import BlobDataset, IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # Create a map-style dataset only including blobs whose name starts with the prefix "images/"
    map_dataset = BlobDataset.from_container_url(CONTAINER_URL, prefix="images/")

    # Create an iterable-style dataset only including blobs whose name starts with the prefix "images/"
    iterable_dataset = IterableBlobDataset.from_container_url(CONTAINER_URL, prefix="images/")


Create Dataset from List of Blobs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

To create an ``azstoragetorch`` dataset from a pre-defined list of blobs, use the dataset class's
corresponding ``from_blob_urls()`` method:

* :py:meth:`azstoragetorch.datasets.BlobDataset.from_blob_urls()` for the map-style dataset
* :py:meth:`azstoragetorch.datasets.IterableBlobDataset.from_blob_urls()` for the iterable-style dataset

The methods accept a list of blob URLs to create the dataset from. For example::

    from azstoragetorch.datasets import BlobDataset, IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    # List of blob URLs to create dataset from. Update with your own blob names.
    blob_urls = [
        f"{CONTAINER_URL}/<blob-name-1>",
        f"{CONTAINER_URL}/<blob-name-2>",
        f"{CONTAINER_URL}/<blob-name-3>",
    ]

    # Create a map-style dataset from the list of blob URLs
    map_dataset = BlobDataset.from_blob_urls(blob_urls)

    # Create an iterable-style dataset from the list of blob URLs
    iterable_dataset = IterableBlobDataset.from_blob_urls(blob_urls)


Transforming Dataset Output
~~~~~~~~~~~~~~~~~~~~~~~~~~~

By default, dataset samples are dictionaries, each representing a blob
in the dataset, with the keys:

* ``url``: The full endpoint URL of the blob.
* ``data``: The content of the blob as :py:class:`bytes`.

For example, accessing a dataset sample::

    print(map_dataset[0])

prints a dictionary of the following format::

    {
        "url": "https://<account-name>.blob.core.windows.net/<container-name>/<blob-name>",
        "data": b"<blob-content>"
    }

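Because ``data`` holds the raw blob bytes, the default format can be consumed without
any transform. As a brief sketch (the local filename handling here is illustrative,
not part of the library), samples can be written directly to local files::

    import os.path
    import urllib.parse

    from azstoragetorch.datasets import IterableBlobDataset

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)
    for sample in dataset:
        # Derive a local filename from the last segment of the blob URL
        filename = os.path.basename(urllib.parse.urlparse(sample["url"]).path)
        with open(filename, "wb") as f:
            f.write(sample["data"])
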
To override the output format, provide a ``transform`` callable to either ``from_blob_urls``
or ``from_container_url`` when creating the dataset. The ``transform`` callable accepts a
single positional argument of type :py:class:`azstoragetorch.datasets.Blob` representing
a blob in the dataset. This :py:class:`~azstoragetorch.datasets.Blob` object can be used to
retrieve properties and content of the blob as part of the ``transform`` callable.

Emulating the `PyTorch transform tutorial <PyTorch transform tutorial_>`_, the example below shows
how to transform a :py:class:`~azstoragetorch.datasets.Blob` object to a :py:class:`torch.Tensor` of
a :py:mod:`PIL.Image`::

    from azstoragetorch.datasets import BlobDataset, Blob
    import PIL.Image  # Install separately: ``pip install pillow``
    import torch
    import torchvision.transforms  # Install separately: ``pip install torchvision``

    # Update URL with your own Azure Storage account, container, and blob containing an image
    IMAGE_BLOB_URL = "https://<storage-account-name>.blob.core.windows.net/<container-name>/<blob-image-name>"

    # Define transform to convert blob to a tuple of (image_name, image_tensor)
    def to_img_name_and_tensor(blob: Blob) -> tuple[str, torch.Tensor]:
        # Use blob reader to retrieve blob contents and then transform to an image tensor.
        with blob.reader() as f:
            image = PIL.Image.open(f)
            image_tensor = torchvision.transforms.ToTensor()(image)
        return blob.blob_name, image_tensor

    # Provide transform when creating the dataset
    dataset = BlobDataset.from_blob_urls(
        IMAGE_BLOB_URL,
        transform=to_img_name_and_tensor,
    )

    print(dataset[0])  # Prints tuple of (image_name, image_tensor) for blob in dataset

The output should include the blob name and a :py:class:`~torch.Tensor` of the image::

    ("<blob-image-name>", tensor([...]))

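The ``transform`` callable described above can be provided to the iterable-style class
methods as well. A brief sketch reusing the function from the previous example::

    iterable_dataset = IterableBlobDataset.from_blob_urls(
        IMAGE_BLOB_URL,
        transform=to_img_name_and_tensor,
    )

    for name, tensor in iterable_dataset:
        print(name, tensor.shape)  # Prints blob name and shape of each image tensor
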
.. _datasets-guide-with-dataloader:

Using Dataset with PyTorch DataLoader
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Once instantiated, ``azstoragetorch`` datasets can be provided directly to a PyTorch
:py:class:`~torch.utils.data.DataLoader` for loading samples::

    from azstoragetorch.datasets import BlobDataset
    from torch.utils.data import DataLoader

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    dataset = BlobDataset.from_container_url(CONTAINER_URL)

    # Create a DataLoader to load data samples from the dataset in batches of 32
    dataloader = DataLoader(dataset, batch_size=32)

    for batch in dataloader:
        print(batch["url"])  # Prints blob URLs for each batch of 32 samples

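Standard :py:class:`~torch.utils.data.DataLoader` options can be combined with these
datasets as usual. For example, shuffling requires random access and therefore pairs
with the map-style :py:class:`~azstoragetorch.datasets.BlobDataset`; a brief sketch
reusing the dataset created above::

    # Shuffle samples each epoch; ``shuffle=True`` requires a map-style dataset
    shuffled_dataloader = DataLoader(dataset, batch_size=32, shuffle=True)
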
Iterable-style Datasets with Multiple Workers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

When using an :py:class:`~azstoragetorch.datasets.IterableBlobDataset` with a
:py:class:`~torch.utils.data.DataLoader` that has multiple workers (i.e., ``num_workers > 1``), the
:py:class:`~azstoragetorch.datasets.IterableBlobDataset` automatically shards the returned
data samples across workers to prevent the :py:class:`~torch.utils.data.DataLoader` from
returning duplicate samples from its workers::

    from azstoragetorch.datasets import IterableBlobDataset
    from torch.utils.data import DataLoader

    # Update URL with your own Azure Storage account and container name
    CONTAINER_URL = "https://<my-storage-account-name>.blob.core.windows.net/<my-container-name>"

    dataset = IterableBlobDataset.from_container_url(CONTAINER_URL)

    # Iterate over the dataset to get the number of samples in it
    num_samples_from_dataset = len([blob["url"] for blob in dataset])

    # Create a DataLoader to load data samples from the dataset in batches of 32 using 4 workers
    dataloader = DataLoader(dataset, batch_size=32, num_workers=4)

    # Iterate over the DataLoader to get the number of samples returned from it
    num_samples_from_dataloader = 0
    for batch in dataloader:
        num_samples_from_dataloader += len(batch["url"])

    # The number of samples returned from the dataset should be equal to the number of samples
    # returned from the DataLoader. If the dataset did not handle sharding, the number of samples
    # returned from the DataLoader would be ``num_workers`` times (i.e., four times) the number
    # of samples in the dataset.
    assert num_samples_from_dataset == num_samples_from_dataloader

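For background, sharding of this kind is typically implemented with PyTorch's
:py:func:`torch.utils.data.get_worker_info`. The sketch below is a generic
illustration of that technique, not the library's exact implementation::

    import torch.utils.data

    class ShardedIterable(torch.utils.data.IterableDataset):
        """Generic iterable dataset that shards its items across DataLoader workers."""

        def __init__(self, items):
            self.items = items

        def __iter__(self):
            worker_info = torch.utils.data.get_worker_info()
            if worker_info is None:
                # Single-process data loading: yield every item
                yield from self.items
            else:
                # Multi-worker loading: each worker yields every num_workers-th item,
                # offset by its worker ID, so no item is yielded twice
                yield from self.items[worker_info.id :: worker_info.num_workers]
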
.. _Azure subscription: https://azure.microsoft.com/free/
.. _Azure storage account: https://learn.microsoft.com/azure/storage/common/storage-account-overview
.. _pip: https://pypi.org/project/pip/
.. _Microsoft Entra ID tokens: https://learn.microsoft.com/azure/storage/blobs/authorize-access-azure-active-directory
.. _DefaultAzureCredential guide: https://learn.microsoft.com/azure/developer/python/sdk/authentication/credential-chains?tabs=dac#defaultazurecredential-overview
.. _SAS: https://learn.microsoft.com/azure/storage/common/storage-sas-overview
.. _PyTorch checkpoint tutorial: https://pytorch.org/tutorials/beginner/saving_loading_models.html
.. _PyTorch dataset tutorial: https://pytorch.org/tutorials/beginner/basics/data_tutorial.html#datasets-dataloaders
.. _PyTorch dataset types: https://pytorch.org/docs/stable/data.html#dataset-types
.. _PyTorch dataset map-style: https://pytorch.org/docs/stable/data.html#map-style-datasets
.. _PyTorch dataset iterable-style: https://pytorch.org/docs/stable/data.html#iterable-style-datasets
.. _List Blobs API: https://learn.microsoft.com/rest/api/storageservices/list-blobs?tabs=microsoft-entra-id
.. _PyTorch transform tutorial: https://pytorch.org/tutorials/beginner/basics/transforms_tutorial.html
