A quick sketch of how we can couple zarr with async code. This is aimed slightly at pyscript, but can be useful in its own right: for instance, I asked a while ago what it would take to fetch chunks concurrently not just from one array but, say, one chunk each from multiple arrays in a dataset.
This sketch is only for reading...
Outline:
- we subclass Group, so that `__getitem__` produces an AsyncArray (see the sketches after this list)
- we subclass Array as AsyncArray and override everything from its `_get_selection` (which does the IO) up to `__getitem__` (which is the user-facing API)
- we have three stores (sketched in code below):
  - a synchronous HTTP one for the dataset metadata. This can be based on `requests` for standard Python or pyfetch under pyodide. Note that sync calls in pyodide are limited to text, which is perfect for this use case.
  - a fake synchronous store which merely records the paths that are attempted, but returns FileNotFound for all of them
  - a fake synchronous store in which we have prefilled all the keys it will ever need, i.e., this can be a simple dict
- The flow goes as follows (see the end-to-end sketch below):
  - a zarr AsyncGroup is made by reading the JSON metadata files synchronously
  - when we attempt to get data, we make a coroutine in which we first use the fake store and zarr's existing machinery to record all the keys that will be needed (this temporarily produces an array of NaN); we then fetch all of those keys concurrently, populate a dict with them, and have the existing zarr machinery read from that dict
- For interest, this is an fsspec async filesystem for pyodide. We don't need it to be this verbose for zarr.
- Note that in the browser, no fetches can ever be done without considering CORS, but any dataset known to work with zarr.js will work for this case too.
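
For concreteness, here is a minimal sketch of the three stores, assuming zarr-python 2.x and its mapping-style store interface (string keys to bytes, KeyError for anything missing). The class names `HTTPJSONStore` and `RecordingStore` are invented for illustration.

```python
from collections.abc import Mapping

import requests  # plain CPython; under pyodide this role is played by pyfetch/open_url


class HTTPJSONStore(Mapping):
    """Synchronous, read-only store used only for the small JSON metadata files."""

    def __init__(self, base_url):
        self.base_url = base_url.rstrip("/")

    def __getitem__(self, key):
        resp = requests.get(f"{self.base_url}/{key}")
        if resp.status_code != 200:
            raise KeyError(key)
        return resp.content

    # zarr only needs __getitem__/__contains__ for reading; these keep Mapping happy
    def __iter__(self):
        return iter(())

    def __len__(self):
        return 0


class RecordingStore(dict):
    """A dict pre-filled with metadata that notes every missing key it is asked for.

    zarr treats a missing chunk key as fill_value, so reading a selection through
    this store completes (yielding a NaN/fill-value array) and leaves behind the
    list of chunk keys a real read would need.
    """

    def __init__(self, metadata):
        super().__init__(metadata)
        self.requested = []

    def __missing__(self, key):
        self.requested.append(key)
        raise KeyError(key)


# The third store needs no class at all: once the chunk bytes are fetched, a plain
# dict holding {metadata keys: JSON bytes, chunk keys: raw bytes} is itself a
# valid zarr store.
```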
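And a hedged sketch of the array half, reusing `RecordingStore` from above: rather than overriding every method in the `_get_selection` → `__getitem__` chain, it exposes a single `getitem` coroutine that performs the record-then-fetch-then-read steps; the Group subclass, whose `__getitem__` would simply hand back an `AsyncArray` over the already-fetched metadata, is left out. `fetch_chunks` and aiohttp stand in here for pyfetch in the browser.

```python
import asyncio

import aiohttp
import zarr


async def fetch_chunks(base_url, keys):
    """Fetch all requested keys concurrently; keys that 404 simply stay missing."""
    async with aiohttp.ClientSession() as session:

        async def one(key):
            async with session.get(f"{base_url}/{key}") as resp:
                if resp.status != 200:
                    return key, None
                return key, await resp.read()

        pairs = await asyncio.gather(*(one(k) for k in keys))
    return {k: v for k, v in pairs if v is not None}


class AsyncArray(zarr.Array):
    """A read-only zarr Array whose user-facing read is a coroutine.

    `store` is assumed to be a plain dict already holding the JSON metadata
    (".zarray" etc.), fetched synchronously beforehand; chunk bytes are pulled
    in concurrently on demand and merged into that same dict.
    """

    def __init__(self, store, base_url):
        super().__init__(store, read_only=True)
        self.base_url = base_url

    async def getitem(self, selection):
        # 1) dry run over a recording store: zarr walks the selection and asks
        #    for every chunk key it needs, getting fill_value (NaN) back
        recorder = RecordingStore(dict(self.store))
        zarr.Array(recorder, read_only=True)[selection]

        # 2) fetch all of the recorded keys concurrently
        self.store.update(await fetch_chunks(self.base_url, recorder.requested))

        # 3) real read: zarr's ordinary synchronous machinery now finds every
        #    chunk in the dict, so no further IO happens here
        return super().__getitem__(selection)
```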
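Putting the flow together end to end (the URL, array shape, and selection are placeholders; under pyodide/pyscript this coroutine would be awaited on the page's event loop rather than via `asyncio.run`):

```python
async def main():
    base_url = "https://example.com/data.zarr/temperature"  # hypothetical, CORS-enabled
    # step 1: synchronous metadata read
    meta = HTTPJSONStore(base_url)
    store = {".zarray": meta[".zarray"]}
    arr = AsyncArray(store, base_url)
    # steps 2-3: chunk keys recorded, fetched concurrently, then read from the dict
    block = await arr.getitem((slice(0, 100), slice(None)))  # assumes a 2-D array
    print(block.shape)


asyncio.run(main())
```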