Zarr version
v3.0.6
Environment
OS: Linux Ubuntu
Python: 3.11.7
Dependencies: numcodecs[pcodec]==0.15.1, xarray==2025.3.1, dask==2024.2.0
Description
A Zarr store containing GOES-16 satellite data (dimensions t: 8741, lat: 8133, lon: 8130; chunks: (24, 512, 512)) exhibits severe performance issues:
- Operations like .compute(), rolling-window means, and plotting take orders of magnitude longer than expected.
- Jupyter kernels report "IOStream.flush timed out" errors during computation/plotting.
The final dataset covering all 2023 GOES imagery is about 11 TB, with six more years of data still to be processed.
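For reference, the access pattern that triggers the slowdown looks roughly like the sketch below. This is an illustration only: the store path is hypothetical, the real dataset is assumed to expose named dimensions t/lat/lon and the variable CMI_C07, and the chunks argument simply aligns dask chunks with the on-disk Zarr chunks.
import xarray as xr

# Open the consolidated store with dask chunks matching the on-disk
# Zarr chunks (24, 512, 512) so each dask task reads whole chunks.
ds = xr.open_zarr("goes_2023.zarr", consolidated=True,
                  chunks={"t": 24, "lat": 512, "lon": 512})

# Typical operations that stall or time out:
rolled = ds["CMI_C07"].rolling(t=24, min_periods=1).mean()
rolled.isel(t=0).compute()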
Steps to Reproduce
Minimal code to reproduce the store structure is below:
import zarr
import numpy as np
from numcodecs.zarr3 import PCodec
from zarr.codecs import ZstdCodec
# Create a store with a similar structure (zarr-python v3 uses LocalStore)
store = zarr.storage.LocalStore("test_store.zarr")
root = zarr.group(store=store)
# Simulate a single band (repeat for multiple bands)
chunks = (24, 512, 512)
shape = (8741, 8133, 8130)
band = root.create_array(
    name="CMI_C07",
    shape=shape,
    chunks=chunks,
    dtype="uint16",
    compressors=ZstdCodec(level=9),
    serializer=PCodec(level=9),
)
# Simulate data appending
for i in range(0, shape[0], chunks[0]):
    band[i:i+chunks[0]] = np.random.randint(0, 65535, (chunks[0],) + chunks[1:], dtype="uint16")
# Consolidate metadata
zarr.consolidate_metadata(store)
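To localize the bottleneck without dask/xarray in the loop, a simple timing sketch against the synthetic store above (pixel coordinates chosen arbitrarily) compares an along-time read, which must decompress roughly one chunk per 24 time steps, with a single whole-chunk read:
import time

arr = zarr.open_group("test_store.zarr", mode="r")["CMI_C07"]

t0 = time.perf_counter()
_ = arr[:, 4000, 4000]        # one pixel across all 8741 time steps (~365 chunks decompressed)
print("time-series read:", time.perf_counter() - t0, "s")

t0 = time.perf_counter()
_ = arr[0:24, 0:512, 0:512]   # exactly one chunk
print("single-chunk read:", time.perf_counter() - t0, "s")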
Questions for Zarr Devs
- Are there known performance bottlenecks with high-dimensional, multi-band Zarr stores?
- Could chunking/sharding interact poorly with dask/xarray for time-series operations? (A sharding sketch follows this list.)
- Where could the bottlenecks be, and what strategies would improve performance?
- What recommendations and considerations apply when using Zarr to generate very large public datasets?
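Regarding the sharding question above, a hedged sketch of what a sharded variant of the band from the reproduction code might look like with the zarr-python v3 API (the chunk and shard sizes are illustrative, not recommendations):
from zarr.codecs import ZstdCodec

# Hypothetical sharded layout: small inner chunks keep the unit of
# decompression small for time-series reads, while larger shards keep
# the number of stored objects manageable.
band_sharded = root.create_array(
    name="CMI_C07_sharded",
    shape=shape,
    chunks=(24, 256, 256),       # inner chunk: unit of decompression
    shards=(24, 1024, 1024),     # one shard object holds 16 inner chunks
    dtype="uint16",
    compressors=ZstdCodec(level=9),
)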
Additional Context
Full Dataset creation code: https://github.com/mickyals/goes2zarr/blob/main/convert_goes_to_zarr.py