Extreme Slowness/Timeouts with Large Dataset (Chunked, Multi-Band Zarr Store) #3085

Open
@mickyals

Zarr version

v3.0.6

Environment

OS: Linux Ubuntu

Python: 3.11.7

Dependencies: numcodecs[pcodec]==0.15.1, xarray==2025.3.1, dask==2024.2.0

Description

A Zarr store containing GOES-16 satellite data (dimensions: t:8741, lat:8133, lon:8130, chunks (24, 512, 512)) exhibits severe performance issues:

  • Operations like .compute(), rolling-window means, and plotting take orders of magnitude longer than expected.
  • Jupyter kernels time out with the message "IOStream.flush timed out" during computation/plotting.

The complete dataset of 2023 GOES imagery is about 11 TB, and six more years of data remain to be processed.
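For scale, the chunk grid implied by these dimensions can be worked out with plain arithmetic. The sketch below uses only the shape, chunk shape, and uint16 dtype given in this report. All sizes are uncompressed, so on-disk figures will differ depending on the codecs.

```python
from math import ceil, prod

shape = (8741, 8133, 8130)   # (t, lat, lon) from the report
chunks = (24, 512, 512)      # chunk shape from the report
itemsize = 2                 # bytes per uint16 value

# Number of chunks along each axis (edge chunks count as whole chunks).
grid = tuple(ceil(s / c) for s, c in zip(shape, chunks))

n_chunks = prod(grid)                  # chunks per band
chunk_bytes = prod(chunks) * itemsize  # uncompressed bytes per chunk
band_bytes = prod(shape) * itemsize    # uncompressed bytes per band

print("chunk grid:", grid)
print("chunks per band:", n_chunks)
print("uncompressed chunk size (MiB):", chunk_bytes / 2**20)
print("uncompressed band size (TB):", band_bytes / 1e12)
```

Each band is therefore split into tens of thousands of chunk objects, which matters for any operation that has to list, open, or schedule them individually.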

Steps to Reproduce

Minimal code that reproduces the structure of the store:

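import numpy as np
import zarr
from numcodecs.zarr3 import PCodec
from zarr.codecs import ZstdCodec
from zarr.storage import LocalStore

# Create a store with a similar structure
# (Zarr v3: LocalStore replaces the v2 DirectoryStore)
store = LocalStore("test_store.zarr")
root = zarr.group(store=store)

# Simulate a single band (repeat for multiple bands)
chunks = (24, 512, 512)
shape = (8741, 8133, 8130)
band = root.create_array(
    name="CMI_C07",
    shape=shape,
    chunks=chunks,
    dtype="uint16",
    serializer=PCodec(level=9),          # array-to-bytes codec
    compressors=ZstdCodec(level=9),      # bytes-to-bytes codec
)

# Simulate appending data one time slab at a time.
# Each slab covers the full lat/lon extent (about 3 GB of uint16),
# and the final slab is shorter than chunks[0].
for i in range(0, shape[0], chunks[0]):
    stop = min(i + chunks[0], shape[0])
    band[i:stop] = np.random.randint(0, 65535, (stop - i,) + shape[1:], dtype="uint16")

# Consolidate metadata
zarr.consolidate_metadata(store)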

Questions for Zarr Devs

  1. Are there known performance bottlenecks with high-dimensional, multi-band Zarr stores?
  2. Could chunking/sharding interact poorly with dask/xarray for time-series operations?
  3. Where could the bottlenecks be, and what strategies would improve performance?
  4. What are your recommendations and considerations for using Zarr to generate very large public datasets?
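To make question 2 concrete, here is a rough sketch of how many chunks two common access patterns touch with the chunking above. It assumes every touched chunk is read and decoded in full, with no sharding or caching, and it ignores compression. The helper names and the pixel index 4000 are hypothetical and chosen only for illustration.

```python
from math import prod

SHAPE = (8741, 8133, 8130)   # (t, lat, lon) from the report
CHUNKS = (24, 512, 512)      # chunk shape from the report
ITEMSIZE = 2                 # bytes per uint16 value


def chunks_touched(selection):
    """Count chunks intersecting a selection given as (start, stop) per axis."""
    counts = []
    for (start, stop), c in zip(selection, CHUNKS):
        first = start // c
        last = (stop - 1) // c
        counts.append(last - first + 1)
    return prod(counts)


def read_amplification(selection):
    """Ratio of bytes decoded to bytes requested (assumes whole-chunk reads)."""
    requested = prod(stop - start for start, stop in selection) * ITEMSIZE
    decoded = chunks_touched(selection) * prod(CHUNKS) * ITEMSIZE
    return decoded / requested


# Full time series at one pixel (arbitrary location).
pixel_series = ((0, SHAPE[0]), (4000, 4001), (4000, 4001))
# One full-disk image at a single time step.
single_frame = ((0, 1), (0, SHAPE[1]), (0, SHAPE[2]))

for name, sel in [("pixel time series", pixel_series),
                  ("single frame", single_frame)]:
    print(name, chunks_touched(sel), round(read_amplification(sel), 1))
```

Under these assumptions, both patterns decode far more data than they return, and the single-pixel time series is by far the worse of the two. That would be consistent with the slow plotting and rolling-window behaviour described above, though it does not rule out other causes.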

Additional Context

Full Dataset creation code: https://github.com/mickyals/goes2zarr/blob/main/convert_goes_to_zarr.py

Labels: performance (Potential issues with Zarr performance (I/O, memory, etc.))