refactor v3 data types #2874

Merged
merged 164 commits into from
Jun 16, 2025
Changes from 11 commits
f5e3f78
modernize typing
d-v-b Feb 21, 2025
b4e71e2
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Feb 24, 2025
3c50f54
lint
d-v-b Feb 24, 2025
d74e7a4
new dtypes
d-v-b Feb 26, 2025
5000dcb
rename base dtype, change type to kind
d-v-b Feb 26, 2025
9cd5c51
start working on JSON serialization
d-v-b Feb 27, 2025
042fac1
get json de/serialization largely working, and start making tests pass
d-v-b Feb 27, 2025
556e390
tweak json type guards
d-v-b Feb 27, 2025
b588f70
fix dtype sizes, adjust fill value parsing in from_dict, fix tests
d-v-b Feb 27, 2025
4ed41c6
mid-refactor commit
d-v-b Mar 2, 2025
1b2c773
working form for dtype classes
d-v-b Mar 2, 2025
24930b3
remove unused code
d-v-b Mar 2, 2025
703e0e1
use wrap / unwrap instead of to_dtype / from_dtype; push into v2 code…
d-v-b Mar 2, 2025
3c232a4
push into v2
d-v-b Mar 3, 2025
b7fe986
remove endianness kwarg to methods, make it an instance variable instead
d-v-b Mar 3, 2025
d9b44b4
make wrapping safe by default
d-v-b Mar 4, 2025
bf24d69
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 4, 2025
c1a8566
dtype-specific tests
d-v-b Mar 4, 2025
2868994
more tests, fix void type default value logic
d-v-b Mar 5, 2025
9ab0b1e
fix dtype mechanics in bytescodec
d-v-b Mar 5, 2025
e9f5e26
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 5, 2025
6df84a9
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Mar 7, 2025
e14279d
remove __post_init__ magic in favor of more explicit declaration
d-v-b Mar 7, 2025
381a264
fix tests
d-v-b Mar 9, 2025
6a7857b
refactor data types
d-v-b Mar 12, 2025
e8fd72c
start design doc
d-v-b Mar 13, 2025
b22f324
more design doc
d-v-b Mar 13, 2025
b7a231e
update docs
d-v-b Mar 13, 2025
7dfcd0f
fix sphinx warnings
d-v-b Mar 13, 2025
706e6b6
tweak docs
d-v-b Mar 13, 2025
8fbf673
info about v3 data types
d-v-b Mar 13, 2025
e9aff64
adjust note
d-v-b Mar 13, 2025
44e78f5
fix: use unparametrized types in direct assignment
d-v-b Mar 13, 2025
60cac04
start fixing config
d-v-b Mar 17, 2025
120df57
Update src/zarr/core/_info.py
d-v-b Mar 17, 2025
0d9922b
add placeholder disclaimer to v3 data types summary
d-v-b Mar 17, 2025
2075952
make example runnable
d-v-b Mar 17, 2025
44369d6
placeholder section for adding a custom dtype
d-v-b Mar 17, 2025
4f3381f
define native data type and native scalar
d-v-b Mar 17, 2025
c8d7680
update data type names
d-v-b Mar 17, 2025
2a7b5a8
fix config test failures
d-v-b Mar 17, 2025
e855e54
call to_dtype once in blosc evolve_from_array_spec
d-v-b Mar 17, 2025
a2da99a
refactor dtypewrapper -> zdtype
d-v-b Mar 19, 2025
5ea3fa4
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 19, 2025
cbb159d
update code examples in docs; remove native endianness
d-v-b Mar 19, 2025
c506d09
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 19, 2025
bb11867
adjust type annotations
d-v-b Mar 20, 2025
7a619e0
fix info tests to use zdtype
d-v-b Mar 20, 2025
ea2d0bf
remove dead code and add code coverage exemption to zarr format checks
d-v-b Mar 20, 2025
042c9e5
fix: add special check for resolving int32 on windows
d-v-b Mar 20, 2025
def5eb2
add dtype entry point test
d-v-b Mar 20, 2025
1b7273b
remove default parameters for parametric dtypes; add mixin classes fo…
d-v-b Mar 21, 2025
60b2e9d
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 21, 2025
83f508c
Update docs/user-guide/data_types.rst
d-v-b Mar 24, 2025
4ceb6ed
refactor: use inheritance to remove boilerplate in dtype definitions
d-v-b Mar 24, 2025
5b9cff0
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
65f0453
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 24, 2025
cb0a7d4
update data types documentation, and expose core/dtype module to autodoc
d-v-b Mar 24, 2025
40f0063
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
9989c64
add failing endianness round-trip test
d-v-b Mar 24, 2025
a276c84
fix endianness
d-v-b Mar 24, 2025
6285739
additional check in test_explicit_endianness
d-v-b Mar 24, 2025
e9241b9
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 24, 2025
2bffe1a
add failing test for round-tripping vlen strings
d-v-b Mar 24, 2025
aa32271
route object dtype arrays to vlen string dtype when numpy > 2
d-v-b Mar 25, 2025
617d3f0
relax endianness mismatch to a warning instead of an error
d-v-b Mar 25, 2025
2b5fd8f
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
1831f20
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
a427a16
silence mypy error about array indexing
d-v-b Mar 25, 2025
41d7e58
add release note
d-v-b Mar 25, 2025
c08ffd9
fix doctests, excluding config tests
d-v-b Mar 25, 2025
778d740
revert addition of linkage between dtype endianness and bytes codec e…
d-v-b Mar 26, 2025
269215e
remove Any types
d-v-b Mar 26, 2025
8af0ce4
add docstring for wrapper module
d-v-b Mar 26, 2025
df60d05
simplify config and docs
d-v-b Mar 26, 2025
7f54bbf
update config test
d-v-b Mar 26, 2025
be83f03
fix S dtype test for v2
d-v-b Mar 26, 2025
3979746
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Apr 28, 2025
a210f9f
fully remove v3jsonencoder
d-v-b Apr 28, 2025
8fbd29a
refactor dtype module structure
d-v-b Apr 29, 2025
afc9872
add timedelta64
d-v-b Apr 29, 2025
e1bf901
refactor time dtypes
d-v-b Apr 30, 2025
45f0c88
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 1, 2025
890077e
widen dtype test strategies
d-v-b May 1, 2025
a3f05f0
modify structured dtype fill value rt to avoid to_dict
d-v-b May 2, 2025
4788f05
wip: begin creating isomorphic test suite for dtypes
d-v-b May 2, 2025
d3f9204
finish common tests
d-v-b May 2, 2025
fdf17e3
wip: test infrastructure for dtypes
d-v-b May 7, 2025
4afa42a
wip: use class-based tests for all dtypes
d-v-b May 7, 2025
4990803
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 7, 2025
1458aad
fill out more tests, and adjust sized dtypes
d-v-b May 8, 2025
9673997
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 8, 2025
aa11df4
wip: json schema test
d-v-b May 12, 2025
f706b46
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 12, 2025
52518c2
add casting tests
d-v-b May 13, 2025
4ab1c58
use relative link for changes
d-v-b May 13, 2025
e4c89f3
typo
d-v-b May 13, 2025
e386c2b
make bytes codec dtype logic a bit more literate
d-v-b May 13, 2025
703192c
increase deadline to 500ms
d-v-b May 13, 2025
0fab5e5
fewer commented sections of problematic lru_store_cache section of th…
d-v-b May 13, 2025
2f945bf
add link to gh issue about lru_cache for sharding codec
d-v-b May 13, 2025
63a6af4
attempt to speed up hypothesis tests by reducing max array size
d-v-b May 13, 2025
56e7c84
clean up docs
d-v-b May 13, 2025
eee0d7b
remove placeholder
d-v-b May 13, 2025
1dc8e72
make final example section doctested and more readable
d-v-b May 13, 2025
13ca230
revert change to auto chunking
d-v-b May 13, 2025
2a42205
revert quotation of literal type
d-v-b May 13, 2025
3f775c8
lint
d-v-b May 13, 2025
5320a77
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 13, 2025
b525b8e
fix broken code block
d-v-b May 13, 2025
ec94878
specialize test to handle stringdtype changes coming in numpy 2.3
d-v-b May 13, 2025
3af98aa
add docstring to _TestZDType class
d-v-b May 13, 2025
6388203
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
6ef7924
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
1329c69
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
d8c3672
type hints
d-v-b May 15, 2025
3f4d87a
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 16, 2025
d8a382a
expand changelog
d-v-b May 16, 2025
9aa751b
tweak docstring
d-v-b May 16, 2025
e4a0372
support v3 nan strings in JSON for float dtypes
d-v-b May 19, 2025
8a976d6
revert removal of metadata chunk grid attribute
d-v-b May 21, 2025
be0d2df
use none to denote default fill value; remove old structured tests; u…
d-v-b May 22, 2025
8c90d2c
add item size abstraction
d-v-b May 22, 2025
0fc653f
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b May 22, 2025
7c58f7a
rename fixed-length string dtypes, and be strict about the numpy obje…
d-v-b May 22, 2025
3a21845
remove vestigial use of to_dtype().itemsize()
d-v-b May 22, 2025
ce0afe3
remove another vestigial use of to_dtype().itemsize()
d-v-b May 22, 2025
e67d4dc
emit warning about unstable dtype when serializing Structured dtype t…
d-v-b May 23, 2025
4e2a157
put string dtypes in the strings module
d-v-b May 24, 2025
a1deda6
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 24, 2025
528a942
make tests isomorphic to source code
d-v-b May 24, 2025
c9c8181
remove old string logic
d-v-b May 25, 2025
1cb7734
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 26, 2025
d80d565
use scale_factor and unit in cast_value for datetime
d-v-b May 26, 2025
7806563
add regression testing against v2.18
d-v-b May 27, 2025
39219fa
truncate U and S scalars in _cast_value_unsafe
d-v-b May 27, 2025
4a7a550
docstrings and simplification for regression tests
d-v-b May 27, 2025
807c585
changes necessary for linting with regression tests
d-v-b May 27, 2025
5150d60
improve method names, refactor type hints with typeddictionaries, fix…
d-v-b May 29, 2025
9ddbe97
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b May 29, 2025
d6535d6
fix storage info discrepancy in docs
d-v-b May 29, 2025
42e14ef
fix docstring that was troubling sphinx
d-v-b May 29, 2025
3991406
wip: add vlen-bytes
d-v-b May 29, 2025
d7da3d9
add vlen-bytes
d-v-b May 29, 2025
c3c3288
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jun 2, 2025
d1feaee
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 5, 2025
3ef138a
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jun 6, 2025
1f767e4
replace placeholder text with links to a github issue
d-v-b Jun 6, 2025
cf55041
refactor fixed-length bytes dtypes
d-v-b Jun 6, 2025
24b6b35
more v3 unstable dtype warnings, and their exemptions from tests
d-v-b Jun 6, 2025
7f099a2
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
bf7e2c5
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
cbb0b0d
clean up typeddicts
d-v-b Jun 7, 2025
8f3aa68
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
e885869
update docstrings
d-v-b Jun 9, 2025
63de7c4
Update docs/user-guide/data_types.rst
d-v-b Jun 11, 2025
b069d36
refactor wrapper to allow subclasses to freely define their own type …
d-v-b Jun 13, 2025
ae36dbf
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jun 13, 2025
a1f2c94
Merge branch 'feat/fixed-length-strings' of https://github.com/d-v-b/…
d-v-b Jun 13, 2025
b2e56c8
make method definition order consistent
d-v-b Jun 14, 2025
d26b695
allow structured scalars to be np.void
d-v-b Jun 14, 2025
49f0062
use a common function signature for from_json by packing the object_c…
d-v-b Jun 15, 2025
70da4da
fix dtype doc example
d-v-b Jun 15, 2025
16b4ac6
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 16, 2025
2 changes: 1 addition & 1 deletion src/zarr/api/asynchronous.py
@@ -982,7 +982,7 @@
     if zarr_format == 2:
         if chunks is None:
             chunks = shape
-        dtype = parse_dtype(dtype, zarr_format)
+        dtype = parse_dtype(dtype, zarr_format=zarr_format)
         if not filters:
             filters = _default_filters(dtype)
         if not compressor:
7 changes: 4 additions & 3 deletions src/zarr/codecs/sharding.py
@@ -355,9 +355,10 @@ def __init__(
         object.__setattr__(self, "index_location", index_location_parsed)

         # Use instance-local lru_cache to avoid memory leaks
-        object.__setattr__(self, "_get_chunk_spec", lru_cache()(self._get_chunk_spec))
-        object.__setattr__(self, "_get_index_chunk_spec", lru_cache()(self._get_index_chunk_spec))
-        object.__setattr__(self, "_get_chunks_per_shard", lru_cache()(self._get_chunks_per_shard))
+        # TODO: fix these when we don't get hashability errors for certain numpy dtypes
Contributor

Is this something that needs fixing before this PR is merged?

+        # object.__setattr__(self, "_get_chunk_spec", lru_cache()(self._get_chunk_spec))
+        # object.__setattr__(self, "_get_index_chunk_spec", lru_cache()(self._get_index_chunk_spec))
+        # object.__setattr__(self, "_get_chunks_per_shard", lru_cache()(self._get_chunks_per_shard))

     # todo: typedict return type
     def __getstate__(self) -> dict[str, Any]:
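The hashability problem named in the TODO can be reproduced without zarr or numpy. The sketch below is an illustration only: `ChunkSpecCache` and `UnhashableSpec` are hypothetical names, and `UnhashableSpec` merely stands in for whatever dtype object triggered the error in this PR. It shows the instance-local `lru_cache` pattern from the removed lines and the `TypeError` that `functools.lru_cache` raises when an argument cannot be hashed.

```python
from functools import lru_cache


class UnhashableSpec:
    """Hypothetical stand-in for an argument that cannot be hashed."""

    __hash__ = None  # explicitly mark instances as unhashable

    def __init__(self, name: str) -> None:
        self.name = name


class ChunkSpecCache:
    """Hypothetical class that uses the instance-local lru_cache pattern."""

    def __init__(self) -> None:
        # Wrap the bound method so that each instance owns its own cache.
        self.get_spec = lru_cache()(self._get_spec)

    def _get_spec(self, spec: object) -> str:
        return f"spec for {getattr(spec, 'name', spec)}"


cache = ChunkSpecCache()

# Hashable arguments are cached as usual.
first = cache.get_spec("int8")
second = cache.get_spec("int8")
hits = cache.get_spec.cache_info().hits

# Unhashable arguments fail while the cache key is built,
# before the wrapped method is ever called.
try:
    cache.get_spec(UnhashableSpec("structured"))
    error = None
except TypeError as exc:
    error = exc
```

Because the failure happens inside the cache wrapper, the cached methods cannot be used at all with such arguments, which is consistent with the PR commenting the wrappers out rather than patching individual call sites.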
6 changes: 4 additions & 2 deletions src/zarr/core/_info.py
@@ -7,7 +7,9 @@

 from zarr.abc.codec import ArrayArrayCodec, ArrayBytesCodec, BytesBytesCodec
 from zarr.core.common import ZarrFormat
-from zarr.core.metadata.v3 import DataType
+from zarr.core.metadata.dtype import DTypeWrapper
+
+# from zarr.core.metadata.v3 import DataType


 @dataclasses.dataclass(kw_only=True)
@@ -78,7 +80,7 @@ class ArrayInfo:

     _type: Literal["Array"] = "Array"
     _zarr_format: ZarrFormat
-    _data_type: np.dtype[Any] | DataType
+    _data_type: np.dtype[Any] | DTypeWrapper
     _shape: tuple[int, ...]
     _shard_shape: tuple[int, ...] | None = None
     _chunk_shape: tuple[int, ...] | None = None
38 changes: 22 additions & 16 deletions src/zarr/core/array.py
@@ -98,19 +98,21 @@
     ArrayV3MetadataDict,
     T_ArrayMetadata,
 )
+from zarr.core.metadata.dtype import DTypeWrapper
 from zarr.core.metadata.v2 import (
     _default_compressor,
     _default_filters,
     parse_compressor,
     parse_filters,
 )
-from zarr.core.metadata.v3 import DataType, parse_node_type_array
+from zarr.core.metadata.v3 import parse_node_type_array
 from zarr.core.sync import sync
 from zarr.errors import MetadataValidationError
 from zarr.registry import (
     _parse_array_array_codec,
     _parse_array_bytes_codec,
     _parse_bytes_bytes_codec,
+    get_data_type_from_numpy,
     get_pipeline_class,
 )
 from zarr.storage._common import StorePath, ensure_no_existing_node, make_store_path
@@ -578,7 +580,7 @@
         """
         store_path = await make_store_path(store)

-        dtype_parsed = parse_dtype(dtype, zarr_format)
+        dtype_parsed = parse_dtype(dtype, zarr_format=zarr_format)
         shape = parse_shapelike(shape)

         if chunks is not None and chunk_shape is not None:
@@ -677,7 +679,7 @@
         """

         shape = parse_shapelike(shape)
-        codecs = list(codecs) if codecs is not None else _get_default_codecs(np.dtype(dtype))
+        codecs = list(codecs) if codecs is not None else _get_default_codecs(dtype)
         chunk_key_encoding_parsed: ChunkKeyEncodingLike
         if chunk_key_encoding is None:
             chunk_key_encoding_parsed = {"name": "default", "separator": "/"}
@@ -691,13 +693,23 @@
                 category=UserWarning,
                 stacklevel=2,
             )
+
+        # resolve the numpy dtype into zarr v3 datatype
+        zarr_data_type = get_data_type_from_numpy(dtype)
+
+        if fill_value is None:
+            # v3 spec will not allow a null fill value
+            fill_value_parsed = zarr_data_type.default_value
+        else:
+            fill_value_parsed = fill_value
Member

should (or are) we be casting this scalar somewhere?

Contributor Author

Good point, we can use dtype.cast_value() here.
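A minimal sketch of the behaviour discussed in this thread, written without zarr so that it runs on its own. `Int8Like`, `resolve_fill_value`, and the range check are hypothetical; only the names `default_value` and `cast_value` come from the diff and the comment above, and their real signatures may differ.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Int8Like:
    """Hypothetical stand-in for a zarr data type wrapping an 8-bit signed integer."""

    default_value: int = 0

    def cast_value(self, value: Any) -> int:
        # Convert to int and reject values the data type cannot represent.
        result = int(value)
        if not -128 <= result <= 127:
            raise ValueError(f"{value!r} does not fit in an 8-bit signed integer")
        return result


def resolve_fill_value(data_type: Int8Like, fill_value: Any | None) -> int:
    # Zarr v3 does not allow a null fill value, so None falls back to the default.
    if fill_value is None:
        return data_type.default_value
    # Cast user input so the stored fill value matches the data type.
    return data_type.cast_value(fill_value)
```

With a cast in place, a fill value such as `"7"` would be normalised to `7`, and a value outside the representable range would be rejected when the metadata is created instead of surfacing later.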


         chunk_grid_parsed = RegularChunkGrid(chunk_shape=chunk_shape)
         return ArrayV3Metadata(
             shape=shape,
-            data_type=dtype,
+            data_type=zarr_data_type,
             chunk_grid=chunk_grid_parsed,
             chunk_key_encoding=chunk_key_encoding_parsed,
-            fill_value=fill_value,
+            fill_value=fill_value_parsed,
             codecs=codecs,
             dimension_names=tuple(dimension_names) if dimension_names else None,
             attributes=attributes or {},
@@ -1682,7 +1694,7 @@
     def _info(
         self, count_chunks_initialized: int | None = None, count_bytes_stored: int | None = None
     ) -> Any:
-        _data_type: np.dtype[Any] | DataType
+        _data_type: np.dtype[Any] | DTypeWrapper
         if isinstance(self.metadata, ArrayV2Metadata):
             _data_type = self.metadata.dtype
         else:
@@ -4203,17 +4215,11 @@
     """
     Get the default ArrayArrayCodecs, ArrayBytesCodec, and BytesBytesCodec for a given dtype.
     """
-    dtype = DataType.from_numpy(np_dtype)
-    if dtype == DataType.string:
-        dtype_key = "string"
-    elif dtype == DataType.bytes:
-        dtype_key = "bytes"
-    else:
-        dtype_key = "numeric"
+    dtype = get_data_type_from_numpy(np_dtype)

-    default_filters = zarr_config.get("array.v3_default_filters").get(dtype_key)
-    default_serializer = zarr_config.get("array.v3_default_serializer").get(dtype_key)
-    default_compressors = zarr_config.get("array.v3_default_compressors").get(dtype_key)
+    default_filters = zarr_config.get("array.v3_default_filters").get(dtype.kind)
+    default_serializer = zarr_config.get("array.v3_default_serializer").get(dtype.kind)
+    default_compressors = zarr_config.get("array.v3_default_compressors").get(dtype.kind)

     filters = tuple(_parse_array_array_codec(codec_dict) for codec_dict in default_filters)
     serializer = _parse_array_bytes_codec(default_serializer)
4 changes: 2 additions & 2 deletions src/zarr/core/common.py
@@ -19,7 +19,7 @@
 import numpy as np

 from zarr.core.config import config as zarr_config
-from zarr.core.strings import _STRING_DTYPE
+from zarr.core.strings import _VLEN_STRING_DTYPE

 if TYPE_CHECKING:
     from collections.abc import Awaitable, Callable, Iterator
@@ -173,7 +173,7 @@
             # special case as object
             return np.dtype("object")
         else:
-            return _STRING_DTYPE
+            return _VLEN_STRING_DTYPE
     return np.dtype(dtype)


6 changes: 5 additions & 1 deletion src/zarr/core/config.py
@@ -88,13 +88,17 @@ def enable_gpu(self) -> ConfigSet:
                     "bytes": [{"id": "vlen-bytes"}],
                     "raw": None,
                 },
-                "v3_default_filters": {"numeric": [], "string": [], "bytes": []},
+                "v3_default_filters": {"boolean": [], "numeric": [], "string": [], "bytes": []},
                 "v3_default_serializer": {
+                    "boolean": {"name": "bytes", "configuration": {"endian": "little"}},
                     "numeric": {"name": "bytes", "configuration": {"endian": "little"}},
                     "string": {"name": "vlen-utf8"},
                     "bytes": {"name": "vlen-bytes"},
                 },
                 "v3_default_compressors": {
+                    "boolean": [
+                        {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
+                    ],
                     "numeric": [
                         {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
                     ],
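To illustrate why the `boolean` entries matter: the `_get_default_codecs` change in `src/zarr/core/array.py` looks defaults up by `dtype.kind`, and a kind with no entry yields `None` from `.get`. The sketch below copies the default values from this hunk into plain dictionaries. `DEFAULTS` and `default_codecs_for` are hypothetical names, and the string and bytes compressor entries, which this hunk does not show, are left out.

```python
from typing import Any

# Values copied from the hunk above; only the kinds visible there are included.
DEFAULTS: dict[str, dict[str, Any]] = {
    "v3_default_filters": {"boolean": [], "numeric": [], "string": [], "bytes": []},
    "v3_default_serializer": {
        "boolean": {"name": "bytes", "configuration": {"endian": "little"}},
        "numeric": {"name": "bytes", "configuration": {"endian": "little"}},
        "string": {"name": "vlen-utf8"},
        "bytes": {"name": "vlen-bytes"},
    },
    "v3_default_compressors": {
        "boolean": [{"name": "zstd", "configuration": {"level": 0, "checksum": False}}],
        "numeric": [{"name": "zstd", "configuration": {"level": 0, "checksum": False}}],
    },
}


def default_codecs_for(kind: str) -> tuple[Any, Any, Any]:
    """Return (filters, serializer, compressors) for a data type kind."""
    filters = DEFAULTS["v3_default_filters"].get(kind)
    serializer = DEFAULTS["v3_default_serializer"].get(kind)
    compressors = DEFAULTS["v3_default_compressors"].get(kind)
    return filters, serializer, compressors


boolean_defaults = default_codecs_for("boolean")
missing_defaults = default_codecs_for("unknown")
```

An unknown kind returns `None` for all three lookups, which a caller that iterates over the filters would then fail on. That is one reason each supported kind needs a key in every mapping.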
3 changes: 3 additions & 0 deletions src/zarr/core/dtype/__init__.py
@@ -0,0 +1,3 @@
from zarr.core.dtype.core import ZarrDType

__all__ = ["ZarrDType"]
196 changes: 196 additions & 0 deletions src/zarr/core/dtype/core.py
@@ -0,0 +1,196 @@
"""
# Overview

This module provides a proof-of-concept standalone interface for managing dtypes in the zarr-python codebase.

The `ZarrDType` class introduced in this module effectively acts as a replacement for `np.dtype` throughout the
zarr-python codebase. It attempts to encapsulate the runtime information needed to work with dtypes in the
context of the Zarr V3 specification (e.g. whether a dtype is part of the core specification, how many bytes
it occupies, and its endianness). By providing this abstraction, the module aims to:

- Simplify dtype management within zarr-python
- Support runtime flexibility and custom extensions
- Remove unnecessary dependencies on the numpy API

## Extensibility

The module attempts to support user-driven extensions, allowing developers to introduce custom dtypes
without requiring immediate changes to zarr-python. Extensions can leverage the current entrypoint mechanism,
enabling integration of experimental features. Over time, widely adopted extensions may be formalized through
inclusion in zarr-python or standardized via a Zarr Enhancement Proposal (ZEP), but this is not essential.

## Examples

### Core `dtype` Registration

The following example demonstrates how to register a built-in `dtype` in the core codebase:

```python
import numpy as np
from zarr.core.dtype import ZarrDType
from zarr.registry import register_v3dtype

class Float16(ZarrDType):
    zarr_spec_format = "3"
    experimental = False
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('float16')

register_v3dtype(Float16)
```

### Entrypoint Extension

The following example demonstrates how users can register a new `bfloat16` dtype for Zarr.
This approach adheres to the existing Zarr entrypoint pattern as much as possible, ensuring
consistency with other extensions. The code below would typically be part of a Python package
that specifies the entrypoints for the extension:

```python
import ml_dtypes
import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Bfloat16(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = "little"
    byte_count = 2
    to_numpy = np.dtype('bfloat16')  # Enabled by importing ml_dtypes
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
        "ml_dtypes_version": "example_value"
    }
```

### dtype lookup

The following examples demonstrate how to perform a lookup for the relevant ZarrDType, given
a string that matches the dtype Zarr specification ID, or a numpy dtype object:

```python
import numpy as np
from zarr.registry import get_v3dtype_class, get_v3dtype_class_from_numpy

get_v3dtype_class('complex64')  # returns little-endian Complex64 ZarrDType
get_v3dtype_class('not_registered_dtype')  # ValueError

get_v3dtype_class_from_numpy('>i2')  # returns big-endian Int16 ZarrDType
get_v3dtype_class_from_numpy(np.dtype('float32'))  # returns little-endian Float32 ZarrDType
get_v3dtype_class_from_numpy('i10')  # ValueError
```

### String dtypes

The following shows one possibility for supporting variable-length strings, using the
entrypoint mechanism as in the previous example. The Apache Arrow specification does not currently
include a dtype for fixed-length strings (only for fixed-length bytes), so `String` here
implicitly refers to variable-length string data. There may be some subtleties with codecs
that mean this needs to be refined further:

```python
import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

try:
    to_numpy = np.dtypes.StringDType()
except AttributeError:
    to_numpy = np.dtypes.ObjectDType()

class String(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = 'little'
    byte_count = None  # None is defined to mean variable
    to_numpy = to_numpy
```

### int4 dtype

There is currently considerable interest in the AI community in 'quantising' models: storing
models at reduced precision while minimising loss of information content. The community uses
a number of sub-byte dtypes, e.g. int4. Unfortunately, numpy does not currently support such
sub-byte dtypes in an easy way. However, they can still be held in a numpy array and then
passed (in a zero-copy way) to something like pytorch, which can handle them appropriately:

```python
import numpy as np
from zarr.core.dtype import ZarrDType  # User inherits from ZarrDType when creating their dtype

class Int4(ZarrDType):
    zarr_spec_format = "3"
    experimental = True
    endianness = None  # _validate() requires None for single-byte types
    byte_count = 1  # this is ugly, but I could change this from byte_count to bit_count if there was consensus
    to_numpy = np.dtype('B')  # could also be np.dtype('V1'), but this would prevent bit-twiddling
    configuration_v3 = {
        "version": "example_value",
        "author": "example_value",
    }
```
"""

from __future__ import annotations

from typing import Any, Literal

import numpy as np


class FrozenClassVariables(type):
    def __setattr__(cls, attr: str, value: object) -> None:
        if hasattr(cls, attr):
            raise ValueError(f"Attribute {attr} on ZarrDType class can not be changed once set.")
        else:
            raise AttributeError(f"'{cls}' object has no attribute '{attr}'")


class ZarrDType(metaclass=FrozenClassVariables):
    zarr_spec_format: Literal["2", "3"]  # the version of the zarr spec used
    experimental: bool  # is this in the core spec or not
    endianness: Literal[
        "big", "little", None
    ]  # None indicates not defined i.e. single byte or byte strings
    byte_count: int | None  # None indicates variable count
    to_numpy: np.dtype[Any]  # may involve installing a numpy extension e.g. ml_dtypes;

    configuration_v3: dict | None  # TODO: understand better how this is recommended by the spec

    _zarr_spec_identifier: str  # implementation detail used to map to core spec

    def __init_subclass__(  # enforces all required fields are set and basic sanity checks
        cls,
        **kwargs,
    ) -> None:
        required_attrs = [
            "zarr_spec_format",
            "experimental",
            "endianness",
            "byte_count",
            "to_numpy",
        ]
        for attr in required_attrs:
            if not hasattr(cls, attr):
                raise ValueError(f"{attr} is a required attribute for a Zarr dtype.")

        if not hasattr(cls, "configuration_v3"):
            cls.configuration_v3 = None

        cls._zarr_spec_identifier = (
            "big_" + cls.__qualname__.lower()
            if cls.endianness == "big"
            else cls.__qualname__.lower()
        )  # how this dtype is identified in core spec; convention is prefix with big_ for big-endian

        cls._validate()  # sanity check on basic requirements

        super().__init_subclass__(**kwargs)

    # TODO: add further checks
    @classmethod
    def _validate(cls):
        if cls.byte_count is not None and cls.byte_count <= 0:
            raise ValueError("byte_count must be a positive integer.")

        if cls.byte_count == 1 and cls.endianness is not None:
            raise ValueError("Endianness must be None for single-byte types.")
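One observation about `FrozenClassVariables` as written: both branches of `__setattr__` raise, so the assignments to `cls.configuration_v3` and `cls._zarr_spec_identifier` inside `__init_subclass__` would themselves fail whenever the attribute is not already set. The sketch below is one possible way to get set-once semantics and is not necessarily the approach the PR settled on. `SetOnceMeta`, `Base`, `configuration`, and `identifier` are hypothetical names. The metaclass permits the first assignment of a name and rejects later ones.

```python
class SetOnceMeta(type):
    """Hypothetical metaclass: class attributes may be set once, then are frozen."""

    def __setattr__(cls, attr: str, value: object) -> None:
        # Only look at attributes defined on this class itself, so a subclass can
        # still set a name that a parent class already defines.
        if attr in cls.__dict__:
            raise ValueError(f"Attribute {attr} on {cls.__name__} can not be changed once set.")
        super().__setattr__(attr, value)


class Base(metaclass=SetOnceMeta):
    def __init_subclass__(cls, **kwargs: object) -> None:
        super().__init_subclass__(**kwargs)
        # The first assignment is allowed, so derived values can be filled in here.
        if "configuration" not in cls.__dict__:
            cls.configuration = None
        cls.identifier = cls.__qualname__.lower()


class Float16(Base):
    byte_count = 2


try:
    Float16.byte_count = 4  # a second assignment is rejected
    error = None
except ValueError as exc:
    error = exc
```

Attributes assigned in the class body never pass through the metaclass `__setattr__`, so only assignments made after class creation are checked.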