refactor v3 data types #2874

Merged
merged 164 commits into from
Jun 16, 2025
f5e3f78
modernize typing
d-v-b Feb 21, 2025
b4e71e2
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Feb 24, 2025
3c50f54
lint
d-v-b Feb 24, 2025
d74e7a4
new dtypes
d-v-b Feb 26, 2025
5000dcb
rename base dtype, change type to kind
d-v-b Feb 26, 2025
9cd5c51
start working on JSON serialization
d-v-b Feb 27, 2025
042fac1
get json de/serialization largely working, and start making tests pass
d-v-b Feb 27, 2025
556e390
tweak json type guards
d-v-b Feb 27, 2025
b588f70
fix dtype sizes, adjust fill value parsing in from_dict, fix tests
d-v-b Feb 27, 2025
4ed41c6
mid-refactor commit
d-v-b Mar 2, 2025
1b2c773
working form for dtype classes
d-v-b Mar 2, 2025
24930b3
remove unused code
d-v-b Mar 2, 2025
703e0e1
use wrap / unwrap instead of to_dtype / from_dtype; push into v2 code…
d-v-b Mar 2, 2025
3c232a4
push into v2
d-v-b Mar 3, 2025
b7fe986
remove endianness kwarg to methods, make it an instance variable instead
d-v-b Mar 3, 2025
d9b44b4
make wrapping safe by default
d-v-b Mar 4, 2025
bf24d69
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 4, 2025
c1a8566
dtype-specific tests
d-v-b Mar 4, 2025
2868994
more tests, fix void type default value logic
d-v-b Mar 5, 2025
9ab0b1e
fix dtype mechanics in bytescodec
d-v-b Mar 5, 2025
e9f5e26
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 5, 2025
6df84a9
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Mar 7, 2025
e14279d
remove __post_init__ magic in favor of more explicit declaration
d-v-b Mar 7, 2025
381a264
fix tests
d-v-b Mar 9, 2025
6a7857b
refactor data types
d-v-b Mar 12, 2025
e8fd72c
start design doc
d-v-b Mar 13, 2025
b22f324
more design doc
d-v-b Mar 13, 2025
b7a231e
update docs
d-v-b Mar 13, 2025
7dfcd0f
fix sphinx warnings
d-v-b Mar 13, 2025
706e6b6
tweak docs
d-v-b Mar 13, 2025
8fbf673
info about v3 data types
d-v-b Mar 13, 2025
e9aff64
adjust note
d-v-b Mar 13, 2025
44e78f5
fix: use unparametrized types in direct assignment
d-v-b Mar 13, 2025
60cac04
start fixing config
d-v-b Mar 17, 2025
120df57
Update src/zarr/core/_info.py
d-v-b Mar 17, 2025
0d9922b
add placeholder disclaimer to v3 data types summary
d-v-b Mar 17, 2025
2075952
make example runnable
d-v-b Mar 17, 2025
44369d6
placeholder section for adding a custom dtype
d-v-b Mar 17, 2025
4f3381f
define native data type and native scalar
d-v-b Mar 17, 2025
c8d7680
update data type names
d-v-b Mar 17, 2025
2a7b5a8
fix config test failures
d-v-b Mar 17, 2025
e855e54
call to_dtype once in blosc evolve_from_array_spec
d-v-b Mar 17, 2025
a2da99a
refactor dtypewrapper -> zdtype
d-v-b Mar 19, 2025
5ea3fa4
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 19, 2025
cbb159d
update code examples in docs; remove native endianness
d-v-b Mar 19, 2025
c506d09
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 19, 2025
bb11867
adjust type annotations
d-v-b Mar 20, 2025
7a619e0
fix info tests to use zdtype
d-v-b Mar 20, 2025
ea2d0bf
remove dead code and add code coverage exemption to zarr format checks
d-v-b Mar 20, 2025
042c9e5
fix: add special check for resolving int32 on windows
d-v-b Mar 20, 2025
def5eb2
add dtype entry point test
d-v-b Mar 20, 2025
1b7273b
remove default parameters for parametric dtypes; add mixin classes fo…
d-v-b Mar 21, 2025
60b2e9d
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 21, 2025
83f508c
Update docs/user-guide/data_types.rst
d-v-b Mar 24, 2025
4ceb6ed
refactor: use inheritance to remove boilerplate in dtype definitions
d-v-b Mar 24, 2025
5b9cff0
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
65f0453
Merge branch 'main' into feat/fixed-length-strings
d-v-b Mar 24, 2025
cb0a7d4
update data types documentation, and expose core/dtype module to autodoc
d-v-b Mar 24, 2025
40f0063
Merge branch 'feat/fixed-length-strings' of github.com:d-v-b/zarr-pyt…
d-v-b Mar 24, 2025
9989c64
add failing endianness round-trip test
d-v-b Mar 24, 2025
a276c84
fix endianness
d-v-b Mar 24, 2025
6285739
additional check in test_explicit_endianness
d-v-b Mar 24, 2025
e9241b9
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Mar 24, 2025
2bffe1a
add failing test for round-tripping vlen strings
d-v-b Mar 24, 2025
aa32271
route object dtype arrays to vlen string dtype when numpy > 2
d-v-b Mar 25, 2025
617d3f0
relax endianness mismatch to a warning instead of an error
d-v-b Mar 25, 2025
2b5fd8f
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
1831f20
use public dtype module for docs instead of special-casing the core d…
d-v-b Mar 25, 2025
a427a16
silence mypy error about array indexing
d-v-b Mar 25, 2025
41d7e58
add release note
d-v-b Mar 25, 2025
c08ffd9
fix doctests, excluding config tests
d-v-b Mar 25, 2025
778d740
revert addition of linkage between dtype endianness and bytes codec e…
d-v-b Mar 26, 2025
269215e
remove Any types
d-v-b Mar 26, 2025
8af0ce4
add docstring for wrapper module
d-v-b Mar 26, 2025
df60d05
simplify config and docs
d-v-b Mar 26, 2025
7f54bbf
update config test
d-v-b Mar 26, 2025
be83f03
fix S dtype test for v2
d-v-b Mar 26, 2025
3979746
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Apr 28, 2025
a210f9f
fully remove v3jsonencoder
d-v-b Apr 28, 2025
8fbd29a
refactor dtype module structure
d-v-b Apr 29, 2025
afc9872
add timedelta64
d-v-b Apr 29, 2025
e1bf901
refactor time dtypes
d-v-b Apr 30, 2025
45f0c88
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 1, 2025
890077e
widen dtype test strategies
d-v-b May 1, 2025
a3f05f0
modify structured dtype fill value rt to avoid to_dict
d-v-b May 2, 2025
4788f05
wip: begin creating isomorphic test suite for dtypes
d-v-b May 2, 2025
d3f9204
finish common tests
d-v-b May 2, 2025
fdf17e3
wip: test infrastructure for dtypes
d-v-b May 7, 2025
4afa42a
wip: use class-based tests for all dtypes
d-v-b May 7, 2025
4990803
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 7, 2025
1458aad
fill out more tests, and adjust sized dtypes
d-v-b May 8, 2025
9673997
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 8, 2025
aa11df4
wip: json schema test
d-v-b May 12, 2025
f706b46
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 12, 2025
52518c2
add casting tests
d-v-b May 13, 2025
4ab1c58
use relative link for changes
d-v-b May 13, 2025
e4c89f3
typo
d-v-b May 13, 2025
e386c2b
make bytes codec dtype logic a bit more literate
d-v-b May 13, 2025
703192c
increase deadline to 500ms
d-v-b May 13, 2025
0fab5e5
fewer commented sections of problematic lru_store_cache section of th…
d-v-b May 13, 2025
2f945bf
add link to gh issue about lru_cache for sharding codec
d-v-b May 13, 2025
63a6af4
attempt to speed up hypothesis tests by reducing max array size
d-v-b May 13, 2025
56e7c84
clean up docs
d-v-b May 13, 2025
eee0d7b
remove placeholder
d-v-b May 13, 2025
1dc8e72
make final example section doctested and more readable
d-v-b May 13, 2025
13ca230
revert change to auto chunking
d-v-b May 13, 2025
2a42205
revert quotation of literal type
d-v-b May 13, 2025
3f775c8
lint
d-v-b May 13, 2025
5320a77
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 13, 2025
b525b8e
fix broken code block
d-v-b May 13, 2025
ec94878
specialize test to handle stringdtype changes coming in numpy 2.3
d-v-b May 13, 2025
3af98aa
add docstring to _TestZDType class
d-v-b May 13, 2025
6388203
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
6ef7924
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
1329c69
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 15, 2025
d8c3672
type hints
d-v-b May 15, 2025
3f4d87a
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 16, 2025
d8a382a
expand changelog
d-v-b May 16, 2025
9aa751b
tweak docstring
d-v-b May 16, 2025
e4a0372
support v3 nan strings in JSON for float dtypes
d-v-b May 19, 2025
8a976d6
revert removal of metadata chunk grid attribute
d-v-b May 21, 2025
be0d2df
use none to denote default fill value; remove old structured tests; u…
d-v-b May 22, 2025
8c90d2c
add item size abstraction
d-v-b May 22, 2025
0fc653f
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b May 22, 2025
7c58f7a
rename fixed-length string dtypes, and be strict about the numpy obje…
d-v-b May 22, 2025
3a21845
remove vestigial use of to_dtype().itemsize()
d-v-b May 22, 2025
ce0afe3
remove another vestigial use of to_dtype().itemsize()
d-v-b May 22, 2025
e67d4dc
emit warning about unstable dtype when serializing Structured dtype t…
d-v-b May 23, 2025
4e2a157
put string dtypes in the strings module
d-v-b May 24, 2025
a1deda6
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 24, 2025
528a942
make tests isomorphic to source code
d-v-b May 24, 2025
c9c8181
remove old string logic
d-v-b May 25, 2025
1cb7734
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b May 26, 2025
d80d565
use scale_factor and unit in cast_value for datetime
d-v-b May 26, 2025
7806563
add regression testing against v2.18
d-v-b May 27, 2025
39219fa
truncate U and S scalars in _cast_value_unsafe
d-v-b May 27, 2025
4a7a550
docstrings and simplification for regression tests
d-v-b May 27, 2025
807c585
changes necessary for linting with regression tests
d-v-b May 27, 2025
5150d60
improve method names, refactor type hints with typeddictionaries, fix…
d-v-b May 29, 2025
9ddbe97
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b May 29, 2025
d6535d6
fix storage info discrepancy in docs
d-v-b May 29, 2025
42e14ef
fix docstring that was troubling sphinx
d-v-b May 29, 2025
3991406
wip: add vlen-bytes
d-v-b May 29, 2025
d7da3d9
add vlen-bytes
d-v-b May 29, 2025
c3c3288
Merge branch 'main' of github.com:zarr-developers/zarr-python into fe…
d-v-b Jun 2, 2025
d1feaee
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 5, 2025
3ef138a
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jun 6, 2025
1f767e4
replace placeholder text with links to a github issue
d-v-b Jun 6, 2025
cf55041
refactor fixed-length bytes dtypes
d-v-b Jun 6, 2025
24b6b35
more v3 unstable dtype warnings, and their exemptions from tests
d-v-b Jun 6, 2025
7f099a2
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
bf7e2c5
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
cbb0b0d
clean up typeddicts
d-v-b Jun 7, 2025
8f3aa68
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 7, 2025
e885869
update docstrings
d-v-b Jun 9, 2025
63de7c4
Update docs/user-guide/data_types.rst
d-v-b Jun 11, 2025
b069d36
refactor wrapper to allow subclasses to freely define their own type …
d-v-b Jun 13, 2025
ae36dbf
Merge branch 'main' of https://github.com/zarr-developers/zarr-python…
d-v-b Jun 13, 2025
a1f2c94
Merge branch 'feat/fixed-length-strings' of https://github.com/d-v-b/…
d-v-b Jun 13, 2025
b2e56c8
make method definition order consistent
d-v-b Jun 14, 2025
d26b695
allow structured scalars to be np.void
d-v-b Jun 14, 2025
49f0062
use a common function signature for from_json by packing the object_c…
d-v-b Jun 15, 2025
70da4da
fix dtype doc example
d-v-b Jun 15, 2025
16b4ac6
Merge branch 'main' into feat/fixed-length-strings
d-v-b Jun 16, 2025
9 changes: 9 additions & 0 deletions changes/2874.feature.rst
@@ -0,0 +1,9 @@
Adds zarr-specific data type classes. This replaces the internal use of numpy data types for zarr
v2 and a fixed set of string enums for zarr v3. This change is largely internal, but it does
change the type of the ``dtype`` and ``data_type`` fields on the ``ArrayV2Metadata`` and
``ArrayV3Metadata`` classes. It also changes the JSON metadata representation of the
variable-length string data type, but the old metadata representation can still be
used when reading arrays. The logic for automatically choosing the chunk encoding for a given data
type has also changed, and this necessitated changes to the ``config`` API.

For more on this new feature, see the `documentation </user-guide/data_types.html>`_
14 changes: 7 additions & 7 deletions docs/user-guide/arrays.rst
@@ -182,7 +182,7 @@ which can be used to print useful diagnostics, e.g.::
>>> z.info
Type : Array
Zarr format : 3
-Data type : DataType.int32
+Data type : Int32(endianness='little')
Fill value : 0
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
@@ -200,7 +200,7 @@ prints additional diagnostics, e.g.::
>>> z.info_complete()
Type : Array
Zarr format : 3
-Data type : DataType.int32
+Data type : Int32(endianness='little')
Fill value : 0
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
@@ -248,7 +248,7 @@ built-in delta filter::
The default compressor can be changed by setting the value of the using Zarr's
:ref:`user-guide-config`, e.g.::

->>> with zarr.config.set({'array.v2_default_compressor.numeric': {'id': 'blosc'}}):
+>>> with zarr.config.set({'array.v2_default_compressor.default': {'id': 'blosc'}}):
... z = zarr.create_array(store={}, shape=(100000000,), chunks=(1000000,), dtype='int32', zarr_format=2)
>>> z.filters
()
@@ -288,7 +288,7 @@ Here is an example using a delta filter with the Blosc compressor::
>>> z.info
Type : Array
Zarr format : 3
-Data type : DataType.int32
+Data type : Int32(endianness='little')
Fill value : 0
Shape : (10000, 10000)
Chunk shape : (1000, 1000)
@@ -603,7 +603,7 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
>>> a.info_complete()
Type : Array
Zarr format : 3
-Data type : DataType.uint8
+Data type : UInt8()
Fill value : 0
Shape : (10000, 10000)
Shard shape : (1000, 1000)
@@ -612,10 +612,10 @@ Sharded arrays can be created by providing the ``shards`` parameter to :func:`za
Read-only : False
Store type : LocalStore
Filters : ()
-Serializer : BytesCodec(endian=<Endian.little: 'little'>)
+Serializer : BytesCodec(endian=None)
Compressors : (ZstdCodec(level=0, checksum=False),)
No. bytes : 100000000 (95.4M)
-No. bytes stored : 3981552
+No. bytes stored : 3981473
Storage ratio : 25.1
Shards Initialized : 100

59 changes: 25 additions & 34 deletions docs/user-guide/config.rst
@@ -43,39 +43,30 @@ This is the current default configuration::

>>> zarr.config.pprint()
{'array': {'order': 'C',
'v2_default_compressor': {'bytes': {'checksum': False,
Contributor

If I manually set the config to this old default value (which I could do in the current v3 branch), does it work properly after this PR? I guess the bigger question here is, are there any breaking changes to what is/isn't allowed in the config with this PR?

Contributor Author

no, the config in this PR has undergone breaking changes compared to main. We could make those changes backwards-compatible and add deprecation warnings to deprecated keys but this will require some effort.

Contributor

Okay, in that case the release notes definitely need expanding a lot to explain what the breaking changes are.

Contributor

My two cents on breaking changes is we should definitely deprecate where possible, because v3 was already a big breaking change that users (well, at least me 😄 ) are struggling to get used to, so to have more breaking changes without deprecations and migration paths would not be great.

Contributor Author

Agreed, we just need to sketch out how to do deprecations and migrations in our (terrible, IMO) config API

Contributor Author

"terrible" is an exaggeration -- our config API works today, but it has some flaws that make me think it should be overhauled

  • it's untyped
  • it uses raw python dictionaries, so we are missing a dynamic layer for adding indirection / deprecation warnings, etc

I'm not sure how many of these things can be addressed within the scope of donfig itself?

'id': 'zstd',
'level': 0},
'numeric': {'checksum': False,
'id': 'zstd',
'level': 0},
'string': {'checksum': False,
'v2_default_compressor': {'default': {'checksum': False,
'id': 'zstd',
'level': 0}},
'v2_default_filters': {'bytes': [{'id': 'vlen-bytes'}],
'numeric': None,
'raw': None,
'string': [{'id': 'vlen-utf8'}]},
'v3_default_compressors': {'bytes': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'numeric': [{'configuration': {'checksum': False,
'level': 0},
'variable-length-string': {'checksum': False,
'id': 'zstd',
'level': 0}},
'v2_default_filters': {'default': None,
'variable-length-string': [{'id': 'vlen-utf8'}]},
'v3_default_compressors': {'default': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}],
'string': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}]},
'v3_default_filters': {'bytes': [], 'numeric': [], 'string': []},
'v3_default_serializer': {'bytes': {'name': 'vlen-bytes'},
'numeric': {'configuration': {'endian': 'little'},
'name': 'bytes'},
'string': {'name': 'vlen-utf8'}},
'write_empty_chunks': False},
'async': {'concurrency': 10, 'timeout': None},
'buffer': 'zarr.core.buffer.cpu.Buffer',
'codec_pipeline': {'batch_size': 1,
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
'variable-length-string': [{'configuration': {'checksum': False,
'level': 0},
'name': 'zstd'}]},
'v3_default_filters': {'default': [], 'variable-length-string': []},
'v3_default_serializer': {'default': {'configuration': {'endian': 'little'},
'name': 'bytes'},
'variable-length-string': {'name': 'vlen-utf8'}},
'write_empty_chunks': False},
'async': {'concurrency': 10, 'timeout': None},
'buffer': 'zarr.core.buffer.cpu.Buffer',
'codec_pipeline': {'batch_size': 1,
'path': 'zarr.core.codec_pipeline.BatchedCodecPipeline'},
'codecs': {'blosc': 'zarr.codecs.blosc.BloscCodec',
'bytes': 'zarr.codecs.bytes.BytesCodec',
'crc32c': 'zarr.codecs.crc32c_.Crc32cCodec',
'endian': 'zarr.codecs.bytes.BytesCodec',
@@ -85,7 +76,7 @@ This is the current default configuration::
'vlen-bytes': 'zarr.codecs.vlen_utf8.VLenBytesCodec',
'vlen-utf8': 'zarr.codecs.vlen_utf8.VLenUTF8Codec',
'zstd': 'zarr.codecs.zstd.ZstdCodec'},
'default_zarr_format': 3,
'json_indent': 2,
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
'threading': {'max_workers': None}}
'default_zarr_format': 3,
'json_indent': 2,
'ndbuffer': 'zarr.core.buffer.cpu.NDBuffer',
'threading': {'max_workers': None}}
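The per-category keys in this diff change from ``numeric`` / ``string`` / ``bytes`` / ``raw`` to ``default`` / ``variable-length-string``. As a rough sketch of the kind of deprecation layer discussed in the review thread above, a plain dictionary could be migrated as follows. The helper ``migrate_category_keys``, the rename table, and the decision to drop ``bytes`` and ``raw`` are assumptions made for illustration; none of this is part of Zarr Python or donfig.

```python
import warnings

# Assumed rename table, inferred from the old and new defaults in this diff.
_RENAMED = {"numeric": "default", "string": "variable-length-string"}
# Old keys with no obvious counterpart in the new layout (an assumption).
_REMOVED = {"bytes", "raw"}


def migrate_category_keys(section: dict) -> dict:
    """Return a copy of one config section that uses the new key names.

    Hypothetical helper: explicit new-style keys take precedence over
    values carried over from deprecated keys.
    """
    migrated: dict = {}
    for key, value in section.items():
        if key in _RENAMED:
            new_key = _RENAMED[key]
            warnings.warn(
                f"config key {key!r} is deprecated; use {new_key!r}",
                DeprecationWarning,
                stacklevel=2,
            )
            # Keep an explicit new-style value if one was already seen.
            migrated.setdefault(new_key, value)
        elif key in _REMOVED:
            warnings.warn(
                f"config key {key!r} is no longer used and was dropped",
                DeprecationWarning,
                stacklevel=2,
            )
        else:
            # New-style or unknown keys pass through and override migrated ones.
            migrated[key] = value
    return migrated


old_filters = {
    "bytes": [{"id": "vlen-bytes"}],
    "numeric": None,
    "raw": None,
    "string": [{"id": "vlen-utf8"}],
}
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    new_filters = migrate_category_keys(old_filters)
print(new_filters)
print(len(caught), "deprecation warnings")
```

With the old ``v2_default_filters`` default as input, the sketch reproduces the shape of the new default shown in the diff, while emitting one warning per deprecated key.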
6 changes: 3 additions & 3 deletions docs/user-guide/consolidated_metadata.rst
@@ -47,7 +47,7 @@ that can be used.:
>>> from pprint import pprint
>>> pprint(dict(sorted(consolidated_metadata.items())))
{'a': ArrayV3Metadata(shape=(1,),
-data_type=<DataType.float64: 'float64'>,
+data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(1,)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
@@ -60,7 +60,7 @@ that can be used.:
node_type='array',
storage_transformers=()),
'b': ArrayV3Metadata(shape=(2, 2),
-data_type=<DataType.float64: 'float64'>,
+data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(2, 2)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
@@ -73,7 +73,7 @@ that can be used.:
node_type='array',
storage_transformers=()),
'c': ArrayV3Metadata(shape=(3, 3, 3),
-data_type=<DataType.float64: 'float64'>,
+data_type=Float64(endianness='little'),
chunk_grid=RegularChunkGrid(chunk_shape=(3, 3, 3)),
chunk_key_encoding=DefaultChunkKeyEncoding(name='default',
separator='/'),
172 changes: 172 additions & 0 deletions docs/user-guide/data_types.rst
@@ -0,0 +1,172 @@
Data types
Member

This file is a super useful read. I'm wondering what to do with it though. Were you thinking it would go under the Advanced Topics section in the user guide?

Contributor Author

No strong opinion from me. IMO our docs right now are not the most logically organized, so I anticipate some churn there in any case.

==========

Zarr's data type model
----------------------

Every Zarr array has a "data type", which defines the meaning and physical layout of the
array's elements. As Zarr Python is tightly integrated with `NumPy <https://numpy.org/doc/stable/>`_,
it's easy to create arrays with NumPy data types:

.. code-block:: python

>>> import zarr
>>> import numpy as np
>>> z = zarr.create_array(store={}, shape=(10,), dtype=np.dtype('uint8'))
>>> z
<Array memory:... shape=(10,) dtype=uint8>

Unlike NumPy arrays, Zarr arrays are designed to be accessed by Zarr
implementations in different programming languages. This means Zarr data types must be interpreted
correctly when clients read an array. Each Zarr data type defines procedures for
encoding and decoding both the data type itself and scalars of that data type to and from Zarr array
metadata. These serialization procedures depend on the Zarr format.

Data types in Zarr version 2
-----------------------------

Version 2 of the Zarr format defined its data types relative to
`NumPy's data types <https://numpy.org/doc/2.1/reference/arrays.dtypes.html#data-type-objects-dtype>`_,
and added a few non-NumPy data types as well. Thus the JSON identifier for a NumPy-compatible data
type is just the NumPy ``str`` attribute of that data type:

.. code-block:: python

>>> import zarr
>>> import numpy as np
>>> import json
>>>
>>> store = {}
>>> np_dtype = np.dtype('int64')
>>> z = zarr.create_array(store=store, shape=(1,), dtype=np_dtype, zarr_format=2)
>>> dtype_meta = json.loads(store['.zarray'].to_bytes())["dtype"]
>>> dtype_meta
'<i8'
>>> assert dtype_meta == np_dtype.str

.. note::
The ``<`` character in the data type metadata encodes the
`endianness <https://numpy.org/doc/2.2/reference/generated/numpy.dtype.byteorder.html>`_,
or "byte order", of the data type. Following NumPy's example,
in Zarr version 2 each data type has an endianness where applicable.
However, Zarr version 3 data types do not store endianness information.
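
To make the structure of a Zarr V2 identifier concrete, the following sketch splits a simple (non-structured) NumPy-style type string into its byte order, kind, and size using only the standard library. The function name ``parse_v2_dtype_str`` and the layout of the returned dictionary are illustrative assumptions, not part of Zarr Python's API.

```python
import re

# Byte-order characters used by NumPy type strings.
_BYTE_ORDER = {"<": "little", ">": "big", "|": "not applicable"}

# One byte-order character, one kind character, then a decimal size.
_SIMPLE_DTYPE = re.compile(r"^([<>|])([a-zA-Z])(\d+)$")


def parse_v2_dtype_str(dtype_str: str) -> dict:
    """Split a simple V2 type string such as '<i8' into its parts.

    Hypothetical helper for illustration. It does not handle structured
    dtypes or unit suffixes such as '<M8[s]'. For most kinds the size is
    a byte count; for the 'U' kind it is a number of characters.
    """
    match = _SIMPLE_DTYPE.match(dtype_str)
    if match is None:
        raise ValueError(f"not a simple V2 type string: {dtype_str!r}")
    order, kind, size = match.groups()
    return {"byte_order": _BYTE_ORDER[order], "kind": kind, "size": int(size)}


print(parse_v2_dtype_str("<i8"))
print(parse_v2_dtype_str("|i1"))
```

Single-byte types such as ``"|i1"`` use the ``|`` character because byte order has no meaning for them.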

In addition to defining a representation of the data type itself (which in the example above was
just a simple string ``"<i8"``), Zarr also
defines a metadata representation for scalars associated with each data type. This is necessary
because Zarr arrays have a ``JSON``-serializable ``fill_value`` attribute that defines a scalar value to use when reading
uninitialized chunks of a Zarr array.
Integer and float scalars are stored as ``JSON`` numbers, except for special floats like ``NaN``,
positive infinity, and negative infinity, which are stored as strings.

More broadly, each Zarr data type defines its own rules for how scalars of that type are stored in
``JSON``.
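
As a minimal sketch of the float rule described above, the functions below encode a Python float as a ``JSON``-compatible value and decode it again. The spellings ``"NaN"``, ``"Infinity"`` and ``"-Infinity"`` are assumed to be the ones used in Zarr metadata; the function names are hypothetical and this is not Zarr Python's implementation.

```python
import json
import math
from typing import Union

# Assumed string spellings for the special float values.
_SPECIAL = {"NaN": math.nan, "Infinity": math.inf, "-Infinity": -math.inf}


def float_to_json(value: float) -> Union[float, str]:
    """Encode a float as a JSON number, or as a string for special values."""
    if math.isnan(value):
        return "NaN"
    if math.isinf(value):
        return "Infinity" if value > 0 else "-Infinity"
    return value


def float_from_json(data: Union[float, int, str]) -> float:
    """Decode a JSON number or one of the special strings into a float."""
    if isinstance(data, str):
        if data not in _SPECIAL:
            raise ValueError(f"unrecognized float string: {data!r}")
        return _SPECIAL[data]
    return float(data)


fill_values = [0.5, math.inf, -math.inf, math.nan]
encoded = json.dumps([float_to_json(v) for v in fill_values])
print(encoded)
decoded = [float_from_json(v) for v in json.loads(encoded)]
print(decoded)
```

The string encoding is needed because standard ``JSON`` has no literal for ``NaN`` or the infinities.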


Data types in Zarr version 3
-----------------------------

Zarr V3 brings several key changes to how data types are represented:

- Zarr V3 identifies the basic data types as strings like ``"int8"``, ``"int16"``, etc.

By contrast, Zarr V2 uses the NumPy character code representation for data types:
in Zarr V2, ``int8`` is represented as ``"|i1"``.
- A Zarr V3 data type does not have endianness. This is a departure from Zarr V2, where multi-byte
data types are defined with endianness information. Instead, Zarr V3 requires that endianness,
where applicable, is specified in the ``codecs`` attribute of array metadata.
- While some Zarr V3 data types are identified by strings, others can be identified by a ``JSON``
object. For example, consider this specification of a ``datetime`` data type:

.. code-block:: json

{
"name": "numpy.datetime64",
"configuration": {
"unit": "s",
"scale_factor": 10
}
}


Zarr V2 generally uses structured string representations to convey the same information. The
data type given in the previous example would be represented as the string ``">M8[10s]"`` in
Zarr V2 (for big-endian data). This is more compact, but can be harder to parse.

For more about data types in Zarr V3, see the
`V3 specification <https://zarr-specs.readthedocs.io/en/latest/v3/data-types/index.html>`_.
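
Because a Zarr V3 data type can be either a string or a ``JSON`` object, a reader has to normalize both shapes before dispatching on the name. The sketch below shows one way to do that, and also renders the ``datetime`` configuration from the example above as a NumPy-style string. Both function names are hypothetical, and the byte-order argument is an assumption needed because V3 metadata does not carry endianness.

```python
from typing import Union


def parse_v3_data_type(meta: Union[str, dict]) -> tuple:
    """Normalize V3 data type metadata to a (name, configuration) pair."""
    if isinstance(meta, str):
        # Basic data types are bare strings with no configuration.
        return meta, {}
    if isinstance(meta, dict) and isinstance(meta.get("name"), str):
        return meta["name"], dict(meta.get("configuration", {}))
    raise TypeError(f"invalid V3 data type metadata: {meta!r}")


def datetime64_to_numpy_str(configuration: dict, byte_order: str = "<") -> str:
    """Render a datetime64 configuration as a NumPy-style type string.

    The byte order must be supplied by the caller; "<" is an arbitrary default.
    """
    unit = configuration["unit"]
    scale = configuration["scale_factor"]
    # NumPy omits the scale factor when it is 1, e.g. 'M8[s]'.
    prefix = "" if scale == 1 else str(scale)
    return f"{byte_order}M8[{prefix}{unit}]"


print(parse_v3_data_type("int8"))
name, config = parse_v3_data_type(
    {"name": "numpy.datetime64", "configuration": {"unit": "s", "scale_factor": 10}}
)
print(name, config)
print(datetime64_to_numpy_str(config))
```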

Data types in Zarr Python
-------------------------

The two Zarr formats that Zarr Python supports specify data types in two different ways:
data types in Zarr version 2 are encoded as NumPy-compatible strings, while data types in Zarr version
3 are encoded as either strings or ``JSON`` objects.
Unlike Zarr V2 data types, Zarr V3 data types carry no endianness information.
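
The difference can be illustrated by translating a simple V2 type string into a V3 name plus a ``bytes`` codec entry that carries the endianness. This is a sketch under stated assumptions: ``v2_to_v3`` is a hypothetical helper that covers only boolean, integer, float and complex types, and the codec dictionary mirrors the ``{'name': 'bytes', 'configuration': {'endian': ...}}`` shape seen in the default configuration.

```python
# V3 name prefixes for the NumPy kind characters handled by this sketch.
_KIND_PREFIX = {"b": "bool", "i": "int", "u": "uint", "f": "float", "c": "complex"}
_ENDIAN = {"<": "little", ">": "big", "|": None}


def v2_to_v3(dtype_str: str) -> tuple:
    """Translate e.g. '<i8' into ('int64', bytes-codec metadata)."""
    if len(dtype_str) < 3:
        raise ValueError(f"unsupported V2 type string: {dtype_str!r}")
    order, kind, size_text = dtype_str[0], dtype_str[1], dtype_str[2:]
    if order not in _ENDIAN or kind not in _KIND_PREFIX or not size_text.isdigit():
        raise ValueError(f"unsupported V2 type string: {dtype_str!r}")
    size = int(size_text)
    # V3 names count bits, V2 strings count bytes.
    name = "bool" if kind == "b" else f"{_KIND_PREFIX[kind]}{size * 8}"
    codec: dict = {"name": "bytes"}
    if size > 1:
        # Endianness only matters for multi-byte types; in V3 it lives in the codec.
        if _ENDIAN[order] is None:
            raise ValueError(f"multi-byte type without byte order: {dtype_str!r}")
        codec["configuration"] = {"endian": _ENDIAN[order]}
    return name, codec


print(v2_to_v3("<i8"))
print(v2_to_v3("|i1"))
```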

To abstract over these syntactical and semantic differences, Zarr Python uses a class called
`ZDType <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_ to provide Zarr V2 and Zarr V3 compatibility
routines for "native" data types. In this context, a "native" data type is a Python class,
typically defined in another library, that models an array's data type. For example, ``np.uint8`` is a native
data type defined in NumPy, which Zarr Python wraps with a ``ZDType`` instance called
`UInt8 <../api/zarr/dtype/index.html#zarr.dtype.ZDType>`_.

Each data type supported by Zarr Python is modeled by a ``ZDType`` subclass, which provides an
API for the following operations:

- Wrapping / unwrapping a native data type.
- Encoding / decoding a data type to / from Zarr V2 and Zarr V3 array metadata.
- Encoding / decoding a scalar value to / from Zarr V2 and Zarr V3 array metadata.
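
The three operations listed above can be sketched with a toy class that depends only on the standard library. ``ToyInt8`` is a hypothetical stand-in, not Zarr Python's ``Int8``: it treats ``ctypes.c_int8`` as the "native" data type and plain Python integers as scalars, and it borrows only the method names and the V2 / V3 metadata shapes from this page.

```python
import ctypes
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyInt8:
    """Toy illustration of the ZDType idea; NOT Zarr Python's implementation."""

    @classmethod
    def from_native_dtype(cls, native: type) -> "ToyInt8":
        # Wrapping: accept only the native type this class models.
        if native is not ctypes.c_int8:
            raise TypeError(f"cannot wrap {native!r} as ToyInt8")
        return cls()

    def to_native_dtype(self) -> type:
        # Unwrapping: return the native type.
        return ctypes.c_int8

    def to_json(self, zarr_format: int):
        # Data type metadata differs between the two formats.
        if zarr_format == 2:
            return {"name": "|i1", "object_codec_id": None}
        if zarr_format == 3:
            return "int8"
        raise ValueError(f"unsupported zarr_format: {zarr_format}")

    def to_json_scalar(self, value: int, zarr_format: int) -> int:
        # Integers are stored as JSON numbers in both formats.
        if not -128 <= value <= 127:
            raise ValueError(f"{value} does not fit in int8")
        return int(value)

    def from_json_scalar(self, data: int, zarr_format: int) -> int:
        if isinstance(data, bool) or not isinstance(data, int):
            raise TypeError(f"expected a JSON integer, got {data!r}")
        return self.to_json_scalar(data, zarr_format)


toy = ToyInt8.from_native_dtype(ctypes.c_int8)
print(toy.to_json(zarr_format=2))
print(toy.to_json(zarr_format=3))
print(toy.from_json_scalar(toy.to_json_scalar(42, zarr_format=3), zarr_format=3))
```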


Example Usage
~~~~~~~~~~~~~

Create a ``ZDType`` from a native data type:

.. code-block:: python

>>> from zarr.core.dtype import Int8
>>> import numpy as np
>>> int8 = Int8.from_native_dtype(np.dtype('int8'))

Convert back to native data type:

.. code-block:: python

>>> native_dtype = int8.to_native_dtype()
>>> assert native_dtype == np.dtype('int8')

Get the default scalar value for the data type:

.. code-block:: python

>>> default_value = int8.default_scalar()
>>> assert default_value == np.int8(0)


Serialize to JSON for Zarr V2 and V3:

.. code-block:: python

>>> json_v2 = int8.to_json(zarr_format=2)
>>> json_v2
{'name': '|i1', 'object_codec_id': None}
>>> json_v3 = int8.to_json(zarr_format=3)
>>> json_v3
'int8'

Serialize a scalar value to JSON:

.. code-block:: python

>>> json_value = int8.to_json_scalar(42, zarr_format=3)
>>> json_value
42

Deserialize a scalar value from JSON:

.. code-block:: python

>>> scalar_value = int8.from_json_scalar(42, zarr_format=3)
>>> assert scalar_value == np.int8(42)
Contributor

I was expecting to find a section called "defining a custom data type." But we can do that in a follow-up PR.

Contributor

+1 on this, would be great to understand how structured dtypes can now be used!

Our current test failures with this branch mainly revolve around them

Contributor Author

can you show me those test failures?

Contributor

@ilan-gold Jun 11, 2025

https://github.com/scverse/anndata/actions/runs/15413857020/job/43873874823?pr=1995 There are also some errors around stores being append only but I think those are unrelated (need to check).

Quick repro:

import numpy as np, pandas as pd, zarr
from string import ascii_letters

def gen_vstr_recarray(m, n, dtype=None):
    size = m * n
    lengths = np.random.randint(3, 5, size)
    letters = np.array(list(ascii_letters))
    gen_word = lambda l: "".join(np.random.choice(letters, l))
    arr = np.array([gen_word(l) for l in lengths]).reshape(m, n)
    return pd.DataFrame(arr, columns=[gen_word(5) for i in range(n)]).to_records(
        index=False, column_dtypes=dtype
    )
elem = gen_vstr_recarray(6, 5)
f = zarr.open("foo.zarr")
f.create_array("rec", shape=elem.shape, dtype=elem.dtype)

yielding KeyError: '|V40' in the details here:

```pytb
---------------------------------------------------------------------------
KeyError Traceback (most recent call last)
Cell In[3], line 15
13 elem = gen_vstr_recarray(6, 5)
14 f = zarr.open("foo.zarr")
---> 15 f.create_array("rec", shape=elem.shape, dtype=elem.dtype)

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/_compat.py:43, in _deprecate_positional_args.<locals>._inner_deprecate_positional_args.<locals>.inner_f(*args, **kwargs)
41 extra_args = len(args) - len(all_args)
42 if extra_args <= 0:
---> 43 return f(*args, **kwargs)
45 # extra_args > 0
46 args_msg = [
47 f"{name}={arg}"
48 for name, arg in zip(kwonly_args[:extra_args], args[-extra_args:], strict=False)
49 ]

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/group.py:2523, in Group.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
2428 """Create an array within this group.
2429
2430 This method lightly wraps :func:zarr.core.array.create_array.
(...) 2517 AsyncArray
2518 """
2519 compressors = _parse_deprecated_compressor(
2520 compressor, compressors, zarr_format=self.metadata.zarr_format
2521 )
2522 return Array(
-> 2523 self._sync(
2524 self._async_group.create_array(
2525 name=name,
2526 shape=shape,
2527 dtype=dtype,
2528 chunks=chunks,
2529 shards=shards,
2530 fill_value=fill_value,
2531 attributes=attributes,
2532 chunk_key_encoding=chunk_key_encoding,
2533 compressors=compressors,
2534 serializer=serializer,
2535 dimension_names=dimension_names,
2536 order=order,
2537 filters=filters,
2538 overwrite=overwrite,
2539 storage_options=storage_options,
2540 config=config,
2541 )
2542 )
2543 )

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
206 # TODO: refactor this to to take *args and **kwargs and pass those to the method
207 # this should allow us to better type the sync wrapper
--> 208 return sync(
209 coroutine,
210 timeout=config.get("async.timeout"),
211 )

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
160 return_result = next(iter(finished)).result()
162 if isinstance(return_result, BaseException):
--> 163 raise return_result
164 else:
165 return return_result

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/sync.py:119, in _runner(coro)
114 """
115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
116 exception, the exception will be returned.
117 """
118 try:
--> 119 return await coro
120 except Exception as ex:
121 return ex

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/group.py:1111, in AsyncGroup.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
1016 """Create an array within this group.
1017
1018 This method lightly wraps :func:zarr.core.array.create_array.
(...) 1106
1107 """
1108 compressors = _parse_deprecated_compressor(
1109 compressor, compressors, zarr_format=self.metadata.zarr_format
1110 )
-> 1111 return await create_array(
1112 store=self.store_path,
1113 name=name,
1114 shape=shape,
1115 dtype=dtype,
1116 chunks=chunks,
1117 shards=shards,
1118 filters=filters,
1119 compressors=compressors,
1120 serializer=serializer,
1121 fill_value=fill_value,
1122 order=order,
1123 zarr_format=self.metadata.zarr_format,
1124 attributes=attributes,
1125 chunk_key_encoding=chunk_key_encoding,
1126 dimension_names=dimension_names,
1127 storage_options=storage_options,
1128 overwrite=overwrite,
1129 config=config,
1130 )

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/array.py:4456, in create_array(store, name, shape, dtype, data, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config, write_data)
4451 mode: Literal["a"] = "a"
4453 store_path = await make_store_path(
4454 store, path=name, mode=mode, storage_options=storage_options
4455 )
-> 4456 return await init_array(
4457 store_path=store_path,
4458 shape=shape_parsed,
4459 dtype=dtype_parsed,
4460 chunks=chunks,
4461 shards=shards,
4462 filters=filters,
4463 compressors=compressors,
4464 serializer=serializer,
4465 fill_value=fill_value,
4466 order=order,
4467 zarr_format=zarr_format,
4468 attributes=attributes,
4469 chunk_key_encoding=chunk_key_encoding,
4470 dimension_names=dimension_names,
4471 overwrite=overwrite,
4472 config=config,
4473 )

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/array.py:4242, in init_array(store_path, shape, dtype, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, overwrite, config)
4230 meta = AsyncArray._create_metadata_v2(
4231 shape=shape_parsed,
4232 dtype=dtype_parsed,
(...) 4239 attributes=attributes,
4240 )
4241 else:
-> 4242 array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
4243 compressors=compressors,
4244 filters=filters,
4245 serializer=serializer,
4246 dtype=dtype_parsed,
4247 )
4248 sub_codecs = cast("tuple[Codec, ...]", (*array_array, array_bytes, *bytes_bytes))
4249 codecs_out: tuple[Codec, ...]

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/array.py:4684, in _parse_chunk_encoding_v3(compressors, filters, serializer, dtype)
4674 def _parse_chunk_encoding_v3(
4675 *,
4676 compressors: CompressorsLike,
(...) 4679 dtype: np.dtype[Any],
4680 ) -> tuple[tuple[ArrayArrayCodec, ...], ArrayBytesCodec, tuple[BytesBytesCodec, ...]]:
4681 """
4682 Generate chunk encoding classes for v3 arrays with optional defaults.
4683 """
-> 4684 default_array_array, default_array_bytes, default_bytes_bytes = _get_default_chunk_encoding_v3(
4685 dtype
4686 )
4688 if filters is None:
4689 out_array_array: tuple[ArrayArrayCodec, ...] = ()

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/array.py:4590, in _get_default_chunk_encoding_v3(np_dtype)
4584 def _get_default_chunk_encoding_v3(
4585 np_dtype: np.dtype[Any],
4586 ) -> tuple[tuple[ArrayArrayCodec, ...], ArrayBytesCodec, tuple[BytesBytesCodec, ...]]:
4587 """
4588 Get the default ArrayArrayCodecs, ArrayBytesCodec, and BytesBytesCodec for a given dtype.
4589 """
-> 4590 dtype = DataType.from_numpy(np_dtype)
4591 if dtype == DataType.string:
4592 dtype_key = "string"

File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/metadata/v3.py:705, in DataType.from_numpy(cls, dtype)
687 return DataType.string
688 dtype_to_data_type = {
689 "|b1": "bool",
690 "bool": "bool",
(...) 703 "<c16": "complex128",
704 }
--> 705 return DataType[dtype_to_data_type[dtype.str]]

KeyError: '|V40'
```

</details>
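
The failure above comes down to a string-keyed lookup on `dtype.str`. The following is a minimal sketch of why a structured dtype misses such a table; `DTYPE_TO_DATA_TYPE` and `lookup` are hypothetical names that reproduce only a few of the entries visible in the traceback, and the five `<i8` fields are an assumption chosen to give a 40-byte record.

```python
import numpy as np

# Hypothetical stand-in for the string-keyed table shown in the traceback;
# only a few of its entries are reproduced here.
DTYPE_TO_DATA_TYPE = {
    "|b1": "bool",
    "<i8": "int64",
    "<c16": "complex128",
}

# Assumption: five 8-byte fields, giving a 40-byte record.
structured = np.dtype([(f"col{i}", "<i8") for i in range(5)])

# A structured dtype reports an opaque void type code rather than its fields.
print(structured.str)       # |V40
print(structured.itemsize)  # 40
print(structured.names)     # the field names are only available here


def lookup(dtype: np.dtype) -> str:
    """Map a NumPy dtype to a data type name, with a clearer error on a miss."""
    try:
        return DTYPE_TO_DATA_TYPE[dtype.str]
    except KeyError:
        raise TypeError(
            f"unsupported dtype {dtype!r} (type code {dtype.str!r})"
        ) from None


print(lookup(np.dtype("<i8")))  # int64
```

Because the type code encodes only the record size, no finite table of strings can cover structured dtypes; they have to be handled by inspecting `dtype.fields` instead.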

Contributor Author

thanks! I will try to get this tested and fixed in this PR

Contributor

Seems to have worked!

Contributor

@d-v-b The only thing is that you need to specify VLenUtf8 as a filter for the v2 file format case.

Contributor

@ilan-gold ilan-gold Jun 13, 2025

https://github.com/scverse/anndata/pull/1995/files Everything that isn't in the unit tests is what we had to change, compared to before, to work with this PR. And structured dtypes are a go for us with v3; things seem to be working.

We are still having some backwards-compatibility issues. I think I had previously posted about them: v2 file format fill values for strings were allowed to be 0. I'm adding the test data here:

import zarr

store = zarr.open(zarr.storage.ZipStore('/path_to/adata.zarr.zip', read_only=True), mode="r")
store["obs"]["__categories/cat_ordered"]
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[27], line 1
----> 1 store["obs"]["__categories/cat_ordered"]

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:1860, in Group.__getitem__(self, path)
   1833 def __getitem__(self, path: str) -> Array | Group:
   1834     """Obtain a group member.
   1835 
   1836     Parameters
   (...)   1858 
   1859     """
-> 1860     obj = self._sync(self._async_group.getitem(path))
   1861     if isinstance(obj, AsyncArray):
   1862         return Array(obj)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
    205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
    206     # TODO: refactor this to to take *args and **kwargs and pass those to the method
    207     # this should allow us to better type the sync wrapper
--> 208     return sync(
    209         coroutine,
    210         timeout=config.get("async.timeout"),
    211     )

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
    160 return_result = next(iter(finished)).result()
    162 if isinstance(return_result, BaseException):
--> 163     raise return_result
    164 else:
    165     return return_result

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:119, in _runner(coro)
    114 """
    115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
    116 exception, the exception will be returned.
    117 """
    118 try:
--> 119     return await coro
    120 except Exception as ex:
    121     return ex

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:692, in AsyncGroup.getitem(self, key)
    690     return self._getitem_consolidated(store_path, key, prefix=self.name)
    691 try:
--> 692     return await get_node(
    693         store=store_path.store, path=store_path.path, zarr_format=self.metadata.zarr_format
    694     )
    695 except FileNotFoundError as e:
    696     raise KeyError(key) from e

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:3621, in get_node(store, path, zarr_format)
   3619 match zarr_format:
   3620     case 2:
-> 3621         return await _get_node_v2(store=store, path=path)
   3622     case 3:
   3623         return await _get_node_v3(store=store, path=path)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:3576, in _get_node_v2(store, path)
   3561 async def _get_node_v2(store: Store, path: str) -> AsyncArray[ArrayV2Metadata] | AsyncGroup:
   3562     """
   3563     Read a Zarr v2 AsyncArray or AsyncGroup from a path in a Store.
   3564 
   (...)   3574     AsyncArray | AsyncGroup
   3575     """
-> 3576     metadata = await _read_metadata_v2(store=store, path=path)
   3577     return _build_node(store=store, path=path, metadata=metadata)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:3468, in _read_metadata_v2(store, path)
   3465     else:
   3466         zmeta = json.loads(zgroup_bytes.to_bytes())
-> 3468 return _build_metadata_v2(zmeta, zattrs)

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:3524, in _build_metadata_v2(zarr_json, attrs_json)
   3522 match zarr_json:
   3523     case {"shape": _}:
-> 3524         return ArrayV2Metadata.from_dict(zarr_json | {"attributes": attrs_json})
   3525     case _:  # pragma: no cover
   3526         return GroupMetadata.from_dict(zarr_json | {"attributes": attrs_json})

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/metadata/v2.py:163, in ArrayV2Metadata.from_dict(cls, data)
    161 fill_value_encoded = _data.get("fill_value")
    162 if fill_value_encoded is not None:
--> 163     fill_value = dtype.from_json_scalar(fill_value_encoded, zarr_format=2)
    164     _data["fill_value"] = fill_value
    166 # zarr v2 allowed arbitrary keys here.
    167 # We don't want the ArrayV2Metadata constructor to fail just because someone put an
    168 # extra key in the metadata.

File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/dtype/npy/string.py:281, in VariableLengthUTF8.from_json_scalar(self, data, zarr_format)
    277 """
    278 Strings pass through
    279 """
    280 if not check_json_str(data):
--> 281     raise TypeError(f"Invalid type: {data}. Expected a string.")
    282 return data

TypeError: Invalid type: 0. Expected a string.

Test data: adata.zarr.zip
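
For illustration, here is a minimal sketch of a reader that tolerates the legacy integer fill value reported above. The function name `parse_v2_string_fill_value` is hypothetical, and converting `0` to `"0"` with `str()` is an assumption made for the example; the thread does not state how the actual fix in this PR treats the value.

```python
from typing import Any


def parse_v2_string_fill_value(data: Any) -> str:
    """Hypothetical lenient reader for a Zarr v2 string fill value.

    Assumption: legacy integer fill values (such as 0) are converted with
    str(); this is one possible policy, not necessarily the library's.
    """
    # bool is a subclass of int, so reject it before the int check.
    if isinstance(data, bool):
        raise TypeError(f"Invalid type: {data!r}. Expected a string or an integer.")
    if isinstance(data, str):
        return data
    if isinstance(data, int):
        return str(data)
    raise TypeError(f"Invalid type: {data!r}. Expected a string or an integer.")


print(parse_v2_string_fill_value("missing"))  # missing
print(parse_v2_string_fill_value(0))          # 0
```

The point of the sketch is only that the v2 read path has to accept more than strict JSON strings if older stores are to remain readable.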

Contributor Author

this should now be fixed, could you try it again?

Contributor

yes will point at main now @d-v-b thanks!

4 changes: 2 additions & 2 deletions docs/user-guide/groups.rst
@@ -128,7 +128,7 @@ property. E.g.::
>>> bar.info_complete()
Type : Array
Zarr format : 3
-Data type : DataType.int64
+Data type : Int64(endianness='little')
Fill value : 0
Shape : (1000000,)
Chunk shape : (100000,)
@@ -145,7 +145,7 @@ property. E.g.::
>>> baz.info
Type : Array
Zarr format : 3
-Data type : DataType.float32
+Data type : Float32(endianness='little')
Fill value : 0.0
Shape : (1000, 1000)
Chunk shape : (100, 100)
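
The new `Data type` lines in this diff read like the repr of an object that carries its byte order as a field. A minimal sketch of that idea follows, using hypothetical stand-in classes (these are not zarr's own `Int64` and `Float32`) and a hypothetical `info_line` helper.

```python
from dataclasses import dataclass
from typing import Literal


@dataclass(frozen=True)
class Int64:
    """Hypothetical stand-in for a data type object; not zarr's class."""

    endianness: Literal["little", "big"] = "little"


@dataclass(frozen=True)
class Float32:
    """Hypothetical stand-in for a data type object; not zarr's class."""

    endianness: Literal["little", "big"] = "little"


def info_line(dtype: object) -> str:
    """Format one line of an info report from the dtype's repr."""
    return f"Data type : {dtype!r}"


print(info_line(Int64()))
print(info_line(Float32(endianness="big")))
```

Keeping endianness on the instance means two arrays that differ only in byte order get distinguishable data type objects, which an enum member such as `DataType.int64` could not express.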
1 change: 1 addition & 0 deletions docs/user-guide/index.rst
@@ -8,6 +8,7 @@ User guide

installation
arrays
+data_types
groups
attributes
storage