
refactor v3 data types #2874


Merged (164 commits) on Jun 16, 2025

Conversation
Conversation

@d-v-b (Contributor) commented Feb 28, 2025

As per #2750, we need a new model of data types if we want to support more of them. Accordingly, this PR refactors data types for the Zarr v3 side of the codebase and makes them extensible. I would also like to handle v2 with the same data structures, and confine the v2/v3 differences to the places where they actually vary.

In main, all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, which are parametrized by a time unit, and plainly unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes are probably the way to go here. Separating the different data types into different classes also gives us a natural way to capture some of the per-data-type variability baked into the spec: each data type class can define its own default value, as well as methods for converting its scalars to and from JSON.
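As a rough illustration of the class-based approach (all names here are hypothetical, not the PR's actual API), a parametric dtype can be a class whose instances carry the parameter, with the default value and scalar JSON round-trip defined per class:

```python
from dataclasses import dataclass

# Hypothetical sketch, not the real ZDType API: one class covers every
# length, so U2, U3, ... are instances rather than separate classes.
@dataclass(frozen=True)
class FixedLengthUnicode:
    length: int  # the parameter a bare enum variant cannot carry

    def default_value(self) -> str:
        return ""

    def to_json_scalar(self, value: str) -> str:
        # truncate to the declared length, mimicking fixed-width semantics
        return value[: self.length]

    def from_json_scalar(self, data: str) -> str:
        return data

u2, u3 = FixedLengthUnicode(2), FixedLengthUnicode(3)
assert u3.to_json_scalar("hello") == "hel"
```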

This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.

@github-actions bot added the "needs release notes" label Feb 28, 2025
@d-v-b (Contributor Author) commented Feb 28, 2025

Copying a comment @nenb made in this Zulip discussion:

> The first thing that caught my eye was that you are using numpy character codes. What was the motivation for this? numpy character codes are not extensible in their current format, and lead to issues like jax-ml/ml_dtypes#41.

A feature of the character code is that it provides a way to distinguish parametric types like U* from parametrized instances of those types (like U3). Defining a class with the character code U means instances of the class can be initialized with a "length" parameter, and then we can make U2, U3, etc. as instances of the same class. If we instead bind a concrete numpy dtype as a class attribute, we need a separate class for each of U2, U3, etc., which is undesirable. I do think I can work around this, but I figured the explanation might be helpful.
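A toy sketch of that pattern (illustrative names only): the class holds the parametric character code, and each instance pairs it with a length to produce a concrete numpy dtype.

```python
import numpy as np

# Illustrative only: one class bound to the parametric code "U" covers
# U2, U3, ... as instances, avoiding a class per length.
class UnicodeDType:
    char_code = "U"  # the parametric type; instances are parametrized

    def __init__(self, length: int):
        self.length = length

    def to_native(self) -> np.dtype:
        # "U" + length yields the concrete numpy dtype, e.g. "U3"
        return np.dtype(f"{self.char_code}{self.length}")

assert UnicodeDType(2).to_native() == np.dtype("U2")
assert UnicodeDType(3).to_native() == np.dtype("U3")
```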

@nenb commented Mar 3, 2025

Summarising from a Zulip discussion:

@nenb: How is the endianness of a dtype handled?

@d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.

Proposed solution: make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

@d-v-b (Contributor Author) commented Mar 4, 2025

> Summarising from a Zulip discussion:
>
> @nenb: How is the endianness of a dtype handled?
>
> @d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype.
>
> Proposed solution: make endianness an attribute on the dtype instance. This will be an implementation detail used by zarr-python to handle endianness, but it won't be part of the dtype on disk (as required by the spec).

Thanks for the summary! I have implemented the proposed solution.
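A minimal sketch of that proposal, with hypothetical class and attribute names (not the PR's actual API): endianness lives on the dtype object for in-memory use, but is omitted from the V3 on-disk form.

```python
from dataclasses import dataclass

import numpy as np

# Hypothetical sketch: the endianness attribute steers the in-memory
# numpy dtype, while the serialized v3 name deliberately omits it.
@dataclass(frozen=True)
class Int16:
    endianness: str = "little"  # implementation detail, never serialized

    def to_native(self) -> np.dtype:
        order = "<" if self.endianness == "little" else ">"
        return np.dtype(f"{order}i2")

    def to_json_v3(self) -> str:
        return "int16"  # spec: no endianness in the on-disk dtype

assert Int16("big").to_native() == np.dtype(">i2")
# Both byte orders serialize identically on disk:
assert Int16("big").to_json_v3() == Int16("little").to_json_v3()
```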

@d-v-b mentioned this pull request Mar 5, 2025
@rabernat (Contributor) left a comment:

I took this for a spin locally. So excited.

Lots of work has gone into this PR. Time to get it out into the wild.

.. code-block:: python

>>> scalar_value = int8.from_json_scalar(42, zarr_format=3)
>>> assert scalar_value == np.int8(42)
Contributor

I was expecting to find a section called "defining a custom data type." But we can do that in a follow-up PR.

Contributor

+1 on this; it would be great to understand how structured dtypes can now be used!

Our current test failures with this branch mainly revolve around them.

Contributor Author

Can you show me those test failures?

@ilan-gold (Contributor) commented Jun 11, 2025

https://github.com/scverse/anndata/actions/runs/15413857020/job/43873874823?pr=1995 There are also some errors around stores being append only but I think those are unrelated (need to check).

Quick repro:

```python
import numpy as np, pandas as pd, zarr
from string import ascii_letters

def gen_vstr_recarray(m, n, dtype=None):
    size = m * n
    lengths = np.random.randint(3, 5, size)
    letters = np.array(list(ascii_letters))
    gen_word = lambda l: "".join(np.random.choice(letters, l))
    arr = np.array([gen_word(l) for l in lengths]).reshape(m, n)
    return pd.DataFrame(arr, columns=[gen_word(5) for i in range(n)]).to_records(
        index=False, column_dtypes=dtype
    )

elem = gen_vstr_recarray(6, 5)
f = zarr.open("foo.zarr")
f.create_array("rec", shape=elem.shape, dtype=elem.dtype)
```

yielding `KeyError: '|V40'`. Abbreviated traceback:

```pytb
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[3], line 15
     13 elem = gen_vstr_recarray(6, 5)
     14 f = zarr.open("foo.zarr")
---> 15 f.create_array("rec", shape=elem.shape, dtype=elem.dtype)

File .../zarr/_compat.py:43, in inner_f(*args, **kwargs)
---> 43     return f(*args, **kwargs)

File .../zarr/core/group.py:2523, in Group.create_array(...)
-> 2523     self._sync(self._async_group.create_array(...))

File .../zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
--> 208     return sync(coroutine, timeout=config.get("async.timeout"))

File .../zarr/core/sync.py:163, in sync(coro, loop, timeout)
--> 163     raise return_result

File .../zarr/core/sync.py:119, in _runner(coro)
--> 119     return await coro

File .../zarr/core/group.py:1111, in AsyncGroup.create_array(...)
-> 1111     return await create_array(...)

File .../zarr/core/array.py:4456, in create_array(...)
-> 4456     return await init_array(...)

File .../zarr/core/array.py:4242, in init_array(...)
-> 4242     array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(...)

File .../zarr/core/array.py:4684, in _parse_chunk_encoding_v3(...)
-> 4684     default_array_array, default_array_bytes, default_bytes_bytes = _get_default_chunk_encoding_v3(dtype)

File .../zarr/core/array.py:4590, in _get_default_chunk_encoding_v3(np_dtype)
-> 4590     dtype = DataType.from_numpy(np_dtype)

File .../zarr/core/metadata/v3.py:705, in DataType.from_numpy(cls, dtype)
--> 705     return DataType[dtype_to_data_type[dtype.str]]

KeyError: '|V40'
```

Contributor Author

thanks! I will try to get this tested and fixed in this PR

Contributor

Seems to have worked!

Contributor

@d-v-b The only thing is that you need to specify VLenUtf8 as a filter for the v2 file format case

@ilan-gold (Contributor) commented Jun 13, 2025

https://github.com/scverse/anndata/pull/1995/files Everything that isn't in the unit tests is what we had to change to work with this PR from before. And structured dtypes are a go for us with v3, things seems to be working.

We still are having some backwards compat issues - I think I had previously posted about them around v2 file format fill values for strings being allowed to be 0. I'm adding the test data here:

```python
import zarr

store = zarr.open(zarr.storage.ZipStore('/path_to/adata.zarr.zip', read_only=True), mode="r")
store["obs"]["__categories/cat_ordered"]
```

Abbreviated traceback:

```pytb
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[27], line 1
----> 1 store["obs"]["__categories/cat_ordered"]

File .../zarr/core/group.py:1860, in Group.__getitem__(self, path)
-> 1860     obj = self._sync(self._async_group.getitem(path))

File .../zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
--> 208     return sync(coroutine, timeout=config.get("async.timeout"))

File .../zarr/core/sync.py:163, in sync(coro, loop, timeout)
--> 163     raise return_result

File .../zarr/core/sync.py:119, in _runner(coro)
--> 119     return await coro

File .../zarr/core/group.py:692, in AsyncGroup.getitem(self, key)
--> 692     return await get_node(store=store_path.store, path=store_path.path, zarr_format=self.metadata.zarr_format)

File .../zarr/core/group.py:3621, in get_node(store, path, zarr_format)
-> 3621     return await _get_node_v2(store=store, path=path)

File .../zarr/core/group.py:3576, in _get_node_v2(store, path)
-> 3576     metadata = await _read_metadata_v2(store=store, path=path)

File .../zarr/core/group.py:3468, in _read_metadata_v2(store, path)
-> 3468     return _build_metadata_v2(zmeta, zattrs)

File .../zarr/core/group.py:3524, in _build_metadata_v2(zarr_json, attrs_json)
-> 3524     return ArrayV2Metadata.from_dict(zarr_json | {"attributes": attrs_json})

File .../zarr/core/metadata/v2.py:163, in ArrayV2Metadata.from_dict(cls, data)
--> 163     fill_value = dtype.from_json_scalar(fill_value_encoded, zarr_format=2)

File .../zarr/core/dtype/npy/string.py:281, in VariableLengthUTF8.from_json_scalar(self, data, zarr_format)
    280 if not check_json_str(data):
--> 281     raise TypeError(f"Invalid type: {data}. Expected a string.")

TypeError: Invalid type: 0. Expected a string.
```

Test data: adata.zarr.zip

Contributor Author

this should now be fixed, could you try it again?

Contributor

yes will point at main now @d-v-b thanks!



```python
# TODO: find a better name for this function
def get_data_type_from_native_dtype(dtype: npt.DTypeLike) -> ZDType[TBaseDType, TBaseScalar]:
```
Contributor

FWIW, I did a bit of manual QC on this function and it worked exactly as I expected it to in every scenario. 🙌

@d-v-b (Contributor Author) commented Jun 11, 2025

thanks @rabernat! I have a final cleanup in the works where I simplify the ZDType base class to make the from_* classmethods more type safe, and after that I will update and hit the merge button.

@ngoldbaum

Hi! I'm a NumPy maintainer who has worked a bit on the new user DType system we released in NumPy 2.0.

@nenb let me know that this PR is a thing and it's getting merged soon.

One of the main things we (the NumPy developers) skipped before releasing user DTypes was adding a way to serialize user DTypes to npy files. That's still an open question.

I wonder if instead of working on improving the npy file format - which is simple on purpose and beholden to not breaking user parser implementations - we can just point people at zarr instead. Maybe even think about adding zarr-python as an optional NumPy dependency to handle more complicated serialization than what npy can handle.

This is all just me offering my personal opinion, not an opinion of anyone else involved with NumPy. Nick told me he's going on vacation for a while but he's interested in chatting more. I'm just commenting publicly because it's exciting to me 😃.

@d-v-b (Contributor Author) commented Jun 15, 2025

Thanks for the interest @ngoldbaum! I think it would be super interesting to evaluate Zarr as a "standard" serialization scheme for NumPy arrays.

One of the main things we (the NumPy developers) skipped before releasing user DTypes was adding a way to serialize user DTypes to npy files. That's still an open question.

This is something Zarr has to address for every dtype, but we operate under some constraints that make things simpler for us: all dtypes have to serialize to JSON, and as of Zarr V3 the structure of the JSON form of a dtype is well-defined (either a string or a dict with restricted keys). But I imagine NumPy might reasonably not want to commit to JSON, which could complicate the serialization story a bit.
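For illustration only, here is what the two JSON shapes described above might look like; the parametric dtype's name and configuration keys below are invented for the example, not taken from the spec:

```python
import json

# Non-parametric dtypes serialize to a bare string.
simple = "int8"

# Parametric dtypes serialize to a dict with restricted keys; the name
# and configuration fields here are illustrative, not normative.
parametric = {"name": "fixed_length_utf32", "configuration": {"length": 3}}

# Both shapes round-trip through JSON losslessly.
assert json.loads(json.dumps(simple)) == simple
assert json.loads(json.dumps(parametric)) == parametric
```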

@d-v-b (Contributor Author) commented Jun 15, 2025

I did a final pass of type safety stuff, and I made a substantial change to how zarr v2 data types are serialized to and from JSON.

As a reminder, Zarr-Python 2 supported multiple distinct array data types that all shared the same dtype field ("|O") in array metadata. When creating an array with dtype "|O", you had to specify the "object codec" (a codec that could serialize arbitrary Python objects) explicitly. Thus, the object codec is effectively part of the Zarr V2 dtype model.

I have formalized this by making `ZDType.to_json(zarr_format=2)` return a dict of the form `{"name": <the value you would see in .zarray under dtype>, "object_codec_id": <name of the object codec> | None}`; similarly, `ZDType.from_json(data, zarr_format=2)` expects a dict with the same form. This keeps the ZDType API simple while capturing the essential structure of a Zarr v2 data type in a single piece of data.
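A small sketch of that v2 JSON form (the dict literals follow the shape described above; "vlen-utf8" is numcodecs' codec ID for VLenUTF8):

```python
# The "|O" dtype plus an object codec identifies a v2 variable-length
# string; for most dtypes object_codec_id is simply None.
vlen_string_v2 = {"name": "|O", "object_codec_id": "vlen-utf8"}
int8_v2 = {"name": "|i1", "object_codec_id": None}

assert vlen_string_v2["object_codec_id"] == "vlen-utf8"
assert int8_v2["object_codec_id"] is None
```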

Unfortunately, that piece of data doesn't look exactly like what you would see in .zarray, and for most data types the object_codec_id field is None and conveys very little information. If this feels clunky, that's an honest reflection of the quite clunky data type model used by Zarr V2 :).

This is an internal change that doesn't affect the basic design, so I don't think we need new reviews. Unless any new objections surface, I will merge this sometime tomorrow, and we can get started with the remaining tasks (namely, docs).

@d-v-b merged commit 6798466 into zarr-developers:main on Jun 16, 2025
30 checks passed
@TomAugspurger (Contributor)

Nice work @d-v-b!

@jhamman (Member) commented Jun 16, 2025

Huge props to @d-v-b for pulling this across the finish line. This was a huge and highly-impactful lift. 🙌 🙌 🙌 🙌

For those that are following this thread, expect this to come out in Zarr 3.1.


@ngoldbaum - thanks for all your work on NumPy. We'd love to open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on the best place to do that?

@ilan-gold (Contributor)

@d-v-b Not sure if it's a bug or not, but here is something interesting.

Run this using, say, zarr 3.0.8:

```python
import zarr, numcodecs

z = zarr.open("foo.zarr", zarr_format=2)
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()], fill_value="")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=object)
```

Then after this PR:

```python
import zarr, numcodecs

z = zarr.open("foo.zarr")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=StringDType())
```

i.e., the dtype is now a string dtype. For us it was a simple fix, but not sure if it is (a) intentional or (b) indicative of some other issue.

@d-v-b (Contributor Author) commented Jun 16, 2025

> @d-v-b Not sure if it's a bug or not, but here is something interesting.
>
> Run this using, say, zarr 3.0.8:
>
> ```python
> import zarr, numcodecs
>
> z = zarr.open("foo.zarr", zarr_format=2)
> z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()], fill_value="")
> z["bar"][...]
> # array(['', '', '', '', '', '', '', '', '', ''], dtype=object)
> ```
>
> Then after this PR:
>
> ```python
> import zarr, numcodecs
>
> z = zarr.open("foo.zarr")
> z["bar"][...]
> # array(['', '', '', '', '', '', '', '', '', ''], dtype=StringDType())
> ```
>
> i.e., the dtype is now a string dtype. For us it was a simple fix, but not sure if it is (a) intentional or (b) indicative of some other issue.

NumPy 2.0 introduced a new data type specifically for variable-length strings; that's the StringDType() you are seeing in the numpy array. I suspect that prior to my PR, zarr-python was not using StringDType when reading variable-length strings in Zarr v2. But it should have been, IMO, so I think the current behavior is an improvement.

@ilan-gold (Contributor)

Agreed. The only reason it came up was that we were implicitly relying on the nullability of the object dtype for strings. So we'll transition away from that now, I think.

@d-v-b (Contributor Author) commented Jun 16, 2025

I think the numpy string dtype has nullability semantics; that might be worth investigating.

@dstansby (Contributor)

Was this intentionally merged into main, and not the 3.1.0 branch?

@d-v-b (Contributor Author) commented Jun 17, 2025

> Was this intentionally merged into main, and not the 3.1.0 branch?

Yes, but we can revisit that decision. I was under the impression that we were going to merge the 3.1 prs directly into main, but if we want to do a 3.0.9 release first, then we can revert this merge.

@dstansby (Contributor)

👍 - I'll delete the 3.1.0 branch then, and create a 3.0.x branch from before this was merged.
