-
-
Notifications
You must be signed in to change notification settings - Fork 331
refactor v3 data types #2874
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
refactor v3 data types #2874
Conversation
…into feat/fixed-length-strings
copying a comment @nenb made in this zulip discussion:
A feature of the character code is that it provides a way to distinguish parametric types like |
Summarising from a zulip discussion: @nenb: How is the endianness of a dtype handled? @d-v-b: In v3, endianness is specified by the codecs. The exact same data type could be decoded to big or little endian representations based on the state of a codec. This is very much not how v2 does it -- v2 puts the endianness in the dtype. Proposed solution: Make |
Thanks for the summary! I have implemented the proposed solution. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I took this for a spin locally. So excited.
Lots of work has gone into this PR. Time to get it out into the wild.
.. code-block:: python | ||
|
||
>>> scalar_value = int8.from_json_scalar(42, zarr_format=3) | ||
>>> assert scalar_value == np.int8(42) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was expecting to find a section called "defining a custom data type." But we can do that in a follow-up PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on this, would be great to understand how structured dtypes can now be used!
Our current test failures with this branch mainly resolve around them
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you show me those test failures?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/scverse/anndata/actions/runs/15413857020/job/43873874823?pr=1995 There are also some errors around stores being append only but I think those are unrelated (need to check).
Quick repro:
import numpy as np, pandas as pd, zarr
from string import ascii_letters
def gen_vstr_recarray(m, n, dtype=None):
size = m * n
lengths = np.random.randint(3, 5, size)
letters = np.array(list(ascii_letters))
gen_word = lambda l: "".join(np.random.choice(letters, l))
arr = np.array([gen_word(l) for l in lengths]).reshape(m, n)
return pd.DataFrame(arr, columns=[gen_word(5) for i in range(n)]).to_records(
index=False, column_dtypes=dtype
)
elem = gen_vstr_recarray(6, 5)
f = zarr.open("foo.zarr")
f.create_array("rec", shape=elem.shape, dtype=elem.dtype)
yielding KeyError: '|V40'
in the details here:
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/_compat.py:43, in _deprecate_positional_args.._inner_deprecate_positional_args..inner_f(*args, **kwargs)
41 extra_args = len(args) - len(all_args)
42 if extra_args <= 0:
---> 43 return f(*args, **kwargs)
45 # extra_args > 0
46 args_msg = [
47 f"{name}={arg}"
48 for name, arg in zip(kwonly_args[:extra_args], args[-extra_args:], strict=False)
49 ]
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/group.py:2523, in Group.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
2428 """Create an array within this group.
2429
2430 This method lightly wraps :func:zarr.core.array.create_array
.
(...) 2517 AsyncArray
2518 """
2519 compressors = _parse_deprecated_compressor(
2520 compressor, compressors, zarr_format=self.metadata.zarr_format
2521 )
2522 return Array(
-> 2523 self._sync(
2524 self._async_group.create_array(
2525 name=name,
2526 shape=shape,
2527 dtype=dtype,
2528 chunks=chunks,
2529 shards=shards,
2530 fill_value=fill_value,
2531 attributes=attributes,
2532 chunk_key_encoding=chunk_key_encoding,
2533 compressors=compressors,
2534 serializer=serializer,
2535 dimension_names=dimension_names,
2536 order=order,
2537 filters=filters,
2538 overwrite=overwrite,
2539 storage_options=storage_options,
2540 config=config,
2541 )
2542 )
2543 )
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
206 # TODO: refactor this to to take *args and **kwargs and pass those to the method
207 # this should allow us to better type the sync wrapper
--> 208 return sync(
209 coroutine,
210 timeout=config.get("async.timeout"),
211 )
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
160 return_result = next(iter(finished)).result()
162 if isinstance(return_result, BaseException):
--> 163 raise return_result
164 else:
165 return return_result
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/sync.py:119, in _runner(coro)
114 """
115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
116 exception, the exception will be returned.
117 """
118 try:
--> 119 return await coro
120 except Exception as ex:
121 return ex
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/group.py:1111, in AsyncGroup.create_array(self, name, shape, dtype, chunks, shards, filters, compressors, compressor, serializer, fill_value, order, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config)
1016 """Create an array within this group.
1017
1018 This method lightly wraps :func:zarr.core.array.create_array
.
(...) 1106
1107 """
1108 compressors = _parse_deprecated_compressor(
1109 compressor, compressors, zarr_format=self.metadata.zarr_format
1110 )
-> 1111 return await create_array(
1112 store=self.store_path,
1113 name=name,
1114 shape=shape,
1115 dtype=dtype,
1116 chunks=chunks,
1117 shards=shards,
1118 filters=filters,
1119 compressors=compressors,
1120 serializer=serializer,
1121 fill_value=fill_value,
1122 order=order,
1123 zarr_format=self.metadata.zarr_format,
1124 attributes=attributes,
1125 chunk_key_encoding=chunk_key_encoding,
1126 dimension_names=dimension_names,
1127 storage_options=storage_options,
1128 overwrite=overwrite,
1129 config=config,
1130 )
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/array.py:4456, in create_array(store, name, shape, dtype, data, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, storage_options, overwrite, config, write_data)
4451 mode: Literal["a"] = "a"
4453 store_path = await make_store_path(
4454 store, path=name, mode=mode, storage_options=storage_options
4455 )
-> 4456 return await init_array(
4457 store_path=store_path,
4458 shape=shape_parsed,
4459 dtype=dtype_parsed,
4460 chunks=chunks,
4461 shards=shards,
4462 filters=filters,
4463 compressors=compressors,
4464 serializer=serializer,
4465 fill_value=fill_value,
4466 order=order,
4467 zarr_format=zarr_format,
4468 attributes=attributes,
4469 chunk_key_encoding=chunk_key_encoding,
4470 dimension_names=dimension_names,
4471 overwrite=overwrite,
4472 config=config,
4473 )
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/array.py:4242, in init_array(store_path, shape, dtype, chunks, shards, filters, compressors, serializer, fill_value, order, zarr_format, attributes, chunk_key_encoding, dimension_names, overwrite, config)
4230 meta = AsyncArray._create_metadata_v2(
4231 shape=shape_parsed,
4232 dtype=dtype_parsed,
(...) 4239 attributes=attributes,
4240 )
4241 else:
-> 4242 array_array, array_bytes, bytes_bytes = _parse_chunk_encoding_v3(
4243 compressors=compressors,
4244 filters=filters,
4245 serializer=serializer,
4246 dtype=dtype_parsed,
4247 )
4248 sub_codecs = cast("tuple[Codec, ...]", (*array_array, array_bytes, *bytes_bytes))
4249 codecs_out: tuple[Codec, ...]
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/array.py:4684, in _parse_chunk_encoding_v3(compressors, filters, serializer, dtype)
4674 def _parse_chunk_encoding_v3(
4675 *,
4676 compressors: CompressorsLike,
(...) 4679 dtype: np.dtype[Any],
4680 ) -> tuple[tuple[ArrayArrayCodec, ...], ArrayBytesCodec, tuple[BytesBytesCodec, ...]]:
4681 """
4682 Generate chunk encoding classes for v3 arrays with optional defaults.
4683 """
-> 4684 default_array_array, default_array_bytes, default_bytes_bytes = _get_default_chunk_encoding_v3(
4685 dtype
4686 )
4688 if filters is None:
4689 out_array_array: tuple[ArrayArrayCodec, ...] = ()
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/array.py:4590, in _get_default_chunk_encoding_v3(np_dtype)
4584 def _get_default_chunk_encoding_v3(
4585 np_dtype: np.dtype[Any],
4586 ) -> tuple[tuple[ArrayArrayCodec, ...], ArrayBytesCodec, tuple[BytesBytesCodec, ...]]:
4587 """
4588 Get the default ArrayArrayCodecs, ArrayBytesCodec, and BytesBytesCodec for a given dtype.
4589 """
-> 4590 dtype = DataType.from_numpy(np_dtype)
4591 if dtype == DataType.string:
4592 dtype_key = "string"
File ~/Projects/Theis/anndata/venv_13/lib/python3.13/site-packages/zarr/core/metadata/v3.py:705, in DataType.from_numpy(cls, dtype)
687 return DataType.string
688 dtype_to_data_type = {
689 "|b1": "bool",
690 "bool": "bool",
(...) 703 "<c16": "complex128",
704 }
--> 705 return DataType[dtype_to_data_type[dtype.str]]
KeyError: '|V40'
</details>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks! I will try to get this tested and fixed in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems to have worked!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@d-v-b Only thing is that you need specify VLenUtf8
as a filter for the v2 file format case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://github.com/scverse/anndata/pull/1995/files Everything that isn't in the unit tests is what we had to change to work with this PR from before. And structured dtypes are a go for us with v3, things seems to be working.
We still are having some backwards compat issues - I think I had previously posted about them around v2 file format fill values for strings being allowed to be 0. I'm adding the test data here:
import zarr
store = zarr.open(zarr.storage.ZipStore('/path_to/adata.zarr.zip', read_only=True), mode="r")
store["obs"]["__categories/cat_ordered"]
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Cell In[27], line 1
----> 1 store["obs"]["__categories/cat_ordered"]
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:1860, in Group.__getitem__(self, path)
1833 def __getitem__(self, path: str) -> Array | Group:
1834 """Obtain a group member.
1835
1836 Parameters
(...) 1858
1859 """
-> 1860 obj = self._sync(self._async_group.getitem(path))
1861 if isinstance(obj, AsyncArray):
1862 return Array(obj)
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:208, in SyncMixin._sync(self, coroutine)
205 def _sync(self, coroutine: Coroutine[Any, Any, T]) -> T:
206 # TODO: refactor this to to take *args and **kwargs and pass those to the method
207 # this should allow us to better type the sync wrapper
--> 208 return sync(
209 coroutine,
210 timeout=config.get("async.timeout"),
211 )
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:163, in sync(coro, loop, timeout)
160 return_result = next(iter(finished)).result()
162 if isinstance(return_result, BaseException):
--> 163 raise return_result
164 else:
165 return return_result
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/sync.py:119, in _runner(coro)
114 """
115 Await a coroutine and return the result of running it. If awaiting the coroutine raises an
116 exception, the exception will be returned.
117 """
118 try:
--> 119 return await coro
120 except Exception as ex:
121 return ex
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:692, in AsyncGroup.getitem(self, key)
690 return self._getitem_consolidated(store_path, key, prefix=self.name)
691 try:
--> 692 return await get_node(
693 store=store_path.store, path=store_path.path, zarr_format=self.metadata.zarr_format
694 )
695 except FileNotFoundError as e:
696 raise KeyError(key) from e
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:3621, in get_node(store, path, zarr_format)
3619 match zarr_format:
3620 case 2:
-> 3621 return await _get_node_v2(store=store, path=path)
3622 case 3:
3623 return await _get_node_v3(store=store, path=path)
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:3576, in _get_node_v2(store, path)
3561 async def _get_node_v2(store: Store, path: str) -> AsyncArray[ArrayV2Metadata] | AsyncGroup:
3562 """
3563 Read a Zarr v2 AsyncArray or AsyncGroup from a path in a Store.
3564
(...) 3574 AsyncArray | AsyncGroup
3575 """
-> 3576 metadata = await _read_metadata_v2(store=store, path=path)
3577 return _build_node(store=store, path=path, metadata=metadata)
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:3468, in _read_metadata_v2(store, path)
3465 else:
3466 zmeta = json.loads(zgroup_bytes.to_bytes())
-> 3468 return _build_metadata_v2(zmeta, zattrs)
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/group.py:3524, in _build_metadata_v2(zarr_json, attrs_json)
3522 match zarr_json:
3523 case {"shape": _}:
-> 3524 return ArrayV2Metadata.from_dict(zarr_json | {"attributes": attrs_json})
3525 case _: # pragma: no cover
3526 return GroupMetadata.from_dict(zarr_json | {"attributes": attrs_json})
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/metadata/v2.py:163, in ArrayV2Metadata.from_dict(cls, data)
161 fill_value_encoded = _data.get("fill_value")
162 if fill_value_encoded is not None:
--> 163 fill_value = dtype.from_json_scalar(fill_value_encoded, zarr_format=2)
164 _data["fill_value"] = fill_value
166 # zarr v2 allowed arbitrary keys here.
167 # We don't want the ArrayV2Metadata constructor to fail just because someone put an
168 # extra key in the metadata.
File ~/Projects/Theis/anndata/venv/lib/python3.12/site-packages/zarr/core/dtype/npy/string.py:281, in VariableLengthUTF8.from_json_scalar(self, data, zarr_format)
277 """
278 Strings pass through
279 """
280 if not check_json_str(data):
--> 281 raise TypeError(f"Invalid type: {data}. Expected a string.")
282 return data
TypeError: Invalid type: 0. Expected a string.
Test data: adata.zarr.zip
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should now be fixed, could you try it again?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes will point at main now @d-v-b thanks!
|
||
|
||
# TODO: find a better name for this function | ||
def get_data_type_from_native_dtype(dtype: npt.DTypeLike) -> ZDType[TBaseDType, TBaseScalar]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW, I did a bit of manual QC on this function and it worked exactly as I expected it to in every scenario. 🙌
Co-authored-by: Ryan Abernathey <[email protected]>
thanks @rabernat! I have a final cleanup in the works where I simplify the |
Hi! I'm a NumPy maintainer who has worked a bit on the new user DType system we released in NumPy 2.0. @nenb let me know that this PR is a thing and it's getting merged soon. One of the main things we (the NumPy developers) skipped before releasing user DTypes was adding a way to serialize user DTypes to I wonder if instead of working on improving the This is all just me offering my personal opinion, not an opinion of anyone else involved with NumPy. Nick told me he's going on vacation for a while but he's interested in chatting more. I'm just commenting publicly because it's exciting to me 😃. |
…guards for native dtype and json input
…into feat/fixed-length-strings
…zarr-python into feat/fixed-length-strings
…odec_id in a typeddict for zarr v2 metadata
Thanks for the interest @ngoldbaum! I think it would be super interesting to evaluate Zarr as a "standard" serialization scheme for NumPy arrays.
This is something Zarr has to address for every dtype, but we operate under some constraints that make things simpler for us: all dtypes have to serialize to JSON, and as of Zarr V3 the structure of the JSON form of a dtype is well-defined (either a string or a dict with restricted keys). But I imagine NumPy might reasonably not want to commit to JSON, which could complicate the serialization story a bit. |
I did a final pass of type safety stuff, and I made a substantial change to how zarr v2 data types are serialized to and from JSON. As a reminder, Zarr python 2 supported multiple distinct array data types with the same I have formalized this by making Unfortunately, that piece of data doesn't look exactly like what you would see in This is an internal change that doesn't affect the basic design, so I don't think we need new reviews. Unless any new objections surface, I will merge this sometime tomorrow, and we can get started with the remaining tasks (namely, docs). |
Nice work @d-v-b! |
Huge props to @d-v-b for pulling this across the finish line. This was a huge and highly-impactful lift. 🙌 🙌 🙌 🙌 For those that are following this thread, expect this to come out in Zarr 3.1. @ngoldbaum - thanks for all your work on NumPy. We'd love open the conversation about supporting writing to Zarr directly from NumPy. Can you advise on where the best place to do that is? |
@d-v-b Not sure if it's a bug or not, but here is something interesting. Run this using, say, zarr 3.0.8: import zarr, numcodecs
z = zarr.open("foo.zarr", zarr_format=2)
z.create_array("bar", (10,), dtype=object, filters=[numcodecs.VLenUTF8()], fill_value="")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=object) Then after this PR: import zarr, numcodecs
z = zarr.open("foo.zarr")
z["bar"][...]
# array(['', '', '', '', '', '', '', '', '', ''], dtype=StringDType()) i.e., the dtype is now a string dtype. For us it was a simple fix, but not sure if it is (a) intentional or (b) indicative of some other issue. |
Numpy 2.0 introduced a new data type specifically for variable length strings. that's the |
Agreed. The only reason it came up was that we were implicitly relying on the nullability of the object dtype in strings. So we'll transition now away from that, I think |
I think the numpy string dtype has nullability semantics, that might be worth investigating |
Was this intentionally merged into main, and not the 3.1.0 branch? |
Yes, but we can revisit that decision. I was under the impression that we were going to merge the 3.1 prs directly into main, but if we want to do a 3.0.9 release first, then we can revert this merge. |
👍 - I'll delete the 3.1.0 branch then, and create a 3.0.x branch from before when this was merged. |
As per #2750, we need a new model of data types if we want to support more data types. Accordingly, this PR will refactor data types for the zarr v3 side of the codebase and make them extensible. I would also like to handle v2 as well with the same data structures, and confine the v2 / v3 differences to the places where they vary.
In
main
,all the v3 data types are encoded as variants of an enum (i.e., strings). Enumerating each dtype as a string is cumbersome for datetimes, that are parametrized by a time unit, and plain unworkable for parametric dtypes like fixed-length strings, which are parametrized by their length. This means we need a model of data types that can be parametrized, and I think separate classes is probably the way to go here. Separating the different data types into different objects also gives us a natural way to capture some of the per-data type variability baked into the spec: each data type class can define its own default value, and also define methods for how its scalars should be converted to / from JSON.This is a very rough draft right now -- I'm mostly posting this for visibility as I iterate on it.