consider removing default compressor / filters / serializer from config #3104

Open
@d-v-b

Our config right now contains this logic for defining a default encoding scheme for a given data type:

    "v2_default_filters": {
        "numeric": None,
        "string": [{"id": "vlen-utf8"}],
        "bytes": [{"id": "vlen-bytes"}],
        "raw": None,
    },
    "v3_default_filters": {"numeric": [], "string": [], "bytes": []},
    "v3_default_serializer": {
        "numeric": {"name": "bytes", "configuration": {"endian": "little"}},
        "string": {"name": "vlen-utf8"},
        "bytes": {"name": "vlen-bytes"},
    },
    "v3_default_compressors": {
        "numeric": [
            {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
        ],
        "string": [
            {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
        ],
        "bytes": [
            {"name": "zstd", "configuration": {"level": 0, "checksum": False}},
        ],
    },

This approach is problematic because it requires dividing our data types into separate categories that are not well defined. For example, is a fixed-length UTF-32 data type a "string" type or a "numeric" type?

Given the changes coming in #2874, I propose the following alteration to our approach here:

  • Remove these default-encoding entries from the config entirely.

  • Confine all this logic to a single function that automatically picks a chunk encoding based on a data type and a requested chunk encoding. This function should also check for incompatibility between a data type and a requested chunk encoding. For example, if someone requests a variable-length string data type but specifies a serializer other than vlen-utf8, then they should get a clear, early error.

These would be breaking changes, but our current approach is, IMO, unworkable.
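
To make the proposal concrete, here is a rough sketch of what such a single function might look like for the v3 case. Everything in it is hypothetical: `DataType`, `ChunkEncoding`, `resolve_chunk_encoding`, and the `required_serializer` attribute are illustration-only names, not existing zarr-python API, and the defaults are simply the values from the config quoted above. The idea is that each data type declares what it needs, so no "numeric" / "string" / "bytes" categories are required.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-ins for illustration only; none of these names are zarr-python API.


@dataclass(frozen=True)
class DataType:
    name: str
    # Serializer that a variable-length type requires; None means fixed-size elements.
    required_serializer: Optional[str] = None


@dataclass(frozen=True)
class ChunkEncoding:
    filters: tuple
    serializer: dict
    compressors: tuple


# Defaults copied from the v3 config values quoted in this issue.
BYTES = {"name": "bytes", "configuration": {"endian": "little"}}
ZSTD = {"name": "zstd", "configuration": {"level": 0, "checksum": False}}


def resolve_chunk_encoding(dtype, filters=None, serializer=None, compressors=None):
    """Fill in unspecified parts (None) and reject incompatible requests early."""
    if dtype.required_serializer is None:
        default_serializer = BYTES
    else:
        default_serializer = {"name": dtype.required_serializer}

    chosen = default_serializer if serializer is None else serializer

    # The compatibility check lives next to the defaulting logic.
    if dtype.required_serializer is not None and chosen["name"] != dtype.required_serializer:
        raise ValueError(
            f"data type {dtype.name!r} requires serializer "
            f"{dtype.required_serializer!r}, got {chosen['name']!r}"
        )

    return ChunkEncoding(
        filters=() if filters is None else tuple(filters),
        serializer=chosen,
        compressors=(ZSTD,) if compressors is None else tuple(compressors),
    )


# Usage: the fixed-length UTF-32 type needs no category, it just has fixed-size elements.
fixed_utf32 = DataType("fixed_length_utf32")
vlen_string = DataType("string", required_serializer="vlen-utf8")

print(resolve_chunk_encoding(fixed_utf32).serializer["name"])
print(resolve_chunk_encoding(vlen_string).serializer["name"])

try:
    resolve_chunk_encoding(vlen_string, serializer=BYTES)
except ValueError as exc:
    print(exc)
```

In this sketch `None` means "pick a default for me"; a real implementation would probably want a distinct sentinel so that users can still explicitly request no compressors or no filters.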

Labels: enhancement