[Python] utf8_is_digit in PyArrow doesn't fully match Python's str.isdigit() (e.g., fails for '³') #46589

Closed
@iabhi4


Describe the bug, including details regarding any error messages, version, and platform.

Description

The utf8_is_digit kernel in pyarrow.compute does not fully replicate Python's str.isdigit() behavior, especially with certain Unicode digit characters.

For example, the character '³' (U+00B3 SUPERSCRIPT THREE) returns True with Python’s str.isdigit() but returns False when passed to pyarrow.compute.utf8_is_digit.

This divergence leads to downstream inconsistencies, particularly in pandas when using StringDtype(storage="pyarrow").


Reproduction

import pyarrow as pa
import pyarrow.compute as pc

arr = pa.array(['3', '٣', '५', '123', '³'])
print(pc.utf8_is_digit(arr).to_pylist())

Output:

[True, True, True, True, False]  # <-- '³' incorrectly returns False

Expected Output (matches str.isdigit()):

[True, True, True, True, True]
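The expected list can be cross-checked without PyArrow by applying str.isdigit() to the same values. A minimal sketch (the variable names here are illustrative only):

```python
# Same values as in the reproduction above.
values = ['3', '٣', '५', '123', '³']

# Reference result from Python's built-in str.isdigit().
expected = [s.isdigit() for s in values]
print(expected)
```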

Notes

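  • The issue seems to stem from IsDigitUnicode::PredicateCharacterAll accepting only characters in the Unicode "Nd" (Number, Decimal Digit) category. Python's str.isdigit() also accepts characters with Numeric_Type=Digit, which includes some characters in the "No" (Number, Other) category, such as the superscript digits (³, ², etc.).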
  • Python's behavior can be verified as:
print("³".isdigit())  # True
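
To see where the two definitions part ways, the standard-library unicodedata module can report each character's general category next to Python's own predicates. The sketch below also includes nd_only_is_digit, a hypothetical pure-Python model of an "Nd only" check. It is an assumption about how the kernel behaves, not a copy of Arrow's implementation.

```python
import unicodedata


def describe(ch):
    """Return (general category, str.isdecimal(), str.isdigit()) for one character."""
    return unicodedata.category(ch), ch.isdecimal(), ch.isdigit()


def nd_only_is_digit(s):
    """Hypothetical model of an 'Nd only' predicate (an assumption, not Arrow code):
    the string must be non-empty and every character must be in category Nd."""
    return len(s) > 0 and all(unicodedata.category(c) == "Nd" for c in s)


# Per-character view: code point, category, isdecimal, isdigit.
for ch in ['3', '٣', '५', '³']:
    print(f"U+{ord(ch):04X}", *describe(ch))

# Per-string view on the values from the reproduction.
values = ['3', '٣', '५', '123', '³']
print([nd_only_is_digit(s) for s in values])
print([s.isdigit() for s in values])
```

Under this model, '³' is the only value on which the Nd-only check and str.isdigit() disagree, which is consistent with the output reported above.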

Impact

This affects pandas string operations like .str.isdigit() when using pyarrow storage. With Python-backed string storage, .str.isdigit() returns True for characters like '³'; with pyarrow-backed storage it returns False.


System Info

Tested with:

  • PyArrow 20.0.0 (pip-installed)
  • PyArrow main (0.1.dev17578+g218c886)
  • Python 3.12
  • Debian-based Linux (Ubuntu)

Component(s)

Python
