Description
The utf8_is_digit kernel in pyarrow.compute does not fully replicate Python's str.isdigit() behavior, especially with certain Unicode digit characters.
For example, the character '³' (U+00B3 SUPERSCRIPT THREE) returns True with Python's str.isdigit() but returns False when passed to pyarrow.compute.utf8_is_digit.
This divergence leads to downstream inconsistencies, particularly in pandas when using StringDtype(storage="pyarrow").
Reproduction
import pyarrow as pa
import pyarrow.compute as pc
arr = pa.array(['3', '٣', '५', '123', '³'])
print(pc.utf8_is_digit(arr).to_pylist())
Output:
[True, True, True, True, False] # <-- '³' incorrectly returns False
Expected Output (matches str.isdigit()):
[True, True, True, True, True]
Notes
- The issue seems to stem from the implementation of IsDigitUnicode::PredicateCharacterAll not including characters in the Unicode "No" (Number, Other) category, such as superscript digits (³, ², etc.).
- Python's behavior can be verified as:
print("³".isdigit()) # True
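To make the category point concrete, here is a small standard-library sketch (not part of the original report) that prints the Unicode general category next to Python's isdecimal() and isdigit() results. The '½' sample is an added illustration: it is also in category "No", yet Python does not treat it as a digit, so matching str.isdigit() probably needs the digit property rather than the whole "No" category. How Arrow's kernel actually decides is only the hypothesis stated above.

```python
import unicodedata

# '½' is an extra sample added for illustration; it is not in the original reproduction.
samples = ['3', '٣', '५', '³', '½']

def describe(ch):
    """Return (general category, str.isdecimal(), str.isdigit()) for one character."""
    return unicodedata.category(ch), ch.isdecimal(), ch.isdigit()

report = {ch: describe(ch) for ch in samples}

for ch, (category, is_decimal, is_digit) in report.items():
    print(f"U+{ord(ch):04X} category={category} isdecimal={is_decimal} isdigit={is_digit}")
```

The first three samples are in category "Nd", while '³' and '½' are both "No"; only '³' is accepted by str.isdigit().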
Impact
This affects pandas string operations like .str.isdigit() when using pyarrow storage. Python string-based behavior passes, but pyarrow-based behavior fails for characters like '³'.
System Info
Tested with:
- PyArrow 20.0.0 (pip-installed)
- PyArrow main 0.1.dev17578+g218c886
- Python 3.12
- Debian-based Linux (Ubuntu)
Component(s)
Python