BUG: Series.str.isdigit with pyarrow dtype doesn't honor unicode superscripts #61466

Open
2 of 3 tasks
GarrettWu opened this issue May 20, 2025 · 2 comments
Labels
Arrow (pyarrow functionality), Bug, Strings (String extension data type and string data)

Comments

@GarrettWu

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd
s = pd.Series(['23', '³', '⅕', ''], dtype=pd.StringDtype(storage="pyarrow"))
s.str.isdigit()


0     True
1    False
2    False
3    False
dtype: boolean

Issue Description

Series.str.isdigit() with the pyarrow string dtype doesn't honor Unicode superscript/subscript digits, which diverges from the public documentation: https://pandas.pydata.org/docs/reference/api/pandas.Series.str.isdigit.html#pandas.Series.str.isdigit

The bug only occurs with the pyarrow string dtype; the Python string dtype behaves correctly.
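For reference, here is a minimal sketch of how plain Python classifies the same four strings with its built-in predicates. It uses only the standard library; `samples` and `classify` are illustrative names, not part of pandas or pyarrow.

```python
# Compare Python's three numeric string predicates on the strings from the report.
samples = ["23", "³", "⅕", ""]


def classify(text):
    """Return the result of each built-in numeric predicate for one string."""
    return {
        "isdecimal": text.isdecimal(),  # decimal digits only (e.g. "0"-"9")
        "isdigit": text.isdigit(),      # decimals plus digits such as superscripts
        "isnumeric": text.isnumeric(),  # digits plus other numerics such as fractions
    }


for text in samples:
    print(repr(text), classify(text))
```

The superscript '³' passes isdigit() but not isdecimal(), while the fraction '⅕' passes only isnumeric(). The expected output in this report follows the isdigit() column.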

Expected Behavior

import pandas as pd
s = pd.Series(['23', '³', '⅕', ''], dtype=pd.StringDtype(storage="pyarrow"))
s.str.isdigit()
0     True
1     True
2    False
3    False
dtype: boolean

Installed Versions

INSTALLED VERSIONS

commit : 0691c5c
python : 3.11.12
python-bits : 64
OS : Linux
OS-release : 6.1.123+
Version : #1 SMP PREEMPT_DYNAMIC Sun Mar 30 16:01:29 UTC 2025
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : en_US.UTF-8
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.3
numpy : 2.0.2
pytz : 2025.2
dateutil : 2.9.0.post0
pip : 24.1.2
Cython : 3.0.12
sphinx : 8.2.3
IPython : 7.34.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.13.4
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : None
fsspec : 2025.3.2
html5lib : 1.1
hypothesis : None
gcsfs : 2025.3.2
jinja2 : 3.1.6
lxml.etree : 5.4.0
matplotlib : 3.10.0
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : 0.28.1
psycopg2 : 2.9.10
pymysql : None
pyarrow : 18.1.0
pyreadstat : None
pytest : 8.3.5
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.15.3
sqlalchemy : 2.0.40
tables : 3.10.2
tabulate : 0.9.0
xarray : 2025.3.1
xlrd : 2.0.1
xlsxwriter : None
zstandard : 0.23.0
tzdata : 2025.2
qtpy : None
pyqt5 : None

GarrettWu added the Bug and Needs Triage (Issue that has not been reviewed by a pandas team member) labels on May 20, 2025
GarrettWu changed the title from "BUG: Series.str.isdigit" to "BUG: Series.str.isdigit with pyarrow dtype doesn't honor unicode superscripts" on May 20, 2025
@rhshadrach
Member

Thanks for the report, confirmed on main. Further investigations and PRs to fix are welcome!

rhshadrach added the Strings (String extension data type and string data) and Arrow (pyarrow functionality) labels and removed the Needs Triage (Issue that has not been reviewed by a pandas team member) label on May 20, 2025
@iabhi4
Contributor

iabhi4 commented May 24, 2025

@rhshadrach The issue stems from pyarrow.compute.utf8_is_digit not recognizing non-ASCII Unicode digits (e.g., '³'). To align with str.isdigit()'s behavior and the pandas docs, I propose replacing the Arrow compute call in _str_isdigit() with:

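import numpy as np

def _str_isdigit(self):
    # Fall back to Python's str.isdigit(), which recognizes Unicode digits such as '³'.
    values = self.to_numpy(na_value=None)
    data = []
    mask = []

    for val in values:
        if val is None:
            # Missing value: the stored bool is a placeholder, the mask marks it as NA.
            data.append(False)
            mask.append(True)
        else:
            data.append(val.isdigit())
            mask.append(False)

    # Imported locally, as in the original proposal.
    from pandas.core.arrays.boolean import BooleanArray
    return BooleanArray(np.array(data, dtype=bool), np.array(mask, dtype=bool))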

While this isn’t vectorized, it correctly honors all Unicode digit categories, which aligns with user expectations. Let me know if this workaround is acceptable for now, or if you’d prefer keeping the current Arrow-based behavior and instead clarifying the limitation in the documentation.
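The data/mask bookkeeping in the proposed loop can be tried out without pandas. The sketch below is a standalone illustration under that assumption; `isdigit_with_mask` is a hypothetical helper name, and it returns plain lists rather than a BooleanArray.

```python
def isdigit_with_mask(values):
    """Apply str.isdigit() element-wise, tracking missing values in a mask.

    Returns (data, mask): data[i] is the isdigit() result (False as a
    placeholder for missing entries) and mask[i] is True where the input
    was None.
    """
    data = []
    mask = []
    for val in values:
        if val is None:
            data.append(False)  # placeholder; the mask carries the NA information
            mask.append(True)
        else:
            data.append(val.isdigit())
            mask.append(False)
    return data, mask


data, mask = isdigit_with_mask(["23", "³", "⅕", "", None])
print(data)
print(mask)
```

Keeping the mask separate from the data is what lets a missing input surface as NA in the result instead of as False.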

Related upstream issue: I’ve confirmed that this is a pyarrow limitation and have raised an enhancement request in the Arrow repo to bring utf8_is_digit in line with str.isdigit().

Optionally, we could also explore reimplementing this in Cython using PyUnicode_READ and Py_UNICODE_ISDIGIT for performance while maintaining Unicode correctness.
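As a rough pure-Python illustration of the per-code-point check such a Cython version would perform, the sketch below tests each character with unicodedata.digit from the standard library. The function name `is_digit_string` is hypothetical, and this is not the Cython implementation itself.

```python
import unicodedata


def is_digit_string(text):
    """Return True if text is non-empty and every code point has a digit value."""
    if not text:
        # Mirror str.isdigit(): the empty string is not a digit string.
        return False
    # unicodedata.digit returns the default (None) for code points without a digit value.
    return all(unicodedata.digit(ch, None) is not None for ch in text)


for text in ["23", "³", "⅕", ""]:
    print(repr(text), is_digit_string(text))
```

A compiled version would do the same walk over code points without creating a Python object per character, which is where the performance gain would come from.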

Let me know what direction you'd prefer; I'm happy to work on a patch either way.


3 participants