
[ENH] Add optional removal of accents on functions.clean_names, enabled by default. #506


Merged 1 commit on Jul 28, 2019
1 change: 1 addition & 0 deletions AUTHORS.rst
@@ -76,3 +76,4 @@ Contributors
- `@puruckertom <https://github.com/puruckertom>`_ | `contributions <https://github.com/ericmjl/pyjanitor/pulls?utf8=%E2%9C%93&q=is%3Apr+author%3Apuruckertom>`_
- `@thomasjpfan <https://github.com/thomasjpfan>`_ | `contributions <https://github.com/ericmjl/pyjanitor/issues?q=is%3Aclosed+mentions%3Athomasjpfan>`_
- `@jiafengkevinchen <https://github.com/jiafengkevinchen>`_ | `contributions <https://github.com/ericmjl/pyjanitor/pull/480#issue-298730562>`_
- `@mralbu <https://github.com/mralbu>`_ | `contributions <https://github.com/ericmjl/pyjanitor/issues/502>`_
2 changes: 1 addition & 1 deletion CHANGELOG.rst
@@ -3,7 +3,7 @@ v0.18.1 (on deck)
- [ENH] add preserve_position kwarg to deconcatenate_column with tests by @shandou and @ericmjl
- [DOC] add contributions that did not leave ``git`` traces by @ericmjl
- [ENH] add inflation adjustment in finance submodule by @rahosbach

- [ENH] add optional removal of accents on functions.clean_names, enabled by default by @mralbu

For changes that happened prior to v0.18.1,
please consult the closed PRs,
17 changes: 17 additions & 0 deletions janitor/functions.py
@@ -2,6 +2,7 @@

import datetime as dt
import re
import unicodedata
import warnings
from fnmatch import translate
from functools import partial, reduce
@@ -121,6 +122,7 @@ def clean_names(
    strip_underscores: str = None,
    case_type: str = "lower",
    remove_special: bool = False,
    strip_accents: bool = True,
    preserve_original_columns: bool = True,
) -> pd.DataFrame:
    """
@@ -189,6 +191,9 @@ def _remove_special(col):
    if remove_special:
        df = df.rename(columns=_remove_special)

    if strip_accents:
        df = df.rename(columns=_strip_accents)

    df = df.rename(columns=lambda x: re.sub("_+", "_", x))
    df = _strip_underscores(df, strip_underscores)

@@ -208,6 +213,18 @@ def _normalize_1(col_name: str) -> str:
    return result


def _strip_accents(col_name: str) -> str:
    """
    Removes accents from a DataFrame column name.
    .. _StackOverflow: https://stackoverflow.com/questions/517923/what-is-the-best-way-to-remove-accents-in-a-python-unicode-string # noqa: E501
    """
    return "".join(
        l
        for l in unicodedata.normalize("NFD", col_name)
        if not unicodedata.combining(l)
    )


@pf.register_dataframe_method
def remove_empty(df: pd.DataFrame) -> pd.DataFrame:
    """
8 changes: 8 additions & 0 deletions tests/functions/test_clean_names.py
@@ -104,6 +104,14 @@ def test_clean_names_strip_underscores(
    assert set(df.columns) == set(expected_columns)


@pytest.mark.functions
def test_clean_names_strip_accents():
    df = pd.DataFrame({"João": [1, 2], "Лука́ся": [1, 2], "Käfer": [1, 2]})
    df = df.clean_names(strip_accents=True)
    expected_columns = ["joao", "лукася", "kafer"]
    assert set(df.columns) == set(expected_columns)


@pytest.mark.functions
def test_incorrect_strip_underscores(multiindex_dataframe):
    with pytest.raises(JanitorError):
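For readers who want to try the technique outside of pyjanitor, here is a minimal standalone sketch of the approach `_strip_accents` takes: decompose the string with NFD, then drop every combining mark. The function name `strip_accents_sketch` and the sample names are illustrative assumptions, not part of the PR, and the `.lower()` call only stands in for the lower-casing that `clean_names` applies by default.

```python
import unicodedata


def strip_accents_sketch(text: str) -> str:
    """Hypothetical standalone version of the helper added in this PR."""
    # NFD splits precomposed characters into a base character
    # followed by its combining marks (e.g. "ã" -> "a" + U+0303).
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero class for combining marks,
    # so keeping only the zero-class characters drops the accents.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# "\u0301" is the combining acute accent used as a stress mark.
names = ["João", "Лука\u0301ся", "Käfer", "Smørrebrød"]
cleaned = [strip_accents_sketch(name).lower() for name in names]
print(cleaned)
```

Two caveats follow from how canonical decomposition works. Letters with no decomposition, such as "ø", pass through unchanged. Letters whose decomposition is a different base letter plus a mark are altered: Cyrillic "й" becomes "и". Callers who need to preserve such letters may want to pass `strip_accents=False`.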