From 1b9bb12b24216e4a218144562f9545d4df84de8b Mon Sep 17 00:00:00 2001 From: Isaac Virshup Date: Wed, 28 Aug 2019 17:27:49 +1000 Subject: [PATCH 1/4] DOC: Fix docs on merging categoricals. Fixes #28166. --- doc/source/user_guide/categorical.rst | 10 +++++----- doc/source/user_guide/merging.rst | 2 +- 2 files changed, 6 insertions(+), 6 deletions(-) diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst index 8ca96ba0daa5e..4fc9c09ff9e8e 100644 --- a/doc/source/user_guide/categorical.rst +++ b/doc/source/user_guide/categorical.rst @@ -813,16 +813,16 @@ but the categories of these categoricals need to be the same: res res.dtypes -In this case the categories are not the same, and therefore an error is raised: +If the categories are not exactly the same, merging will coerce the +categoricals to their categories' dtypes: .. ipython:: python df_different = df.copy() df_different["cats"].cat.categories = ["c", "d"] - try: - pd.concat([df, df_different]) - except ValueError as e: - print("ValueError:", str(e)) + res = pd.concat([df, df_different]) + res + res.dtypes The same applies to ``df.append(df_different)``. diff --git a/doc/source/user_guide/merging.rst b/doc/source/user_guide/merging.rst index 4c0d3b75a4f79..dca744827477f 100644 --- a/doc/source/user_guide/merging.rst +++ b/doc/source/user_guide/merging.rst @@ -883,7 +883,7 @@ The merged result: .. note:: The category dtypes must be *exactly* the same, meaning the same categories and the ordered attribute. - Otherwise the result will coerce to ``object`` dtype. + Otherwise the result will coerce to the categories' dtype. .. 
note:: From 9fb6d67765557a7367b46eaacddfc49396bc9ea1 Mon Sep 17 00:00:00 2001 From: Isaac Virshup Date: Mon, 7 Oct 2019 18:18:46 +1100 Subject: [PATCH 2/4] DOC: Combine concat/ merge sections for categoricals --- doc/source/user_guide/categorical.rst | 91 ++++++++++----------------- 1 file changed, 32 insertions(+), 59 deletions(-) diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst index 4fc9c09ff9e8e..d94ba2f46e3fb 100644 --- a/doc/source/user_guide/categorical.rst +++ b/doc/source/user_guide/categorical.rst @@ -797,34 +797,47 @@ Assigning a ``Categorical`` to parts of a column of other types will use the val df.dtypes .. _categorical.merge: +.. _categorical.concat: -Merging -~~~~~~~ +Merging / Concatenation +~~~~~~~~~~~~~~~~~~~~~~~ -You can concat two ``DataFrames`` containing categorical data together, -but the categories of these categoricals need to be the same: +By default, combining ``Series`` or ``DataFrames`` which contain the same +categories results in ``category`` dtype, otherwise results will depend on the +dtype of the underlying categories. Merges that result in non-categorical +dtypes will likely have higher memory usage. Use ``.astype`` or +``union_categoricals`` to ensure ``category`` results. .. ipython:: python - cat = pd.Series(["a", "b"], dtype="category") - vals = [1, 2] - df = pd.DataFrame({"cats": cat, "vals": vals}) - res = pd.concat([df, df]) - res - res.dtypes + from pandas.api.types import union_categoricals -If the categories are not exactly the same, merging will coerce the -categoricals to their categories' dtypes: + # same categories + s1 = pd.Series(['a', 'b'], dtype='category') + s2 = pd.Series(['a', 'b', 'a'], dtype='category') + pd.concat([s1, s2]) + + # different categories + s3 = pd.Series(['b', 'c'], dtype='category') + pd.concat([s1, s3]) + + pd.concat([s1, s3]).astype('category') + union_categoricals([s1.array, s3.array]) -.. 
ipython:: python - df_different = df.copy() - df_different["cats"].cat.categories = ["c", "d"] - res = pd.concat([df, df_different]) - res - res.dtypes +Following table summarizes the results of ``Categoricals`` related combinations. -The same applies to ``df.append(df_different)``. ++----------+--------------------------------------------------------+----------------------------+ +| arg1 | arg2 | result | ++==========+========================================================+============================+ +| category | category (identical categories) | category | ++----------+--------------------------------------------------------+----------------------------+ +| category | category (different categories, both not ordered) | object (dtype is inferred) | ++----------+--------------------------------------------------------+----------------------------+ +| category | category (different categories, either one is ordered) | object (dtype is inferred) | ++----------+--------------------------------------------------------+----------------------------+ +| category | not category | object (dtype is inferred) | ++----------+--------------------------------------------------------+----------------------------+ See also the section on :ref:`merge dtypes` for notes about preserving merge dtypes and performance. @@ -920,46 +933,6 @@ the resulting array will always be a plain ``Categorical``: # "b" is coded to 0 throughout, same as c1, different from c2 c.codes -.. _categorical.concat: - -Concatenation -~~~~~~~~~~~~~ - -This section describes concatenations specific to ``category`` dtype. See :ref:`Concatenating objects` for general description. - -By default, ``Series`` or ``DataFrame`` concatenation which contains the same categories -results in ``category`` dtype, otherwise results in ``object`` dtype. -Use ``.astype`` or ``union_categoricals`` to get ``category`` result. - -.. 
ipython:: python - - # same categories - s1 = pd.Series(['a', 'b'], dtype='category') - s2 = pd.Series(['a', 'b', 'a'], dtype='category') - pd.concat([s1, s2]) - - # different categories - s3 = pd.Series(['b', 'c'], dtype='category') - pd.concat([s1, s3]) - - pd.concat([s1, s3]).astype('category') - union_categoricals([s1.array, s3.array]) - - -Following table summarizes the results of ``Categoricals`` related concatenations. - -+----------+--------------------------------------------------------+----------------------------+ -| arg1 | arg2 | result | -+==========+========================================================+============================+ -| category | category (identical categories) | category | -+----------+--------------------------------------------------------+----------------------------+ -| category | category (different categories, both not ordered) | object (dtype is inferred) | -+----------+--------------------------------------------------------+----------------------------+ -| category | category (different categories, either one is ordered) | object (dtype is inferred) | -+----------+--------------------------------------------------------+----------------------------+ -| category | not category | object (dtype is inferred) | -+----------+--------------------------------------------------------+----------------------------+ - Getting data in/out ------------------- From 2519b2de6e5a3abf54ef950af84485316d0cb80d Mon Sep 17 00:00:00 2001 From: Isaac Virshup Date: Mon, 14 Oct 2019 14:47:21 +1100 Subject: [PATCH 3/4] DOC: Concat categoricals example with numeric result. * Added an example where categoricals are concatenated, resulting in a numeric dtype. * Removed a table of examples which seemed confusing (most entries were equivalent and gave misleading typing info).
--- doc/source/user_guide/categorical.rst | 20 +++++--------------- 1 file changed, 5 insertions(+), 15 deletions(-) diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst index d94ba2f46e3fb..cee057d03734e 100644 --- a/doc/source/user_guide/categorical.rst +++ b/doc/source/user_guide/categorical.rst @@ -821,24 +821,14 @@ dtypes will likely have higher memory usage. Use ``.astype`` or s3 = pd.Series(['b', 'c'], dtype='category') pd.concat([s1, s3]) + # Output dtype is inferred based on categories values + int_cats = pd.Series([1, 2], dtype="category") + float_cats = pd.Series([3.0, 4.0], dtype="category") + pd.concat([int_cats, float_cats]) + pd.concat([s1, s3]).astype('category') union_categoricals([s1.array, s3.array]) - -Following table summarizes the results of ``Categoricals`` related combinations. - -+----------+--------------------------------------------------------+----------------------------+ -| arg1 | arg2 | result | -+==========+========================================================+============================+ -| category | category (identical categories) | category | -+----------+--------------------------------------------------------+----------------------------+ -| category | category (different categories, both not ordered) | object (dtype is inferred) | -+----------+--------------------------------------------------------+----------------------------+ -| category | category (different categories, either one is ordered) | object (dtype is inferred) | -+----------+--------------------------------------------------------+----------------------------+ -| category | not category | object (dtype is inferred) | -+----------+--------------------------------------------------------+----------------------------+ - See also the section on :ref:`merge dtypes` for notes about preserving merge dtypes and performance. 
From 1392e677edab6774e44a5166d5d1f85161be82c9 Mon Sep 17 00:00:00 2001 From: Isaac Virshup Date: Fri, 8 Nov 2019 18:02:48 +1100 Subject: [PATCH 4/4] Add back table --- doc/source/user_guide/categorical.rst | 16 ++++++++++++++-- 1 file changed, 14 insertions(+), 2 deletions(-) diff --git a/doc/source/user_guide/categorical.rst b/doc/source/user_guide/categorical.rst index cee057d03734e..5443f24161f67 100644 --- a/doc/source/user_guide/categorical.rst +++ b/doc/source/user_guide/categorical.rst @@ -829,8 +829,20 @@ dtypes will likely have higher memory usage. Use ``.astype`` or pd.concat([s1, s3]).astype('category') union_categoricals([s1.array, s3.array]) -See also the section on :ref:`merge dtypes` for notes about preserving merge dtypes and performance. - +The following table summarizes the results of merging ``Categoricals``: + ++-------------------+------------------------+----------------------+-----------------------------+ +| arg1 | arg2 | identical | result | ++===================+========================+======================+=============================+ +| category | category | True | category | ++-------------------+------------------------+----------------------+-----------------------------+ +| category (object) | category (object) | False | object (dtype is inferred) | ++-------------------+------------------------+----------------------+-----------------------------+ +| category (int) | category (float) | False | float (dtype is inferred) | ++-------------------+------------------------+----------------------+-----------------------------+ + +See also the section on :ref:`merge dtypes` for notes about +preserving merge dtypes and performance. .. _categorical.union:
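The dtype rules that this patch series documents can be checked with a short standalone script. The following is only a sketch, not part of the patch: it reuses the series from the documented examples (``s1``, ``s2``, ``s3``, ``int_cats``, ``float_cats``), while the result names (``same``, ``mixed``, ``numeric``, ``recast``, ``unioned``) are hypothetical and chosen for illustration. The exact fallback dtype for string categories may vary between pandas versions, so the script prints the dtypes rather than assuming them.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Identical categories: the concatenated result keeps the category dtype.
s1 = pd.Series(["a", "b"], dtype="category")
s2 = pd.Series(["a", "b", "a"], dtype="category")
same = pd.concat([s1, s2])

# Different categories: the result falls back to the dtype of the
# underlying categories instead of staying categorical.
s3 = pd.Series(["b", "c"], dtype="category")
mixed = pd.concat([s1, s3])

# Integer and float categories: the inferred common dtype is a float dtype.
int_cats = pd.Series([1, 2], dtype="category")
float_cats = pd.Series([3.0, 4.0], dtype="category")
numeric = pd.concat([int_cats, float_cats])

# Two ways to get a categorical result back after combining.
recast = mixed.astype("category")
unioned = union_categoricals([s1.array, s3.array])

# Print the resulting dtypes so they can be compared with the summary table.
for name, obj in [
    ("same", same),
    ("mixed", mixed),
    ("numeric", numeric),
    ("recast", recast),
    ("unioned", unioned),
]:
    print(f"{name}: {obj.dtype}")
```

Running it on the pandas version the docs are built against is a quick way to confirm that the rows of the summary table match actual behavior.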