Commit 321a0c4

Updated docs.
1 parent f71faa4 commit 321a0c4

4 files changed

+163
-3
lines changed

docs/conf.py

Lines changed: 1 addition & 0 deletions
@@ -46,6 +46,7 @@
4646
'sphinx.ext.coverage',
4747
'sphinx.ext.viewcode',
4848
'sphinx.ext.githubpages',
49+
'sphinxcontrib.fulltoc'
4950
]
5051

5152
# Add any paths that contain templates here, relative to this directory.

docs/example.rst

Lines changed: 158 additions & 3 deletions
@@ -5,6 +5,9 @@ This example explicitly uses the `dirty_data.xlsx` file from the `janitor`_ repo
55

66
.. _janitor: https://github.com/sfirke/janitor
77

8+
Introduction to Dirty Data
9+
--------------------------
10+
811
Here's what the dirty dataframe looks like.
912

1013
.. code-block:: python
@@ -62,9 +65,162 @@ Here's what the dirty dataframe looks like.
6265
11 NaN
6366
12 NaN
6467
65-
Notice how there's an entire row of null values (row 7), as well as two columns of null values (`do not edit! --->` and `Certification.2`).
68+
Cleaning Column Names
69+
---------------------
70+
71+
There are several problems with this data. First, the column names are not lowercase, and they contain spaces, which makes them cumbersome to use programmatically. To solve this, we can use the :py:meth:`clean_names` method. We first pass the dataframe to the :py:class:`janitor.DataFrame()` constructor (just a thin wrapper, really), and then call the :py:meth:`clean_names()` method on the result.
72+
73+
.. code-block:: python
74+
75+
df_clean = jn.DataFrame(df).clean_names()
76+
print(df_clean.head(2))
77+
78+
Notice that the column names are now lowercase, with spaces replaced by underscores.
79+
80+
.. code-block:: none
81+
82+
first_name last_name employee_status subject hire_date %_allocated \
83+
0 Jason Bourne Teacher PE 39690.0 0.75
84+
1 Jason Bourne Teacher Drafting 39690.0 0.25
85+
86+
full_time? do_not_edit!_---> certification certification.1 certification.2
87+
0 Yes NaN Physical ed Theater NaN
88+
1 Yes NaN Physical ed Theater NaN
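
To make the renaming rule concrete, here is a minimal sketch in plain Python. It is not pyjanitor's implementation; the helper ``clean_column_names`` is hypothetical, and the raw header strings are assumed from the cleaned output above rather than read from the spreadsheet.

```python
import re


def clean_column_names(names):
    """Lowercase each name and replace runs of whitespace with underscores."""
    return [re.sub(r"\s+", "_", name.strip().lower()) for name in names]


# Raw headers as assumed from the dirty spreadsheet (illustrative only).
raw_names = ["First Name", "% Allocated", "Full time?", "do not edit! --->"]
cleaned_names = clean_column_names(raw_names)
print(cleaned_names)
```

Punctuation such as ``%`` and ``?`` is left untouched by this rule, which is why those columns still need renaming later on.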
89+
90+
If you squint at the unclean dataset, you'll notice one row and two columns that are entirely empty. We can fix this too! Building on the code block above, let's now remove the empty row and columns using the :py:meth:`remove_empty()` method:
91+
92+
.. code-block:: python
93+
94+
df_clean = jn.DataFrame(df).clean_names().remove_empty()
95+
print(df_clean.head(5))
96+
97+
.. code-block:: none
98+
99+
first_name last_name employee_status subject hire_date %_allocated \
100+
0 Jason Bourne Teacher PE 39690.0 0.75
101+
1 Jason Bourne Teacher Drafting 39690.0 0.25
102+
2 Alicia Keys Teacher Music 37118.0 1.00
103+
3 Ada Lovelace Teacher NaN 27515.0 1.00
104+
4 Desus Nice Administration Dean 41431.0 1.00
105+
106+
full_time? certification certification.1
107+
0 Yes Physical ed Theater
108+
1 Yes Physical ed Theater
109+
2 Yes Instr. music Vocal music
110+
3 Yes PENDING Computers
111+
4 Yes PENDING NaN
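
For comparison, the same effect can be sketched with plain pandas. This is an illustration under assumptions, not necessarily how :py:meth:`remove_empty()` is implemented, and the small dataframe below is made up for the example.

```python
import numpy as np
import pandas as pd

# Toy data: row 1 and the "notes" column contain only missing values.
toy = pd.DataFrame({
    "first_name": ["Jason", np.nan, "Alicia"],
    "subject": ["PE", np.nan, "Music"],
    "notes": [np.nan, np.nan, np.nan],
})

# Drop rows that are entirely null, then columns that are entirely null.
toy_clean = toy.dropna(axis="index", how="all").dropna(axis="columns", how="all")
print(toy_clean)
```

The original index labels are kept, so a dropped row leaves a gap in the index, just as row 7 is absent from the outputs in this example.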
112+
113+
Now this is starting to shape up well!
114+
115+
Renaming Individual Columns
116+
---------------------------
117+
118+
Next, let's rename some of the columns. `%_allocated` and `full_time?` contain non-alphanumeric characters, which makes them a bit harder to use. We can rename them using the :py:meth:`rename_column()` method:
119+
120+
.. code-block:: python
121+
122+
df_clean = (jn.DataFrame(df)
123+
.clean_names()
124+
.remove_empty()
125+
.rename_column("%_allocated", "percent_allocated")
126+
.rename_column("full_time?", "full_time"))
127+
128+
print(df_clean.head(5))
129+
130+
.. code-block:: none
131+
132+
first_name last_name employee_status subject hire_date \
133+
0 Jason Bourne Teacher PE 39690.0
134+
1 Jason Bourne Teacher Drafting 39690.0
135+
2 Alicia Keys Teacher Music 37118.0
136+
3 Ada Lovelace Teacher NaN 27515.0
137+
4 Desus Nice Administration Dean 41431.0
138+
139+
percent_allocated full_time certification certification.1
140+
0 0.75 Yes Physical ed Theater
141+
1 0.25 Yes Physical ed Theater
142+
2 1.00 Yes Instr. music Vocal music
143+
3 1.00 Yes PENDING Computers
144+
4 1.00 Yes PENDING NaN
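
If you are more familiar with plain pandas, the two renames correspond roughly to a single ``DataFrame.rename`` call. The sketch below uses a made-up two-row dataframe and is only meant to show the mapping.

```python
import pandas as pd

# Toy data with the two awkward column names from the example.
toy = pd.DataFrame({"%_allocated": [0.75, 0.25], "full_time?": ["Yes", "Yes"]})

# Map each old column name to its replacement.
toy_renamed = toy.rename(columns={
    "%_allocated": "percent_allocated",
    "full_time?": "full_time",
})
print(list(toy_renamed.columns))
```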
145+
146+
147+
Note how we now have really nice column names! You might be wondering why I'm not modifying the two certification columns -- that is the next thing we'll tackle.
148+
149+
Coalescing Columns
150+
------------------
151+
152+
If we look more closely at the two `certification` columns, we'll see that they look like this:
153+
154+
.. code-block:: python
155+
156+
print(df_clean[['certification', 'certification.1']])
157+
158+
.. code-block:: none
159+
160+
certification certification.1
161+
0 Physical ed Theater
162+
1 Physical ed Theater
163+
2 Instr. music Vocal music
164+
3 PENDING Computers
165+
4 PENDING NaN
166+
5 Science 6-12 Physics
167+
6 Science 6-12 Physics
168+
8 NaN English 6-12
169+
9 PENDING NaN
170+
10 Physical ed NaN
171+
11 NaN Political sci.
172+
12 Vocal music English
173+
174+
Rows 8 and 11 have NaN in the left certification column, but have a value in the right certification column. Let's assume for a moment that the left certification column is intended to record the first certification that a teacher had obtained. In this case, the values in the right certification column on rows 8 and 11 should be moved to the first column. Let's do that with Janitor, using the :py:meth:`coalesce()` method, which does the following:
175+
176+
.. code-block:: python
177+
178+
df_clean = (jn.DataFrame(df)
179+
.clean_names()
180+
.remove_empty()
181+
.rename_column("%_allocated", "percent_allocated")
182+
.rename_column("full_time?", "full_time")
183+
.coalesce(columns=['certification', 'certification.1'], new_column_name='certification'))
184+
185+
print(df_clean)
186+
187+
.. code-block:: none
188+
189+
first_name last_name employee_status subject hire_date \
190+
0 Jason Bourne Teacher PE 39690.0
191+
1 Jason Bourne Teacher Drafting 39690.0
192+
2 Alicia Keys Teacher Music 37118.0
193+
3 Ada Lovelace Teacher NaN 27515.0
194+
4 Desus Nice Administration Dean 41431.0
195+
5 Chien-Shiung Wu Teacher Physics 11037.0
196+
6 Chien-Shiung Wu Teacher Chemistry 11037.0
197+
8 James Joyce Teacher English 32994.0
198+
9 Hedy Lamarr Teacher Science 27919.0
199+
10 Carlos Boozer Coach Basketball 42221.0
200+
11 Young Boozer Coach NaN 34700.0
201+
12 Micheal Larsen Teacher English 40071.0
202+
203+
percent_allocated full_time certification
204+
0 0.75 Yes Physical ed
205+
1 0.25 Yes Physical ed
206+
2 1.00 Yes Instr. music
207+
3 1.00 Yes PENDING
208+
4 1.00 Yes PENDING
209+
5 0.50 Yes Science 6-12
210+
6 0.50 Yes Science 6-12
211+
8 0.50 No English 6-12
212+
9 0.50 No PENDING
213+
10 NaN No Physical ed
214+
11 NaN No Political sci.
215+
12 0.80 No Vocal music
216+
217+
Awesome stuff! Instead of two columns of scattered data, we now have one densely populated column.
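
The idea behind coalescing two columns (take the left value, fall back to the right one when the left is missing) can be sketched with pandas' ``Series.combine_first``. This is an illustration on made-up rows, not a description of pyjanitor's internals.

```python
import numpy as np
import pandas as pd

# Toy rows mirroring the pattern above: some left values are missing.
toy = pd.DataFrame({
    "last_name": ["Bourne", "Joyce", "Lamarr", "Boozer"],
    "certification": ["Physical ed", np.nan, "PENDING", np.nan],
    "certification.1": ["Theater", "English 6-12", np.nan, "Political sci."],
})

# Prefer the left column; use the right column only where the left is null.
merged = toy["certification"].combine_first(toy["certification.1"])

# Replace the two source columns with the single coalesced column.
toy_coalesced = (
    toy.drop(columns=["certification", "certification.1"])
       .assign(certification=merged)
)
print(toy_coalesced)
```

Note that a value in the right column is discarded whenever the left column already has one, which matches the output above (for example, "Theater" does not survive for the first row).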
218+
219+
Dealing with Excel Dates
220+
------------------------
66221

67-
To clean up this data, we can use pyjanitor's functions (which are shamelessly copied from the R package).
222+
Finally, notice how the `hire_date` column isn't formatted as a date. Instead, it holds Excel serial date numbers such as 39690.0.
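
To see what such a serial number means, here is a small standard-library sketch. It assumes the workbook uses Excel's default 1900 date system, where serials count days from 1899-12-30 (valid for dates from March 1900 onward); the helper ``excel_serial_to_date`` is hypothetical and not part of pyjanitor.

```python
from datetime import datetime, timedelta

# Assumption: Excel's default 1900 date system, for which 1899-12-30 is the
# conventional zero point (correct for serials of 61 and above).
EXCEL_EPOCH = datetime(1899, 12, 30)


def excel_serial_to_date(serial):
    """Convert an Excel serial day count to a datetime.date."""
    return (EXCEL_EPOCH + timedelta(days=serial)).date()


hire_date = excel_serial_to_date(39690.0)
print(hire_date.isoformat())  # 2008-08-30
```
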
223+
To clean up this data, we can use the :py:meth:`convert_excel_date` method.
68224

69225
.. code-block:: python
70226
@@ -74,7 +230,6 @@ To clean up this data, we can use pyjanitor's functions (which are shamelessly c
74230
.rename_column('%_allocated', 'percent_allocated')
75231
.rename_column('full_time?', 'full_time')
76232
.coalesce(['certification', 'certification.1'], 'certification')
77-
.encode_categorical(['subject', 'employee_status', 'full_time'])
78233
.convert_excel_date('hire_date'))
79234
80235
This gives the output:

environment-dev.yml

Lines changed: 3 additions & 0 deletions
@@ -13,3 +13,6 @@ dependencies:
1313
- recommonmark
1414
- pipreqs
1515
- flake8
16+
- xlrd
17+
- pip:
18+
- sphinxcontrib-fulltoc

requirements.txt

Lines changed: 1 addition & 0 deletions
@@ -2,3 +2,4 @@ pytest==3.4.1
22
pandas==0.22.0
33
numpy==1.14.1
44
setuptools==38.5.1
5+
sphinxcontrib.fulltoc
