Brief Description
Hello everyone, I'd like to suggest a new method (I'll tentatively call it flag_nulls) that adds a new column to the dataframe indicating whether each row contains any null values.
If you are preparing a dataframe for machine learning or similar modeling work, you usually need to fill in the null values with something. However, the fact that a value is missing can be a feature in its own right. For example, if someone submits an insurance claim without providing important information about themselves, that might be a sign that the claim is fraudulent.
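To make the motivation concrete, here is a minimal sketch using only existing pandas calls. The `claims` dataframe and its columns are made up for illustration, and mean imputation is just one possible fill strategy. The point is that the flag has to be recorded before filling, because filling erases the information about which rows had gaps.

```python
import pandas as pd

# Hypothetical data: two applicants left a field blank.
claims = pd.DataFrame({
    "age": [34, None, 51],
    "income": [52000.0, 61000.0, None],
})

# Record missingness first: 1 if any value in the row is null, else 0.
claims["null_flag"] = claims.isna().any(axis=1).astype(int)

# Impute afterwards (column means here); the flag keeps the missingness signal.
filled = claims.fillna(claims.mean(numeric_only=True))

print(filled)
```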
Example API
# Flag null values in columns
df.flag_nulls(column_name='null_flag', columns=None)
# column_name is the name of the new column that gets generated
# columns is a list of the columns to check for null values. If columns is None, all columns are checked
import pandas as pd
df1 = pd.DataFrame({'a': [1, 2, None, 4], 'b': [5.0, None, 7.0, 8.0]})
print(df1.flag_nulls())
a | b | null_flag |
---|---|---|
1 | 5.0 | 0 |
2 | None | 1 |
None | 7.0 | 1 |
4 | 8.0 | 0 |
print(df1.flag_nulls(columns=['a']))
a | b | null_flag |
---|---|---|
1 | 5.0 | 0 |
2 | None | 0 |
None | 7.0 | 1 |
4 | 8.0 | 0 |
print(df1.flag_nulls(columns=['a'], column_name='flag'))
a | b | flag |
---|---|---|
1 | 5.0 | 0 |
2 | None | 0 |
None | 7.0 | 1 |
4 | 8.0 | 0 |
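For discussion, here is a rough sketch of how the behavior in the examples above could be implemented. It is written as a standalone function rather than a dataframe method, and the validation checks and error messages are my own assumptions, not a settled design.

```python
import pandas as pd


def flag_nulls(df, column_name="null_flag", columns=None):
    """Return a copy of df with a 0/1 column marking rows that contain nulls.

    Sketch only: a standalone stand-in for the proposed method.
    """
    # Default to checking every column; accept a single name as a convenience.
    if columns is None:
        columns = list(df.columns)
    elif isinstance(columns, str):
        columns = [columns]
    else:
        columns = list(columns)

    # Assumed behavior: fail loudly on unknown columns or a name clash.
    unknown = [c for c in columns if c not in df.columns]
    if unknown:
        raise ValueError(f"columns not found in dataframe: {unknown}")
    if column_name in df.columns:
        raise ValueError(f"column {column_name!r} already exists")

    # 1 if any checked column is null in the row, otherwise 0.
    flags = df[columns].isna().any(axis=1).astype(int)
    return df.assign(**{column_name: flags})


df1 = pd.DataFrame({"a": [1, 2, None, 4], "b": [5.0, None, 7.0, 8.0]})
print(flag_nulls(df1))
print(flag_nulls(df1, columns=["a"]))
print(flag_nulls(df1, columns=["a"], column_name="flag"))
```

The flag values match the tables above. Note that real pandas output will display the missing entries as NaN and column a as floats, since a numeric column containing None is stored as float64.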
Notes
- Computerphile agrees with me on this strategy as well ;)
- I would like to work on this issue if that's acceptable.