Skip to content

[ENH] Add null_flag function #501 #510

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 10 commits into from
Aug 4, 2019
Merged

[ENH] Add null_flag function #501 #510

merged 10 commits into from
Aug 4, 2019

Conversation

anzelpwj
Copy link
Collaborator

@anzelpwj anzelpwj commented Jul 29, 2019

PR Description

Adds a null flag function, which creates a new column in a DataFrame to mark which rows had null values. You can choose a subset of columns to consider as needed.

This PR resolves #501.

PR Checklist

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:master, but rather from <your_username>:<branch_name>.
  2. If you're not on the contributors list, add yourself to AUTHORS.rst.
  3. Add a line to CHANGELOG.rst under the latest version header (i.e. the one that is "on deck") describing the contribution.

Quick Check

To do a very quick check that everything is correct, follow these steps below:

  • Run the command make check from pyjanitor's top-level directory. This will automatically run:
    • black formatting
    • pycodestyle checking
    • running the test suite
    • docs build

Once done, please check off the check-box above.

If make check does not work for you, you can execute the commands listed in the Makefile individually.

Code Changes

If you are adding code changes, please ensure the following:

  • Ensure that you have added tests.
  • Run all tests ($ pytest .) locally on your machine.
    • Check to ensure that test coverage covers the lines of code that you have added.
    • Ensure that all tests pass.

Relevant Reviewers

Please tag maintainers to review.

@codecov
Copy link

codecov bot commented Jul 29, 2019

Codecov Report

Merging #510 into dev will increase coverage by 0.03%.
The diff coverage is 95.45%.

@@            Coverage Diff             @@
##              dev     #510      +/-   ##
==========================================
+ Coverage   92.88%   92.92%   +0.03%     
==========================================
  Files          10       10              
  Lines         872      891      +19     
==========================================
+ Hits          810      828      +18     
- Misses         62       63       +1

@anzelpwj
Copy link
Collaborator Author

Just realized I probably need to add some files to the docs folder. Will do that tomorrow.

@ericmjl ericmjl changed the title [ENH] Add null_flag function (See #501) [ENH] Add null_flag function #501 Jul 29, 2019
Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR, @anzelpwj! I think this will be a great addition to the library. From this first-pass review, I have a few requested changes, mostly centering around docstrings + the tests, with the comments coming from a maintainability perspective. Hope you understand 😄.

@ericmjl
Copy link
Member

ericmjl commented Aug 4, 2019

Everything looks great, thanks @anzelpwj!

@ericmjl ericmjl merged commit 0c5219c into pyjanitor-devs:dev Aug 4, 2019
@hectormz
Copy link
Collaborator

hectormz commented Aug 12, 2019

@ericmjl, @anzelpwj I can move this conversation to an issue if more appropriate.

I rebased recently and included this PR. I'm getting some failing tests that were added with this PR:

  • test_non_method_functional
  • test_functional_on_some_columns
  • test_rename_output_column
  • test_functional_on_all_columns

The issue is that flag_nulls() is adding a column with dtype=int32, but the expected dataframe is adding a column that is int64. Not sure why that might only be happening for me, and passed tests for all of you and Azure.

Adding a line like:

expected = expected.astype({"null_flag": "int32"})

fixes the issues. Any thoughts?

I'm on Windows in a pyjanitor venv w/ Python 3.7.4, Pandas 0.24.2

@anzelpwj
Copy link
Collaborator Author

anzelpwj commented Aug 12, 2019

Hmmm, what if we explicitly set check_column_type='equiv' in the dataframes assert? Does that fix it?

@hectormz
Copy link
Collaborator

Hmm, I believe that is the default, and setting it explicitly didn't change for me. So that's even stranger that this is happening.

@anzelpwj
Copy link
Collaborator Author

It is default, but thought we'd check anyway. I'd be fine with this fix, but TBH we should open up a ticket on the Pandas project to list this as a bug with their assert method.

@Ram-N
Copy link
Contributor

Ram-N commented Aug 13, 2019

I am on Windows10 and getting the exact same 4 tests failing, reported by @hectormz above.

 assert_frame_equal(df, expected)
E       AssertionError: Attributes are different
E       
E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

The thing is, I am just getting the dev env set up. I haven't made a single change to the code or to the docs yet.

Is there any way to 'soften' these tests so that int32 and int64 are treated as equivalent dtypes?

@hectormz
Copy link
Collaborator

@anzelpwj it might be an issue with Pandas assert method. @Ram-N , are you using vanilla Python as well? Not anaconda?

@Ram-N
Copy link
Contributor

Ram-N commented Aug 14, 2019

@hectormz I am using Anaconda's Python.

Python 3.7.3 | packaged by conda-forge | (default, Jul  1 2019, 22:01:29) 
[MSC v.1900 64 bit (AMD64)] :: Anaconda, Inc. on win32

@anzelpwj
Copy link
Collaborator Author

Would y'all like me to make the extra PR for the expected = expected.astype({"null_flag": "int32"}) line?

@Ram-N
Copy link
Contributor

Ram-N commented Aug 14, 2019

@anzelpwj Yes, please. That will help me. I can rebase and continue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[ENH] Add ability to flag null values in columns
4 participants