[ENH] Improve complete function #933

Merged
merged 36 commits into from
Oct 6, 2021

Conversation

samukweku
Collaborator

@samukweku samukweku commented Sep 24, 2021

PR Description

Please describe the changes proposed in the pull request:

  • Simplify the logic for complete by using pd.merge
  • Simplify the logic for expand_grid, which complete also uses
  • expand_grid now returns a DataFrame with MultiIndex columns
  • Users can reshape the expand_grid columns with collapse_levels or droplevel (power to the user)
  • Speed up complete in scenarios where reindex was previously used
  • Column order is unchanged for complete
  • Use hypothesis as much as possible for testing (credits to @ericmjl)
  • Add an example notebook for complete
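
To illustrate the first bullet, here is a minimal sketch of how a complete-style operation can be expressed with a grid of combinations followed by a left merge. It uses plain pandas only; `complete_sketch` and the sample data are hypothetical and are not the code in this PR.

```python
import pandas as pd


def complete_sketch(df, columns):
    """Expose missing combinations of `columns` as explicit rows.

    Hypothetical helper for illustration; not pyjanitor's implementation.
    """
    # Build every combination of the unique values found in each column.
    uniques = [df[col].drop_duplicates() for col in columns]
    grid = pd.MultiIndex.from_product(uniques, names=columns).to_frame(index=False)
    # A left merge from the full grid keeps every combination; combinations
    # absent from `df` get NaN in the remaining columns.
    out = grid.merge(df, on=columns, how="left")
    # Restore the original column order.
    return out[df.columns]


df = pd.DataFrame(
    {
        "year": [2020, 2020, 2021],
        "site": ["a", "b", "a"],
        "count": [1, 2, 3],
    }
)
completed = complete_sketch(df, ["year", "site"])
print(completed)
```

The combination (2021, "b") does not occur in the input, so it shows up in the result as a new row with a missing `count`.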

Speed check (expand_grid), always to be taken with a pinch of salt:

data = {'height': [60, 70], 'weight': [100, 140, 180], 'sex': ['Male', 'Female']}
ddata = {key : value*20 for key, value in data.items()}

# this PR
%timeit janitor.expand_grid(others = ddata).droplevel(1, 1)
13.6 ms ± 861 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

# dev
%timeit janitor.expand_grid(others = ddata)
12.6 ms ± 586 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
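
The `droplevel(1, 1)` call above removes the second level of the column MultiIndex. The sketch below shows the two ways of flattening such columns on a hand-built frame. The column labels used here are an assumption about what a two-level result looks like, and the string join only approximates what a collapse_levels-style helper does.

```python
import pandas as pd

# Hand-built stand-in for a frame with two-level columns; the exact labels
# that expand_grid emits are an assumption here.
grid = pd.DataFrame(
    [[60, 100], [60, 140], [70, 100], [70, 140]],
    columns=pd.MultiIndex.from_tuples([("height", 0), ("weight", 0)]),
)

# Option 1: drop the second column level, keeping only the top-level names.
dropped = grid.droplevel(1, axis=1)

# Option 2: join the levels of each column into a single string label.
flat = grid.copy()
flat.columns = ["_".join(str(part) for part in col) for col in grid.columns]

print(list(dropped.columns))
print(list(flat.columns))
```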

Comparison with pd.merge, how='cross':

df1 = pd.DataFrame({'left': ['foo', 'bar']})
df2 = pd.DataFrame({'right': [7, 8]})

df1 = pd.concat([df1]*1_000)
df2 = pd.concat([df2]*1_000)

# PR
%timeit janitor.expand_grid(others = {'df1':df1, 'df2':df2}).droplevel(0,1)
243 ms ± 22.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# pd.merge
%timeit df1.merge(df2, how = 'cross')
493 ms ± 65 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
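
For readers unfamiliar with `how='cross'` (available in pandas 1.2 and later), the following sketch shows what the benchmark computes, using the original two-row frames before they are repeated. In the timed run each frame has 2,000 rows, so the result has 4,000,000 rows. The `itertools.product` comparison is added here only as a sanity check.

```python
import itertools

import pandas as pd

df1 = pd.DataFrame({"left": ["foo", "bar"]})
df2 = pd.DataFrame({"right": [7, 8]})

# Cross join: every row of df1 is paired with every row of df2.
crossed = df1.merge(df2, how="cross")

# The same pairs built by hand, for comparison.
expected = pd.DataFrame(
    list(itertools.product(df1["left"], df2["right"])),
    columns=["left", "right"],
)

same = crossed.values.tolist() == expected.values.tolist()
print(crossed)
print(same)
```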

This PR improves complete and expand_grid.

PR Checklist

Please ensure that you have done the following:

  1. Open the PR from a feature branch on your fork. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  2. If you're not on the contributors list, add yourself to AUTHORS.rst.
  3. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests pass
  • Making sure that code coverage doesn't go down

Relevant Reviewers

Please tag maintainers to review.

codecov bot commented Sep 24, 2021

Codecov Report

Merging #933 (0177518) into dev (67535f0) will increase coverage by 0.03%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #933      +/-   ##
==========================================
+ Coverage   95.87%   95.90%   +0.03%     
==========================================
  Files          19       19              
  Lines        2618     2565      -53     
==========================================
- Hits         2510     2460      -50     
+ Misses        108      105       -3     

Member

@ericmjl ericmjl left a comment

Wonderful stuff, thanks @samukweku!

@ericmjl ericmjl merged commit d784261 into pyjanitor-devs:dev Oct 6, 2021
@samukweku samukweku deleted the complete_expand_grid branch October 7, 2021 00:58