Skip to content

[ENH} Conditional Join #861

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 55 commits into from
Aug 20, 2021

Conversation

samukweku
Copy link
Collaborator

PR Description

Please describe the changes proposed in the pull request:

  • Supports non-equi joins
  • Supports combination of equi and non-equi joins
  • join operator can be any of ==, !=, <=, <, >=, >, or a combination.
  • Uses binary search (pandas searchsorted), avoiding a cartesian merge, making it less memory intensive
  • output is similar to what obtains from SQL.
  • inner, left, right joins supported. Full outer join is not supported.

This PR resolves #737 .

Please ensure that you have done the following:

  1. PR in from a fork off your branch. Do not PR from <your_username>:dev, but rather from <your_username>:<feature-branch_name>.
  1. If you're not on the contributors list, add yourself to AUTHORS.rst.
  1. Add a line to CHANGELOG.md under the latest version header (i.e. the one that is "on deck") describing the contribution.
    • Do use some discretion here; if there are multiple PRs that are related, keep them in a single line.

Automatic checks

There will be automatic checks run on the PR. These include:

  • Building a preview of the docs on Netlify
  • Automatically linting the code
  • Making sure the code is documented
  • Making sure that all tests are passed
  • Making sure that code coverage doesn't go down.

Relevant Reviewers

Please tag maintainers to review.

@samukweku samukweku self-assigned this Jul 30, 2021
@codecov
Copy link

codecov bot commented Jul 30, 2021

Codecov Report

Merging #861 (308817b) into dev (7189d6d) will increase coverage by 0.73%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##              dev     #861      +/-   ##
==========================================
+ Coverage   94.92%   95.66%   +0.73%     
==========================================
  Files          19       19              
  Lines        2089     2444     +355     
==========================================
+ Hits         1983     2338     +355     
  Misses        106      106              

Comment on lines 6455 to 6456
The function uses binary search to get these rows, thereby
avoiding a cartesian join - this makes it less memory intensive.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SMAHT!

Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tremendous work, @samukweku! I mostly checked for docstrings as I trust the implementation is working (esp. based on the tests). The changes requested here should be pretty minor, and I've included a suggested change to make it easy for you if you decide to accept them.

@samukweku
Copy link
Collaborator Author

@ericmjl @loganthomas @nvamsikrishna05 made some changes; ready for another round of review. Would also appreciate a review of #857

Copy link
Member

@ericmjl ericmjl left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All good from my side, @samukweku!

@ericmjl
Copy link
Member

ericmjl commented Aug 15, 2021

@nvamsikrishna05 would you like to take a stab at a review here? Let us know in dev team chat if you'd like guidance on how we do review - though I suspect you might have a lot of experience to share with us rather than the other way around 😄.

@nvamsikrishna05
Copy link
Collaborator

@ericmjl Thanks for confidence 🙂. It's a go ahead from my side. I do code reviews at my job too but almost all of them are either .Net based / JS based code reviews, so I may not be able to provide pythonic code review comments.

@samukweku Looks good. Approving the PR. you can go ahead and merge it.
Just a couple minor pointers you can consider(coming from a OOPS background)

  • Instead of magic numbers and strings you can possibly use Enums in places where-ever applicable like in the operators(">" etc) to enforce more safety.
  • Use sets instead of tuples if all you want to check is existance. Sets are constant time complexity over tuples and list. In your PR it doesn't make a difference since there are hardly 5-6 items, so no need to change.

@samukweku
Copy link
Collaborator Author

@nvamsikrishna05 thanks; could you point me to the lines of code you are referring to? the enums and the set check? thanks

@nvamsikrishna05
Copy link
Collaborator

@samukweku Just marked couple of instances if you want to have a look.

@ericmjl
Copy link
Member

ericmjl commented Aug 19, 2021

@samukweku same here with the conflicts - when we merge, @nvamsikrishna05 would you like to do the honors?

@nvamsikrishna05 nvamsikrishna05 merged commit 73f896c into pyjanitor-devs:dev Aug 20, 2021
@samukweku samukweku deleted the conditional_join_func branch February 17, 2023 08:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Conditional join
3 participants