Commit bfea08b

samukweku and ericmjl authored
[BUGFIX] Conditional Joins (#910)
* cleanup docs
* build logic for interval join
* add comments to explain the logic section for interval join build logic
* cleanup for conditional_type_check
* overlapping intervals affecting performance
* all functions use binary search
* edits to _equal_indices
* code for non_equi pairing
* bit of an improvement on strict positions
* updates
* pair conditions in multiple scenarios
* adjustments for multiple conditions
* updates
* performance improvement for duplicates
* docs cleanup
* add test for empty conditions
* updates
* changelog
* fix linting
* add comments in docs
* changelog
* fix docstrings for _interval_ranges
* duplicate check at .25

Co-authored-by: Eric Ma <[email protected]>
1 parent d7b4968 commit bfea08b

4 files changed: +530 -309 lines changed

CHANGELOG.md

Lines changed: 1 addition & 0 deletions
@@ -4,6 +4,7 @@
 - [INF] Simplify a bit linting, use pre-commit as the CI linting checker. @Zeroto521
 - [ENH] Fix bug in `pivot_longer` for wrong output when `names_pattern` is a sequence with a single value. Issue #885 @samukweku
 - [ENH] Deprecate `aggfunc` from `pivot_wider`; aggregation can be chained with pandas' `groupby`.
+- [BUG] Fix conditional join issue for multiple conditions, where pd.eval fails to evaluate if numexpr is installed. #898 @samukweku
 
 ## [v0.21.1] - 2021-08-29
 

janitor/functions.py

Lines changed: 25 additions & 24 deletions
@@ -6558,11 +6558,15 @@ def conditional_join(
 and non-equi joins.
 
 If the join is solely on equality, `pd.merge` function
-is more efficient and should be used instead.
+is more efficient and should be used instead. Infact,
+for multiple conditions where equality is involved,
+a `pd.merge`, followed by filter(via `query` or `loc`)
+is more efficient. This is even more evident when joining
+on strings.
 If you are interested in nearest joins, or rolling joins,
 `pd.merge_asof` covers that. There is also the IntervalIndex,
-which can be more efficient for range joins, if the intervals
-do not overlap.
+which can be more efficient for range joins, especially if
+the intervals do not overlap.
 
 This function returns rows, if any, where values from `df` meet the
 condition(s) for values from `right`. The conditions are passed in
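The docstring text added in this hunk recommends `pd.merge` followed by a filter when an equality condition is part of the join. A minimal sketch of that pattern follows; the frames and their values are made up for illustration (the column names mirror the docstring examples), and this is not code from the commit itself.

```python
import pandas as pd

# Hypothetical sample data; only the column names echo the docstring examples.
df1 = pd.DataFrame({"id": [1, 1, 2, 3], "value_1": [2, 5, 7, 1]})
df2 = pd.DataFrame(
    {
        "id": [1, 1, 2, 4],
        "value_2A": [0, 3, 6, 0],
        "value_2B": [3, 6, 9, 5],
    }
)

# Step 1: let pandas handle the equality condition with a regular merge.
merged = df1.merge(df2, on="id", how="inner")

# Step 2: apply the non-equi conditions as a boolean filter via `loc`.
in_range = (merged["value_1"] >= merged["value_2A"]) & (
    merged["value_1"] <= merged["value_2B"]
)
result = merged.loc[in_range].reset_index(drop=True)
print(result)
```

The merge narrows the candidate pairs to matching `id` values first, so the range filter only runs on those pairs rather than on every combination of rows.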
@@ -6573,11 +6577,8 @@ def conditional_join(
 
 The operator can be any of `==`, `!=`, `<=`, `<`, `>=`, `>`.
 
-If the join operator is a non-equi operator, a binary search is used
-to get the relevant rows; this avoids a cartesian join, and makes the
-process less memory intensive. If it is an equality operator, it simply
-uses pandas' `merge` or `get_indexer_for` method to retrieve the relevant
-rows.
+A binary search is used to get the relevant rows; this avoids
+a cartesian join, and makes the process less memory intensive.
 
 The join is done only on the columns.
 MultiIndex columns are not supported.
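To illustrate how a binary search can stand in for a cartesian join, here is a rough sketch for a single `>=` condition using `numpy.searchsorted`. The helper name `ge_join_indices` is hypothetical and this is not the implementation in `janitor/functions.py`; it also assumes neither input contains nulls.

```python
import numpy as np
import pandas as pd


def ge_join_indices(left: pd.Series, right: pd.Series):
    """Return positional (left, right) index pairs where left >= right.

    Hypothetical helper for illustration; assumes no nulls in either Series.
    """
    right_values = right.to_numpy()
    # Sort the right side once so each left value can be located by binary search.
    order = np.argsort(right_values, kind="stable")
    right_sorted = right_values[order]
    # counts[i] = number of right values that are <= left[i].
    counts = np.searchsorted(right_sorted, left.to_numpy(), side="right")
    left_idx = np.repeat(np.arange(len(left)), counts)
    if len(counts) == 0:
        return left_idx, np.array([], dtype=np.intp)
    # The first counts[i] entries of the sorted order are the matches for left[i].
    right_idx = np.concatenate([order[:c] for c in counts])
    return left_idx, right_idx


left = pd.Series([1, 5, 3])
right = pd.Series([4, 2, 6])
left_idx, right_idx = ge_join_indices(left, right)
print(list(zip(left_idx.tolist(), right_idx.tolist())))
```

Only matching pairs are ever materialised, which is where the memory saving over building all `len(left) * len(right)` combinations comes from.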
@@ -6617,7 +6618,7 @@ def conditional_join(
 Join on equi and non-equi operators is possible::
 
 df1.conditional_join(
-right = df2,
+df2,
 ('id', 'id', '=='),
 ('value_1', 'value_2A', '>='),
 ('value_1', 'value_2B', '<='),
@@ -6634,7 +6635,7 @@ def conditional_join(
 The default join is `inner`. left and right joins are supported as well::
 
 df1.conditional_join(
-right = df2,
+df2,
 ('id', 'id', '=='),
 ('value_1', 'value_2A', '>='),
 ('value_1', 'value_2B', '<='),
@@ -6653,7 +6654,7 @@ def conditional_join(
 
 
 df1.conditional_join(
-right = df2,
+df2,
 ('id', 'id', '=='),
 ('value_1', 'value_2A', '>='),
 ('value_1', 'value_2B', '<='),
@@ -6675,7 +6676,7 @@ def conditional_join(
 Join on just the non-equi joins is also possible::
 
 df1.conditional_join(
-right = df2,
+df2,
 ('value_1', 'value_2A', '>'),
 ('value_1', 'value_2B', '<'),
 how='inner',
@@ -6695,7 +6696,7 @@ def conditional_join(
 relevant dataframe::
 
 df1.conditional_join(
-right = df2,
+df2,
 ('value_1', 'value_2A', '>'),
 ('value_1', 'value_2B', '<'),
 how='inner',
@@ -6714,7 +6715,7 @@ def conditional_join(
 Pandas merge/join is more efficient::
 
 df1.conditional_join(
-right = df2,
+df2,
 ('col_a', 'col_a', '=='),
 sort_by_appearance = True
 )
@@ -6726,7 +6727,7 @@ def conditional_join(
 Join on not equal -> ``!=`` ::
 
 df1.conditional_join(
-right = df2,
+df2,
 ('col_a', 'col_a', '!='),
 sort_by_appearance = True
 )
@@ -6746,7 +6747,7 @@ def conditional_join(
 (this is the default)::
 
 df1.conditional_join(
-right = df2,
+df2,
 ('col_a', 'col_a', '>'),
 sort_by_appearance = False
 )
@@ -6768,6 +6769,11 @@ def conditional_join(
 .. note:: All the columns from `df` and `right`
 are returned in the final output.
 
+.. note:: For multiple condtions, If there are nulls
+in the join columns, they will not be
+preserved for `!=` operator. Nulls are only
+preserved for `!=` operator for single condition.
+
 Functional usage syntax:
 
 .. code-block:: python
@@ -6779,8 +6785,8 @@ def conditional_join(
 right = pd.DataFrame(...)
 
 df = jn.conditional_join(
-df = df,
-right = right,
+df,
+right,
 *conditions,
 sort_by_appearance = True/False,
 suffixes = ("_x", "_y"),
@@ -6791,7 +6797,7 @@ def conditional_join(
 .. code-block:: python
 
 df = df.conditional_join(
-right = right,
+right,
 *conditions,
 sort_by_appearance = True/False,
 suffixes = ("_x", "_y"),
@@ -6821,12 +6827,7 @@ def conditional_join(
 At least one of the values must not be ``None``.
 :returns: A pandas DataFrame of the two merged Pandas objects.
 :raises ValueError: if columns from `df` or `right` is a MultiIndex.
-:raises ValueError: if `right` is an unnamed Series.
 :raises ValueError: if condition in *conditions is not a tuple.
-:raises ValueError: if condition is not length 3.
-:raises ValueError: if `left_on` and `right_on` in condition are not
-both numeric, or string, or datetime.
-
 
 .. # noqa: DAR402
 """
