Skip to content

Classification tools: use binary search in numpy case #760

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 19, 2023

Conversation

thuydotm
Copy link
Contributor

@thuydotm thuydotm commented Jan 18, 2023

Fixes #758

Internally, most of our classification tools call to _bin(). In both NumPy and CuPy cases, it classifies data into different bins by looping through all the bins sequentially. This PR makes change to the way we look for correct bin of a value by using binary search.

  • binary search in NumPy case
  • compare performance between old approach and the new one

@thuydotm thuydotm added the WIP Work in progress label Jan 18, 2023
@codecov-commenter
Copy link

Codecov Report

Base: 79.88% // Head: 79.90% // Increases project coverage by +0.02% 🎉

Coverage data is based on head (d1bbe88) compared to base (af42544).
Patch coverage: 100.00% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master     #760      +/-   ##
==========================================
+ Coverage   79.88%   79.90%   +0.02%     
==========================================
  Files          19       19              
  Lines        4170     4175       +5     
==========================================
+ Hits         3331     3336       +5     
  Misses        839      839              
Impacted Files Coverage Δ
xrspatial/classify.py 78.43% <100.00%> (+0.40%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@thuydotm
Copy link
Contributor Author

Comparing the benchmarking on Numpy case, we got:

                                       ratio
     [d1bbe883]           [af425448]
<reclassify_binary_search> <master>  
         436±30μs         451±20μs     1.04  classify.EqualInterval.time_equal_interval(100, 1, 'numpy')
         524±20μs         562±10μs     1.07  classify.EqualInterval.time_equal_interval(100, 10, 'numpy')
+        683±20μs      1.13±0.03ms     1.66  classify.EqualInterval.time_equal_interval(100, 100, 'numpy')
      3.88±0.09ms      4.23±0.05ms     1.09  classify.EqualInterval.time_equal_interval(1000, 1, 'numpy')
+      10.4±0.2ms       12.5±0.1ms     1.21  classify.EqualInterval.time_equal_interval(1000, 10, 'numpy')
+      21.2±0.3ms         69.1±1ms     3.26  classify.EqualInterval.time_equal_interval(1000, 100, 'numpy')
          687±4ms          717±3ms     1.04  classify.EqualInterval.time_equal_interval(10000, 1, 'numpy')
+      1.31±0.02s       1.55±0.01s     1.18  classify.EqualInterval.time_equal_interval(10000, 10, 'numpy')
+      2.42±0.01s       7.39±0.02s     3.06  classify.EqualInterval.time_equal_interval(10000, 100, 'numpy')
         699±30μs         810±30μs    ~1.16  classify.EqualInterval.time_equal_interval(300, 1, 'numpy')
      1.41±0.03ms      1.63±0.03ms    ~1.16  classify.EqualInterval.time_equal_interval(300, 10, 'numpy')
+     2.47±0.06ms      6.63±0.05ms     2.69  classify.EqualInterval.time_equal_interval(300, 100, 'numpy')
       52.6±0.8ms       54.5±0.7ms     1.04  classify.EqualInterval.time_equal_interval(3000, 1, 'numpy')
+         112±2ms          133±2ms     1.19  classify.EqualInterval.time_equal_interval(3000, 10, 'numpy')
+         209±4ms          651±4ms     3.11  classify.EqualInterval.time_equal_interval(3000, 100, 'numpy')
       29.0±0.6ms       28.5±0.4ms     0.98  classify.NaturalBreaks.time_natural_breaks(100, 1, 'numpy')
          361±3ms          356±4ms     0.99  classify.NaturalBreaks.time_natural_breaks(100, 10, 'numpy')
          468±7ms         457±10ms     0.97  classify.NaturalBreaks.time_natural_breaks(1000, 1, 'numpy')
       5.81±0.03s       5.72±0.01s     0.99  classify.NaturalBreaks.time_natural_breaks(1000, 10, 'numpy')
          4.03±0s          4.03±0s     1.00  classify.NaturalBreaks.time_natural_breaks(10000, 1, 'numpy')
       10.2±0.02s       10.5±0.01s     1.03  classify.NaturalBreaks.time_natural_breaks(10000, 10, 'numpy')
          452±5ms          443±7ms     0.98  classify.NaturalBreaks.time_natural_breaks(300, 1, 'numpy')
       5.74±0.01s       5.67±0.01s     0.99  classify.NaturalBreaks.time_natural_breaks(300, 10, 'numpy')
          651±6ms          641±4ms     0.98  classify.NaturalBreaks.time_natural_breaks(3000, 1, 'numpy')
       6.09±0.01s       5.95±0.02s     0.98  classify.NaturalBreaks.time_natural_breaks(3000, 10, 'numpy')
         628±30μs         628±40μs     1.00  classify.Quantile.time_quantile(100, 1, 'numpy')
         836±30μs         788±20μs     0.94  classify.Quantile.time_quantile(100, 10, 'numpy')
+     1.18±0.06ms      1.55±0.06ms     1.31  classify.Quantile.time_quantile(100, 100, 'numpy')
       5.29±0.1ms       5.68±0.2ms     1.07  classify.Quantile.time_quantile(1000, 1, 'numpy')
       25.2±0.2ms       25.5±0.3ms     1.01  classify.Quantile.time_quantile(1000, 10, 'numpy')
+      44.5±0.2ms       84.8±0.5ms     1.91  classify.Quantile.time_quantile(1000, 100, 'numpy')
          826±4ms          857±3ms     1.04  classify.Quantile.time_quantile(10000, 1, 'numpy')
       2.77±0.01s       2.91±0.01s     1.05  classify.Quantile.time_quantile(10000, 10, 'numpy')
+      4.62±0.01s       8.89±0.01s     1.92  classify.Quantile.time_quantile(10000, 100, 'numpy')
         997±60μs         984±70μs     0.99  classify.Quantile.time_quantile(300, 1, 'numpy')
      2.83±0.03ms      2.80±0.09ms     0.99  classify.Quantile.time_quantile(300, 10, 'numpy')
+      4.71±0.1ms       8.31±0.2ms     1.76  classify.Quantile.time_quantile(300, 100, 'numpy')
       61.5±0.6ms       63.9±0.5ms     1.04  classify.Quantile.time_quantile(3000, 1, 'numpy')
          232±2ms          244±2ms     1.05  classify.Quantile.time_quantile(3000, 10, 'numpy')
+         409±2ms          786±7ms     1.92  classify.Quantile.time_quantile(3000, 100, 'numpy')
         352±30μs         379±10μs     1.08  classify.Reclassify.time_reclassify(100, 1, 'numpy')
          464±4μs         495±20μs     1.07  classify.Reclassify.time_reclassify(100, 10, 'numpy')
+        601±20μs      1.07±0.02ms     1.78  classify.Reclassify.time_reclassify(100, 100, 'numpy')
+      1.94±0.1ms       2.51±0.2ms     1.29  classify.Reclassify.time_reclassify(1000, 1, 'numpy')
+     8.42±0.05ms       11.0±0.1ms     1.30  classify.Reclassify.time_reclassify(1000, 10, 'numpy')
+      19.3±0.2ms       66.8±0.4ms     3.46  classify.Reclassify.time_reclassify(1000, 100, 'numpy')
+         267±2ms          309±2ms     1.16  classify.Reclassify.time_reclassify(10000, 1, 'numpy')
+        886±10ms          1.16±0s     1.31  classify.Reclassify.time_reclassify(10000, 10, 'numpy')
+      1.95±0.01s       7.00±0.02s     3.58  classify.Reclassify.time_reclassify(10000, 100, 'numpy')
         519±30μs         536±10μs     1.03  classify.Reclassify.time_reclassify(300, 1, 'numpy')
      1.20±0.07ms      1.52±0.08ms    ~1.26  classify.Reclassify.time_reclassify(300, 10, 'numpy')
+     2.13±0.03ms       6.65±0.2ms     3.13  classify.Reclassify.time_reclassify(300, 100, 'numpy')
       16.0±0.6ms       19.2±0.4ms    ~1.20  classify.Reclassify.time_reclassify(3000, 1, 'numpy')
+        72.3±1ms         97.1±1ms     1.34  classify.Reclassify.time_reclassify(3000, 10, 'numpy')
+         170±4ms          608±2ms     3.59  classify.Reclassify.time_reclassify(3000, 100, 'numpy')

When the number of bins is 1, the perf is almost the same
When the number of bins is 10, the perf is about equal to ~1.2 times faster
When the number of bins is 100, the perf is about ~1.6-3.5 times faster depending on the size of input data array.

@thuydotm thuydotm changed the title Classification tools: use binary search Classification tools: use binary search in numpy case Jan 18, 2023
@thuydotm thuydotm requested a review from brendancol January 18, 2023 14:53
@thuydotm thuydotm added ready to merge PR is ready to merge and removed WIP Work in progress labels Jan 18, 2023
Copy link
Contributor

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@thuydotm LGTM...should we make a new issue for CuPy case?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ready to merge PR is ready to merge
Projects
None yet
Development

Successfully merging this pull request may close these issues.

classify.reclassify(): consider using binary search?
3 participants