Speed-up neighbors calculation #68

Closed · claudi wants to merge 17 commits

Conversation

@claudi commented Mar 29, 2022

See #61

@claudi (Author) commented Mar 29, 2022

It is left as a draft PR, as I haven't had the chance to run the GPU code.

@claudi (Author) commented Jun 24, 2022

The performance results show that both inference and training now take less time. Most notably, around 90% of the execution time is now spent on the actual inference calculations of the model, up from only about 55% (70% for small molecules). The effect is more pronounced for big molecules/loads, where the percentage of time spent performing the actual calculations goes up to 98%.

This means execution time is now dedicated almost purely to model evaluation (which hasn't changed), rather than auxiliary computations.

It makes sense that the effect is less pronounced for small molecules (although still satisfactory): in those cases the CPU implementation is already fast enough, and the GPU implementation loses non-negligible time to communication. As mentioned, it is still faster overall.

These numbers were measured by profiling a TorchMD_GN.forward call on metro16.

The results are equivalent to those of the original implementation, up to a tolerance (10e-5) in the distances, which was the desired behaviour.
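
A check of this kind can be written roughly as follows (a minimal sketch; `get_neighbor_list` stands in for the new op's Python entry point and is an assumption, as are the helper names, while the reference pairs come from torch_cluster's `radius_graph` as in the original code path):

```python
import torch
from torch_cluster import radius_graph  # backend used by the original radius_graph path


def pair_distances(pos, edge_index):
    # Distance of every reported neighbor pair (edge_index has shape [2, num_pairs]).
    row, col = edge_index
    return (pos[row] - pos[col]).norm(dim=-1)


def check_equivalence(pos, radius, get_neighbor_list, tol=10e-5):
    # `get_neighbor_list` is a placeholder for the new neighbors::get_neighbor_list op.
    ref_edges = radius_graph(pos, r=radius, max_num_neighbors=pos.shape[0])
    new_edges = get_neighbor_list(pos, radius)

    # Compare sorted pair distances rather than raw edge order, since the two
    # implementations may enumerate the pairs differently.
    ref_d = pair_distances(pos, ref_edges).sort().values
    new_d = pair_distances(pos, new_edges).sort().values
    assert ref_d.shape == new_d.shape, "different number of neighbor pairs"
    assert torch.allclose(ref_d, new_d, atol=tol), "distances differ beyond tolerance"
```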

@claudi changed the title from Implement neighbors calculation to Speed-up neighbors calculation on Jun 24, 2022
@claudi marked this pull request as ready for review on June 24, 2022 at 10:45
claudi added 17 commits June 25, 2022 11:32
The pytest library seems to do something weird with the device, making
any batch of GPU launches after the first one always fail... You can
confirm it is not the function itself failing by either running the
same parameters in torchmd-net/neighbors/demo.py or by switching the
order of the pytest parameters so they appear at the beginning.

For example, in this commit the cases for radius equal to 10 fail. If
you change

-@pytest.mark.parametrize('radius', [8, 10])
+@pytest.mark.parametrize('radius', [10, 8])

suddenly the cases for radius 8 are the ones to fail, while the cases
for radius 10 work without a problem.

There is no problem with the CPU executions.
Race conditions should not apply, since all the threads that go through
this path will be writing the same value.
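
For context, the parametrized GPU tests described in the first commit message have roughly this shape (an illustrative sketch only; the actual test file and the new op's Python entry point may be named differently):

```python
import pytest
import torch


@pytest.mark.parametrize('device', ['cpu', 'cuda'])
@pytest.mark.parametrize('radius', [8, 10])
def test_neighbor_list(device, radius):
    if device == 'cuda' and not torch.cuda.is_available():
        pytest.skip('CUDA is not available')

    # Random positions spread over a box a few times larger than the cutoff radius.
    pos = torch.rand(100, 3, device=device) * 3 * radius
    # ... build the neighbor list with the new op and compare it against the
    # reference implementation here ...
```
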
@raimis (Collaborator) commented Jun 27, 2022

Could you post a table with the raw numbers of your benchmarks?

@claudi (Author) commented Jun 28, 2022

Sure, all of this is from metro16.

For the profiles, the absolute time measurements are distorted by the profiling overhead itself; what matters are the percentages.

aten::linear and aten::addmm correspond to the model inference. After optimizing the neighbor search, their share of the total time should be as high as possible.
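
For reference, a profile of this kind can be collected roughly as follows (a sketch; `model` and its inputs are placeholders for a TorchMD_GN instance and a molecule on the GPU, not code from this PR):

```python
import torch


def profile_forward(model, z, pos, batch):
    # Warm-up call so one-off CUDA initialisation does not pollute the numbers.
    with torch.no_grad():
        model(z, pos, batch)
    torch.cuda.synchronize()

    with torch.autograd.profiler.profile(use_cuda=True) as prof, torch.no_grad():
        model(z, pos, batch)
    torch.cuda.synchronize()

    # Sort by total time; aten::linear / aten::addmm should dominate after the change.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```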

CLN

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 345.437 ms | 1 |
| aten::linear | 80.01 % | 276.414 ms | 34 |
| aten::addmm | 79.69 % | 275.307 ms | 28 |
| cudaFree | 78.60 % | 271.572 ms | 2 |
| radius_graph | 1.23 % | 4.235 ms | 1 |
| cudaLaunchKernel | 0.93 % | 3.210 ms | 286 |
| torch_cluster::radius | 0.55 % | 1.888 ms | 1 |
| aten::nonzero | 0.53 % | 1.826 ms | 6 |
| aten::index_select | 0.49 % | 1.707 ms | 9 |
| aten::mul | 0.47 % | 1.628 ms | 42 |
| aten::index | 0.44 % | 1.507 ms | 6 |
| aten::embedding | 0.36 % | 1.237 ms | 2 |
| aten::masked_select | 0.30 % | 1.036 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 289.450 ms | 1 |
| aten::linear | 94.87 % | 274.645 ms | 34 |
| aten::addmm | 94.46 % | 273.459 ms | 28 |
| cudaFree | 93.16 % | 269.695 ms | 2 |
| cudaLaunchKernel | 0.93 % | 2.699 ms | 231 |
| neighbors::get_neighbor_list | 0.78 % | 2.263 ms | 1 |
| aten::index_select | 0.72 % | 2.081 ms | 14 |
| aten::mul | 0.56 % | 1.635 ms | 42 |
| aten::embedding | 0.43 % | 1.251 ms | 2 |

CLN batch size 64

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 518.933 ms | 1 |
| aten::linear | 55.20 % | 286.492 ms | 34 |
| aten::addmm | 54.99 % | 285.403 ms | 28 |
| cudaFree | 54.28 % | 281.679 ms | 2 |
| cudaMemcpyAsync | 30.14 % | 156.412 ms | 13 |
| aten::item | 19.58 % | 101.631 ms | 6 |
| aten::_local_scalar_dense | 19.58 % | 101.596 ms | 6 |
| radius_graph | 12.85 % | 66.681 ms | 1 |
| aten::nonzero | 10.91 % | 56.635 ms | 6 |
| torch_cluster::radius | 10.73 % | 55.708 ms | 1 |
| aten::masked_select | 9.07 % | 47.086 ms | 2 |
| aten::index | 1.98 % | 10.257 ms | 6 |
| cudaMalloc | 1.68 % | 8.713 ms | 7 |
| aten::empty | 1.58 % | 8.200 ms | 52 |
| aten::full | 1.50 % | 7.770 ms | 2 |
| aten::is_nonzero | 0.97 % | 5.010 ms | 2 |
| cudaLaunchKernel | 0.47 % | 2.424 ms | 293 |
| aten::index_select | 0.32 % | 1.685 ms | 9 |
| aten::mul | 0.29 % | 1.520 ms | 42 |
| aten::embedding | 0.24 % | 1.240 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 299.770 ms | 1 |
| aten::linear | 91.54 % | 274.439 ms | 34 |
| aten::addmm | 91.16 % | 273.321 ms | 28 |
| cudaFree | 89.94 % | 269.639 ms | 2 |
| aten::item | 3.35 % | 10.054 ms | 9 |
| aten::_local_scalar_dense | 3.33 % | 9.998 ms | 9 |
| cudaMemcpyAsync | 3.28 % | 9.820 ms | 9 |
| neighbors::get_neighbor_list | 1.28 % | 3.837 ms | 1 |

FC9

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 472.592 ms | 1 |
| aten::linear | 59.15 % | 279.578 ms | 34 |
| aten::addmm | 58.92 % | 278.482 ms | 28 |
| cudaFree | 58.10 % | 274.618 ms | 2 |
| cudaMemcpyAsync | 24.85 % | 117.462 ms | 13 |
| aten::item | 14.45 % | 68.286 ms | 6 |
| aten::_local_scalar_dense | 14.44 % | 68.250 ms | 6 |
| radius_graph | 13.02 % | 61.555 ms | 1 |
| torch_cluster::radius | 11.15 % | 52.713 ms | 1 |
| aten::nonzero | 10.82 % | 51.121 ms | 6 |
| aten::masked_select | 9.64 % | 45.580 ms | 2 |
| cudaMalloc | 1.60 % | 7.559 ms | 9 |
| aten::empty | 1.45 % | 6.847 ms | 52 |
| aten::full | 1.36 % | 6.447 ms | 2 |
| aten::index | 1.33 % | 6.303 ms | 6 |
| cudaLaunchKernel | 0.71 % | 3.347 ms | 293 |
| aten::is_nonzero | 0.65 % | 3.063 ms | 2 |
| aten::mul | 0.39 % | 1.865 ms | 42 |
| aten::index_select | 0.37 % | 1.740 ms | 9 |
| aten::embedding | 0.27 % | 1.262 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 296.037 ms | 1 |
| aten::linear | 92.72 % | 274.534 ms | 34 |
| aten::addmm | 92.32 % | 273.345 ms | 28 |
| cudaFree | 91.01 % | 269.472 ms | 2 |
| aten::item | 2.29 % | 6.785 ms | 7 |
| aten::_local_scalar_dense | 2.28 % | 6.738 ms | 7 |
| cudaMemcpyAsync | 2.23 % | 6.594 ms | 7 |
| cudaLaunchKernel | 0.91 % | 2.686 ms | 231 |
| neighbors::get_neighbor_list | 0.82 % | 2.433 ms | 1 |
| aten::index_select | 0.75 % | 2.210 ms | 14 |
| aten::mul | 0.63 % | 1.868 ms | 42 |
| aten::embedding | 0.42 % | 1.255 ms | 2 |

This selection of examples hopefully covers a wide enough range of scenarios: CLN is one of the smallest molecules, CLN batched 64 times is one of the biggest systems that could be evaluated on the hardware, and FC9 is the biggest molecule that can be executed.

Times (ms)

The following are elapsed times, calculated with the code from your benchmarks notebook. They also show that the new implementation is much more memory efficient, allowing us to run much bigger batch sizes.
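
The timing loop is essentially of this form (a sketch of the measurement, not the notebook's exact code; `model` and its inputs are placeholders):

```python
import time

import torch


def time_forward(model, z, pos, batch, iterations=100):
    with torch.no_grad():
        model(z, pos, batch)  # warm-up
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iterations):
            model(z, pos, batch)
        torch.cuda.synchronize()

    # Average elapsed time per forward call, in milliseconds.
    return (time.perf_counter() - start) / iterations * 1000
```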

Original

| Batch size \ Protein | ALA2 | CLN | DHFR | FC9 |
| --- | --- | --- | --- | --- |
| 1 | 5.59 | 6.52 | 50.94 | 124.32 |
| 2 | 5.70 | 7.47 | 95.92 | |
| 4 | 5.94 | 12.41 | 185.88 | |
| 8 | 6.45 | 22.58 | | |
| 16 | 7.38 | 43.11 | | |
| 32 | 9.33 | 84.62 | | |
| 64 | 16.32 | 167.19 | | |
| 128 | 30.56 | | | |
| 256 | 59.32 | | | |
| 512 | 117.01 | | | |

New

| Batch size \ Protein | ALA2 | CLN | DHFR | FC9 |
| --- | --- | --- | --- | --- |
| 1 | 4.95 | 5.09 | 16.53 | 17.14 |
| 2 | 4.97 | 5.17 | 17.27 | 18.84 |
| 4 | 5.00 | 8.20 | 18.78 | 22.53 |
| 8 | 5.10 | 14.64 | 22.06 | 30.08 |
| 16 | 5.15 | 16.31 | 28.89 | 46.46 |
| 32 | 5.36 | 17.50 | 44.78 | 85.30 |
| 64 | 8.78 | 20.10 | 86.33 | 199.83 |
| 128 | 15.83 | 26.49 | 224.14 | 595.96 |
| 256 | 18.34 | 44.89 | 835.28 | 2234.87 |
| 512 | 21.69 | 102.03 | 3724.39 | 9306.15 |
| 1024 | 31.17 | 306.84 | 16505.30 | |

@claudi (Author) commented Jul 5, 2022

@raimis have you had the chance to look at this?

@raimis (Collaborator) commented Jul 6, 2022

Yes, the speed-up for DHFR and FC9 looks very good.

@claudi closed this on Sep 21, 2022