Speed-up neighbors calculation #68

Closed · claudi wants to merge 17 commits

Conversation

@claudi commented Mar 29, 2022

See #61

@claudi (Author) commented Mar 29, 2022

It is left as a draft PR, as I haven't had the chance to run the GPU code.

@claudi (Author) commented Jun 24, 2022

The performance results show that both inference and training now take less time. Most notably, around 90% of the execution time is now spent on the actual inference calculations of the model, up from only about 55% (70% for small molecules). The effect is more pronounced for big molecules/loads, where the percentage of time spent performing the actual calculations goes up to 98%.

This means execution time is now dedicated almost purely to model evaluation (which hasn't changed), rather than auxiliary computations.

It makes sense that the effect is less pronounced for small molecules (although still satisfactory): in those cases the CPU implementation is already fast enough, and the GPU implementation loses non-negligible time to communication. As mentioned, it is still faster overall.

These numbers were measured by profiling a TorchMD_GN.forward call on metro16.

The results are equivalent to those of the original implementation, up to a tolerance (10e-5) in the distances, which was the desired behaviour.
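
A check of this kind can be written roughly as follows (a minimal sketch; `get_neighbor_list` stands in for the new op's Python entry point and is an assumption, as are the helper names, while the reference pairs come from torch_cluster's `radius_graph` as in the original code path):

```python
import torch
from torch_cluster import radius_graph  # backend used by the original radius_graph path


def pair_distances(pos, edge_index):
    # Distance of every reported neighbor pair (edge_index has shape [2, num_pairs]).
    row, col = edge_index
    return (pos[row] - pos[col]).norm(dim=-1)


def check_equivalence(pos, radius, get_neighbor_list, tol=10e-5):
    # `get_neighbor_list` is a placeholder for the new neighbors::get_neighbor_list op.
    ref_edges = radius_graph(pos, r=radius, max_num_neighbors=pos.shape[0])
    new_edges = get_neighbor_list(pos, radius)

    # Compare sorted pair distances rather than raw edge order, since the two
    # implementations may enumerate the pairs differently.
    ref_d = pair_distances(pos, ref_edges).sort().values
    new_d = pair_distances(pos, new_edges).sort().values
    assert ref_d.shape == new_d.shape, "different number of neighbor pairs"
    assert torch.allclose(ref_d, new_d, atol=tol), "distances differ beyond tolerance"
```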

@claudi changed the title from Implement neighbors calculation to Speed-up neighbors calculation on Jun 24, 2022
@claudi marked this pull request as ready for review on June 24, 2022 at 10:45
claudi added 17 commits June 25, 2022 11:32
The pytest library seems to do something weird with the device, making
any batch of GPU launches after the first one always fail... You can
confirm it is not the function itself failing by either running the
same parameters in torchmd-net/neighbors/demo.py or by switching the
order of the pytest parameters so they appear at the beginning.

For example, in this commit the cases for radius equal to 10 fail. If
you change

-@pytest.mark.parametrize('radius', [8, 10])
+@pytest.mark.parametrize('radius', [10, 8])

suddenly the cases for radius 8 are the ones to fail, while the cases
for radius 10 work without a problem.

There is no problem with the CPU executions.
Race conditions should not apply, since all the threads that go through
this path will be writing the same value.
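
For context, the parametrized GPU tests described in the first commit message have roughly this shape (an illustrative sketch only; the actual test file and the new op's Python entry point may be named differently):

```python
import pytest
import torch


@pytest.mark.parametrize('device', ['cpu', 'cuda'])
@pytest.mark.parametrize('radius', [8, 10])
def test_neighbor_list(device, radius):
    if device == 'cuda' and not torch.cuda.is_available():
        pytest.skip('CUDA is not available')

    # Random positions spread over a box a few times larger than the cutoff radius.
    pos = torch.rand(100, 3, device=device) * 3 * radius
    # ... build the neighbor list with the new op and compare it against the
    # reference implementation here ...
```
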
@raimis (Collaborator) commented Jun 27, 2022

Could you post a table with the raw numbers of your benchmarks?

@claudi (Author) commented Jun 28, 2022

Sure, all of this is from metro16.

For the profiles, the absolute time measurements are distorted by the profiling overhead itself; what matters are the percentages.

aten::linear and aten::addmm correspond to the model inference. After optimizing the neighbor search, their share of the total time should be as high as possible.
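
For reference, a profile of this kind can be collected roughly as follows (a sketch; `model` and its inputs are placeholders for a TorchMD_GN instance and a molecule on the GPU, not code from this PR):

```python
import torch


def profile_forward(model, z, pos, batch):
    # Warm-up call so one-off CUDA initialisation does not pollute the numbers.
    with torch.no_grad():
        model(z, pos, batch)
    torch.cuda.synchronize()

    with torch.autograd.profiler.profile(use_cuda=True) as prof, torch.no_grad():
        model(z, pos, batch)
    torch.cuda.synchronize()

    # Sort by total time; aten::linear / aten::addmm should dominate after the change.
    print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=20))
```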

CLN

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 345.437 ms | 1 |
| aten::linear | 80.01 % | 276.414 ms | 34 |
| aten::addmm | 79.69 % | 275.307 ms | 28 |
| cudaFree | 78.60 % | 271.572 ms | 2 |
| radius_graph | 1.23 % | 4.235 ms | 1 |
| cudaLaunchKernel | 0.93 % | 3.210 ms | 286 |
| torch_cluster::radius | 0.55 % | 1.888 ms | 1 |
| aten::nonzero | 0.53 % | 1.826 ms | 6 |
| aten::index_select | 0.49 % | 1.707 ms | 9 |
| aten::mul | 0.47 % | 1.628 ms | 42 |
| aten::index | 0.44 % | 1.507 ms | 6 |
| aten::embedding | 0.36 % | 1.237 ms | 2 |
| aten::masked_select | 0.30 % | 1.036 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 289.450 ms | 1 |
| aten::linear | 94.87 % | 274.645 ms | 34 |
| aten::addmm | 94.46 % | 273.459 ms | 28 |
| cudaFree | 93.16 % | 269.695 ms | 2 |
| cudaLaunchKernel | 0.93 % | 2.699 ms | 231 |
| neighbors::get_neighbor_list | 0.78 % | 2.263 ms | 1 |
| aten::index_select | 0.72 % | 2.081 ms | 14 |
| aten::mul | 0.56 % | 1.635 ms | 42 |
| aten::embedding | 0.43 % | 1.251 ms | 2 |

CLN batch size 64

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 518.933 ms | 1 |
| aten::linear | 55.20 % | 286.492 ms | 34 |
| aten::addmm | 54.99 % | 285.403 ms | 28 |
| cudaFree | 54.28 % | 281.679 ms | 2 |
| cudaMemcpyAsync | 30.14 % | 156.412 ms | 13 |
| aten::item | 19.58 % | 101.631 ms | 6 |
| aten::_local_scalar_dense | 19.58 % | 101.596 ms | 6 |
| radius_graph | 12.85 % | 66.681 ms | 1 |
| aten::nonzero | 10.91 % | 56.635 ms | 6 |
| torch_cluster::radius | 10.73 % | 55.708 ms | 1 |
| aten::masked_select | 9.07 % | 47.086 ms | 2 |
| aten::index | 1.98 % | 10.257 ms | 6 |
| cudaMalloc | 1.68 % | 8.713 ms | 7 |
| aten::empty | 1.58 % | 8.200 ms | 52 |
| aten::full | 1.50 % | 7.770 ms | 2 |
| aten::is_nonzero | 0.97 % | 5.010 ms | 2 |
| cudaLaunchKernel | 0.47 % | 2.424 ms | 293 |
| aten::index_select | 0.32 % | 1.685 ms | 9 |
| aten::mul | 0.29 % | 1.520 ms | 42 |
| aten::embedding | 0.24 % | 1.240 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 299.770 ms | 1 |
| aten::linear | 91.54 % | 274.439 ms | 34 |
| aten::addmm | 91.16 % | 273.321 ms | 28 |
| cudaFree | 89.94 % | 269.639 ms | 2 |
| aten::item | 3.35 % | 10.054 ms | 9 |
| aten::_local_scalar_dense | 3.33 % | 9.998 ms | 9 |
| cudaMemcpyAsync | 3.28 % | 9.820 ms | 9 |
| neighbors::get_neighbor_list | 1.28 % | 3.837 ms | 1 |

FC9

Original

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.99 % | 472.592 ms | 1 |
| aten::linear | 59.15 % | 279.578 ms | 34 |
| aten::addmm | 58.92 % | 278.482 ms | 28 |
| cudaFree | 58.10 % | 274.618 ms | 2 |
| cudaMemcpyAsync | 24.85 % | 117.462 ms | 13 |
| aten::item | 14.45 % | 68.286 ms | 6 |
| aten::_local_scalar_dense | 14.44 % | 68.250 ms | 6 |
| radius_graph | 13.02 % | 61.555 ms | 1 |
| torch_cluster::radius | 11.15 % | 52.713 ms | 1 |
| aten::nonzero | 10.82 % | 51.121 ms | 6 |
| aten::masked_select | 9.64 % | 45.580 ms | 2 |
| cudaMalloc | 1.60 % | 7.559 ms | 9 |
| aten::empty | 1.45 % | 6.847 ms | 52 |
| aten::full | 1.36 % | 6.447 ms | 2 |
| aten::index | 1.33 % | 6.303 ms | 6 |
| cudaLaunchKernel | 0.71 % | 3.347 ms | 293 |
| aten::is_nonzero | 0.65 % | 3.063 ms | 2 |
| aten::mul | 0.39 % | 1.865 ms | 42 |
| aten::index_select | 0.37 % | 1.740 ms | 9 |
| aten::embedding | 0.27 % | 1.262 ms | 2 |

New

| Name | Time % | Total time | # of calls |
| --- | --- | --- | --- |
| main | 99.98 % | 296.037 ms | 1 |
| aten::linear | 92.72 % | 274.534 ms | 34 |
| aten::addmm | 92.32 % | 273.345 ms | 28 |
| cudaFree | 91.01 % | 269.472 ms | 2 |
| aten::item | 2.29 % | 6.785 ms | 7 |
| aten::_local_scalar_dense | 2.28 % | 6.738 ms | 7 |
| cudaMemcpyAsync | 2.23 % | 6.594 ms | 7 |
| cudaLaunchKernel | 0.91 % | 2.686 ms | 231 |
| neighbors::get_neighbor_list | 0.82 % | 2.433 ms | 1 |
| aten::index_select | 0.75 % | 2.210 ms | 14 |
| aten::mul | 0.63 % | 1.868 ms | 42 |
| aten::embedding | 0.42 % | 1.255 ms | 2 |

This selection of examples hopefully covers a wide enough range of scenarios: CLN is one of the smallest molecules, CLN batched 64 times is one of the biggest systems that could be evaluated on the hardware, and FC9 is the biggest molecule that can be executed.

Times (ms)

The following are elapsed times, calculated with the code from your benchmarks notebook. They also show that the new implementation is much more memory efficient, allowing us to run much bigger batch sizes.
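
The timing loop is essentially of this form (a sketch of the measurement, not the notebook's exact code; `model` and its inputs are placeholders):

```python
import time

import torch


def time_forward(model, z, pos, batch, iterations=100):
    with torch.no_grad():
        model(z, pos, batch)  # warm-up
        torch.cuda.synchronize()

        start = time.perf_counter()
        for _ in range(iterations):
            model(z, pos, batch)
        torch.cuda.synchronize()

    # Average elapsed time per forward call, in milliseconds.
    return (time.perf_counter() - start) / iterations * 1000
```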

Original

| Batch size \ Protein | ALA2 | CLN | DHFR | FC9 |
| --- | --- | --- | --- | --- |
| 1 | 5.59 | 6.52 | 50.94 | 124.32 |
| 2 | 5.70 | 7.47 | 95.92 | |
| 4 | 5.94 | 12.41 | 185.88 | |
| 8 | 6.45 | 22.58 | | |
| 16 | 7.38 | 43.11 | | |
| 32 | 9.33 | 84.62 | | |
| 64 | 16.32 | 167.19 | | |
| 128 | 30.56 | | | |
| 256 | 59.32 | | | |
| 512 | 117.01 | | | |

New

| Batch size \ Protein | ALA2 | CLN | DHFR | FC9 |
| --- | --- | --- | --- | --- |
| 1 | 4.95 | 5.09 | 16.53 | 17.14 |
| 2 | 4.97 | 5.17 | 17.27 | 18.84 |
| 4 | 5.00 | 8.20 | 18.78 | 22.53 |
| 8 | 5.10 | 14.64 | 22.06 | 30.08 |
| 16 | 5.15 | 16.31 | 28.89 | 46.46 |
| 32 | 5.36 | 17.50 | 44.78 | 85.30 |
| 64 | 8.78 | 20.10 | 86.33 | 199.83 |
| 128 | 15.83 | 26.49 | 224.14 | 595.96 |
| 256 | 18.34 | 44.89 | 835.28 | 2234.87 |
| 512 | 21.69 | 102.03 | 3724.39 | 9306.15 |
| 1024 | 31.17 | 306.84 | 16505.30 | |

@claudi (Author) commented Jul 5, 2022

@raimis have you had the chance to look at this?

@raimis (Collaborator) commented Jul 6, 2022

Yes, the speed-up for DHFR and FC9 looks very good.

@claudi closed this on Sep 21, 2022