Speed-up neighbors calculation #68
Conversation
It is left as a draft PR, as I haven't had the chance to run the GPU code.
The performance results show that both inference and training now take less time. Most notably, about 90% of execution time is now spent on the actual inference calculations of the model, up from only about 55% (70% for small molecules). The effect is more pronounced for big molecules/loads, where the fraction of time spent performing the actual calculations goes up to 98%. This means execution time is now dedicated almost purely to model evaluation (which hasn't changed), rather than to auxiliary computations. It makes sense that the effect is less pronounced for small molecules (although still satisfactory), as the CPU implementation is fast enough in those cases and the GPU implementation loses non-negligible time to communication; as mentioned, it is still faster. These figures are as measured by profiling. The results are equivalent to the original implementation, up to a tolerance (10e-5) in the distances, which was the desired behaviour.
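An equivalence check of this kind can be sketched as below. The helper names and the brute-force reference are illustrative, not part of this PR; only the tolerance (10e-5 on pair distances) comes from the description above.

```python
# Hypothetical sketch: verifying that two neighbor-list implementations
# agree on pair distances up to a tolerance. `brute_force_neighbors` is an
# O(n^2) reference, not the code from this PR.
import math

def brute_force_neighbors(positions, radius):
    """Return sorted (i, j, distance) tuples for pairs closer than `radius`."""
    pairs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            d = math.dist(positions[i], positions[j])
            if d < radius:
                pairs.append((i, j, d))
    return sorted(pairs)

def distances_match(pairs_a, pairs_b, tol=10e-5):
    """True if both lists hold the same pairs with distances within `tol`."""
    if len(pairs_a) != len(pairs_b):
        return False
    return all(ia == ib and ja == jb and abs(da - db) <= tol
               for (ia, ja, da), (ib, jb, db) in zip(pairs_a, pairs_b))

pos = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0)]
ref = brute_force_neighbors(pos, radius=2.0)
# A GPU implementation's output would be compared the same way:
assert distances_match(ref, ref)
```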
The pytest library seems to do something weird with the device, making any batches of GPU launches after the first one always fail... You can confirm it is not the function itself failing by either running the same parameters in torchmd-net/neighbors/demo.py or by switching the order of the pytest parameters so they appear at the beginning. For example, in this commit the cases for radius equal to 10 fail. If you change @pytest.mark.parametrize('radius', [8, 10]) to @pytest.mark.parametrize('radius', [10, 8]), suddenly the cases for radius 8 are the ones to fail, while the cases for radius 10 work without a problem. There is no problem with the CPU executions.
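The reordering experiment can be sketched as follows; the test name and body are placeholders, not the actual test from this repository. The point is that if failures follow a parameter's *position* in the list rather than its value, the harness (here, pytest's handling of the device), not the kernel, is the likely culprit.

```python
# Hypothetical sketch of the parametrize-reordering check.
import pytest

@pytest.mark.parametrize('radius', [8, 10])   # reported: radius=10 cases fail
# @pytest.mark.parametrize('radius', [10, 8]) # reordered: radius=8 cases fail
def test_neighbors_radius(radius):
    # Placeholder assertion; the real test would compare GPU and CPU
    # neighbor lists for this radius.
    assert radius > 0
```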
Race conditions should not apply, since all the threads that go through this path will be writing the same value
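As a language-agnostic illustration of that argument (Python threads standing in for CUDA threads; this is not the kernel code): when every racing writer stores the identical value, the final contents do not depend on thread interleaving.

```python
# Benign-race illustration: many threads write the SAME value to one slot,
# so the outcome is deterministic regardless of scheduling order.
import threading

cell = [None]

def writer():
    cell[0] = 42   # every thread writes the identical value

threads = [threading.Thread(target=writer) for _ in range(16)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert cell[0] == 42  # same result for any interleaving
```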
Could you post a table with the raw numbers of your benchmarks?
Sure, all of this is from metro16. For the profiles, the time measurements are off due to the profiling itself; what matters are the percentages.
Profiles, original vs. new implementation:
- CLN
- CLN, batch size 64
- FC9
[Profiler output tables not preserved in this capture.]
This selection of examples hopefully covers a wide enough range of scenarios: CLN is one of the smallest molecules, CLN batched 64 times is one of the biggest systems that could be evaluated on the hardware, and FC9 is the biggest molecule that can be executed.

Times (ms): the following is elapsed time, which I calculated with the code from your benchmarks notebook. They also show how the new implementation is much more memory efficient, allowing us to run way bigger batch sizes.
[Elapsed-time tables, original vs. new, not preserved in this capture.]
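For context, an elapsed-time measurement of this kind can be sketched with a small warmup-then-median harness; the function and parameter names here are illustrative, not the actual benchmarks-notebook code. For GPU workloads a device synchronization (e.g. torch.cuda.synchronize()) would be needed before reading the clock.

```python
# Hypothetical elapsed-time harness: warm up, then report the median of
# several wall-clock samples in milliseconds.
import statistics
import time

def benchmark(fn, *args, warmup=3, repeats=10):
    """Return the median elapsed time of fn(*args) in milliseconds."""
    for _ in range(warmup):
        fn(*args)                      # warm caches / allocator / JIT
    samples = []
    for _ in range(repeats):
        t0 = time.perf_counter()
        fn(*args)
        samples.append((time.perf_counter() - t0) * 1000.0)
    return statistics.median(samples)

ms = benchmark(sum, range(100_000))   # e.g. time a cheap stand-in workload
```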
@raimis have you had the chance to look at this?
Yes, the speed-up of DHFR and FC9 looks very good.
See #61