Perf: optimize the stream strategy in module_gint #5845

dzzz2001 · 2025-01-10T07:48:28Z

Background

While testing the LCAO GPU version of ABACUS on an A800 GPU, I noticed a large performance difference between two launch configurations on a machine with only 16 cores. Specifically, `OMP_NUM_THREADS=4 mpirun -n 4` is much faster than `OMP_NUM_THREADS=1 mpirun -n 16`: in cal_gint, the latter can be approximately 8 times slower than the former. Below are the runtime statistics I collected (the test case is si256):

| command | cal_gint_vl | cal_gint_rho | cal_gint_force |
| --- | --- | --- | --- |
| omp 4 mpirun 4 | 15.49 | 14.39 | 2.30 |
| omp 1 mpirun 16 | 114.35 | 113.6 | 19.25 |

After reviewing the code, I found that this difference is probably caused by the OpenMP thread-setting strategy in the GPU code of module_gint:

From the code, it is evident that the grid integration code launches num_stream parallel threads (where num_stream is typically 4) regardless of whether the system has enough cores. With many MPI ranks, the total number of threads can therefore exceed the number of available cores, which likely degrades efficiency. I modified the thread settings here to address this issue.
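The idea of capping the thread count can be sketched as follows. This is a minimal illustration, not the code from this PR: the helper name `capped_thread_count` and its parameters are hypothetical, and it assumes the number of threads a rank may use is known (in an OpenMP build this would typically come from `omp_get_max_threads()`, which reflects `OMP_NUM_THREADS`).

```cpp
#include <algorithm>

// Hypothetical helper (not from the ABACUS source): choose how many host
// threads should drive the GPU streams of one MPI rank.
//   num_streams       - number of CUDA streams the rank would like to use
//   available_threads - threads this rank may use, e.g. the value returned
//                       by omp_get_max_threads() in an OpenMP build
// The result never exceeds available_threads and is always at least 1.
int capped_thread_count(int num_streams, int available_threads)
{
    return std::max(1, std::min(num_streams, available_threads));
}

// Total host threads across all ranks on one node, for a quick
// oversubscription check against the core count.
int total_threads(int num_ranks, int threads_per_rank)
{
    return num_ranks * threads_per_rank;
}
```

With 16 ranks and `OMP_NUM_THREADS=1`, always launching 4 threads per rank gives `total_threads(16, 4)`, i.e. 64 threads on 16 cores, whereas `total_threads(16, capped_thread_count(4, 1))` stays at 16.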
Additionally, the stream synchronization strategy in module_gint was previously rather coarse. I have reworked it to use CUDA events, which yields some further performance gains. After completing all modifications, I re-measured the runtime for the same test case, with the following results:

| command | cal_gint_vl | cal_gint_rho | cal_gint_force |
| --- | --- | --- | --- |
| omp 4 mpirun 4 | 10.99 | 9.97 | 3.99 |
| omp 1 mpirun 16 | 28.60 | 28.60 | 9.14 |
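To compare the two measurements, the before/after ratio of each timer can be computed directly. The sketch below is only arithmetic on the numbers reported in the two tables; the names `speedup` and `print_speedups` are made up for illustration.

```cpp
#include <cstdio>

// Hypothetical helper: runtime before the change divided by runtime after it.
// A value above 1 means the second measurement was faster.
double speedup(double before, double after)
{
    return before / after;
}

// Prints the before/after ratio for every timer reported in the tables.
void print_speedups()
{
    struct Row
    {
        const char* config;
        const char* timer;
        double before;
        double after;
    };
    const Row rows[] = {
        {"omp 4 mpirun 4", "cal_gint_vl", 15.49, 10.99},
        {"omp 4 mpirun 4", "cal_gint_rho", 14.39, 9.97},
        {"omp 4 mpirun 4", "cal_gint_force", 2.30, 3.99},
        {"omp 1 mpirun 16", "cal_gint_vl", 114.35, 28.60},
        {"omp 1 mpirun 16", "cal_gint_rho", 113.6, 28.60},
        {"omp 1 mpirun 16", "cal_gint_force", 19.25, 9.14},
    };
    for (const Row& r : rows)
    {
        std::printf("%-16s %-15s %.2f\n", r.config, r.timer, speedup(r.before, r.after));
    }
}
```

For the 16-rank configuration, the cal_gint_vl ratio is 114.35 / 28.60, roughly 4. For the 4-rank configuration, the cal_gint_force ratio is 2.30 / 3.99, which is below 1, so that timer was slower in the second measurement.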

* optimize stream strategy
* limit max threads



dzzz2001 added 2 commits January 9, 2025 12:21

optimize stream strategy

b3b948c

limit max threads

2acab4b

dzzz2001 requested review from mohanchen and goodchong January 10, 2025 07:50

Merge branch 'develop' into develop

e34e1b4

mohanchen approved these changes Jan 10, 2025


mohanchen added the "GPU & DCU & HPC" and "Refactor" labels Jan 10, 2025

mohanchen merged commit 16714c6 into deepmodeling:develop Jan 10, 2025
14 checks passed

dyzheng pushed a commit to dyzheng/abacus-develop that referenced this pull request Mar 28, 2025

Perf: optimize the stream strategy in module_gint (deepmodeling#5845)

816fe47

* optimize stream strategy * limit max threads

Fisherd99 pushed a commit to Fisherd99/abacus-BSE that referenced this pull request Mar 31, 2025

Perf: optimize the stream strategy in module_gint (deepmodeling#5845)

512501c

* optimize stream strategy * limit max threads
