Perf: optimize the stream strategy in module_gint #5845

Merged
merged 3 commits into from
Jan 10, 2025

dzzz2001
Collaborator

@dzzz2001 dzzz2001 commented Jan 10, 2025

Background

While testing the LCAO GPU version of ABACUS on an A800 GPU, I noticed a large performance difference between two launch configurations on a machine with only 16 cores: OMP_NUM_THREADS=4 mpirun -n 4 versus OMP_NUM_THREADS=1 mpirun -n 16. cal_gint can be approximately 8 times slower with the latter than with the former. Below are the runtime statistics I collected (the test case is si256):

| command | cal_gint_vl | cal_gint_rho | cal_gint_force |
|---|---|---|---|
| omp 4 mpirun 4 | 15.49 | 14.39 | 2.30 |
| omp 1 mpirun 16 | 114.35 | 113.6 | 19.25 |

After reviewing the code, I found that the difference is probably caused by how the GPU code in module_gint sets the number of OpenMP threads:
[image]
The grid integration code always launches num_stream parallel threads (num_stream is typically 4), regardless of whether the system has enough cores. With OMP_NUM_THREADS=1 mpirun -n 16 on a 16-core machine, this likely oversubscribes the cores (with num_stream = 4, up to 64 threads for 16 cores) and costs efficiency. I therefore modified the thread settings to address this issue.
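To illustrate the kind of cap described above, here is a minimal sketch. It is not the code from this PR: the helper names `capped_stream_threads`, `host_thread_budget` and `process_batches` are hypothetical, and using `omp_get_max_threads()` as the thread budget is an assumption about how the limit could be obtained.

```cpp
#include <algorithm>
#include <cassert>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: how many host threads to use for driving GPU streams.
// Never more than the number of streams, never more than the budget, at least 1.
inline int capped_stream_threads(int num_streams, int max_host_threads)
{
    return std::max(1, std::min(num_streams, max_host_threads));
}

// Thread budget of this process: the value OpenMP would use by default
// (for example from OMP_NUM_THREADS), or 1 when built without OpenMP.
inline int host_thread_budget()
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Loop over batches with at most one host thread per stream.
// Returns the number of batches visited.
int process_batches(int num_batches, int num_streams)
{
    const int n_threads = capped_stream_threads(num_streams, host_thread_budget());
    (void)n_threads; // avoids an unused-variable warning in builds without OpenMP
    int done = 0;
#pragma omp parallel for num_threads(n_threads) reduction(+ : done)
    for (int b = 0; b < num_batches; ++b)
    {
        // Real code would enqueue batch b on the stream owned by this thread.
        done += 1;
    }
    return done;
}
```

With OMP_NUM_THREADS=1 this sketch would run the loop on a single thread per MPI rank instead of always spawning num_stream threads, while OMP_NUM_THREADS=4 would still allow 4.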
Additionally, the stream synchronization strategy in module_gint was previously rather coarse. I have reworked it to use CUDA events, which brings some further performance gains. After completing all modifications, I re-measured the runtime for the same test case, with the following results:

| command | cal_gint_vl | cal_gint_rho | cal_gint_force |
|---|---|---|---|
| omp 4 mpirun 4 | 10.99 | 9.97 | 3.99 |
| omp 1 mpirun 16 | 28.60 | 28.60 | 9.14 |

@mohanchen mohanchen added the "GPU & DCU & HPC" (GPU and DCU and HPC related any issues) and "Refactor" (Refactor ABACUS codes) labels Jan 10, 2025
@mohanchen mohanchen merged commit 16714c6 into deepmodeling:develop Jan 10, 2025
14 checks passed
dyzheng pushed a commit to dyzheng/abacus-develop that referenced this pull request Mar 28, 2025
* optimize stream strategy

* limit max threads
dyzheng added a commit that referenced this pull request Mar 28, 2025
* Fix: stress error with Dojo pseudopotential and LIBXC

* Fix: nspin2/4 mismatch with nspin1 with PBE

* Fix: add test case to CI

* Fix: delete useless warning of write_dmr

* Fix: DFTU output format

* Fix: error of noncolin and autoset mag

* Fix: reference of noncolin

* Revert "Fix: nspin2/4 mismatch with nspin1 with PBE"

This reverts commit ffd91ff.

* Perf: optimize the stream strategy in module_gint (#5845)

* optimize stream strategy

* limit max threads

* Fix: modify orb info manually (#5853)

* Fix: parse_expression for scientific notation (#5882)

* Fix: parse_expression for scientific notation

* modify openmp strategy (#5898)

* Fix document description for ocp and ocp_set (#5896)

* Fix: Resolve compilation issue with Libxc 7.0.0 in ABACUS (#5905)

* Fix: Resolve compilation issue with Libxc 7.0.0 in ABACUS

* Fix: Resolve compilation issue with Libxc 7.0.0 in ABACUS: fix a minor test issue (304_NO_GO_AF_atommag)

* Fix a bug and a magic number in module_exx_symmetry (#5848)

* fix a magic number in get_euler_angle

* do not allow higher symmetry of bvk supercell than the original cell

* Docs: update docs about init_wfc (#5912)

* Fix the wrong symmetry analysis at nspin=2 (#5926)

* analyze magnetic group without time-reversal symmetry

* fix: need to calculate direct coordinates again

* fix a bug about hcontainer in exx nscf (#5927)

* fix cmake bug (#5929)

* inline function of complexarray (#5964)

* modify doc (#5965)

* Fix segmentation fault in integrate test 312_NO_GO_wfc_get_wf (#5970)

* Doc: polish Quick Start part of online doc (#6006)

* polish Quick Start in online doc

* set scf_thr 1e-6

* correct typo

* test: fix Dockerfile.intel (#5999)

Co-authored-by: root <pxlxingliang>

* fix the format (#6008)

* Fix : out_mat_dh will lead to different result with MPI-1core with MPI-4core (#6018)

* Fix: Enhance the warning message when the XC name cannot be recognized. (#6025)

* Update latest Intel oneAPI default compiler for cxx (#6035)

* Update latest Intel oneAPI default compiler for cxx

* Update elpa version to newest in demo cmake script

* Fix: Angular momentum quantum number check in reading SOC pseudopot file (#6027)

* Fix the angular momentum quantum number check in reading SOC pseudopot file

* Fix related unit test problem and add an SOC pseudopot file

* Refactor SOC check logic for improved readability

* Feature: support the `default` as the value of `dft_functional` when initialize vdw (#5949)

* Feature: support the `default` as the value of `dft_functional` when initialize vdw

* Refactor a little bit

* Optimize: Compilation time of vdwd3_autoset_xcparam.cpp (#6042)

The compilation time of the vdwd3_autoset_xcparam.cpp file is reduced from 250 seconds to just 5 seconds in my machine.
Thanks to the suggestion from DeepSeek: replacing dynamic initialization with a static array for constructing the std::map

* directly enter exx loop when init_wfc=file (#6019)

* Perf: openmp for cal_force_stress (#5956)

* remove wrong timer

* omp for cal_force_stress

* openmp for cal_force_stress in dftu

* openmp for cal_force_stress in dspin

* little change

* fix bug

* fix a bug

* Fix: DFT+U force&stress with  of some elements are -1 (#6049)

Co-authored-by: dyzheng <[email protected]>

* Fix: add the print header for `cusolvermp` in scf info (#6038)

* fix an output for debug (#6066)

* Perf: optimize cal_DMR and folding_HR (#6068)

* modify variable name

* modify variable name

* change pointer to ptr

* modify variable name

* modify some variable names

* move functions from .cpp to .h

* optimize cal_DMR

* add schedule(dynamic)

* optimize func_folding

* add a check before calculating EXX force (#6067)

* fixing issue #5961 (#6071)

* modify warning output (#6074)

* Version: 3.10.0

---------

Co-authored-by: dzzz2001 <[email protected]>
Co-authored-by: Yu Liu <[email protected]>
Co-authored-by: jiyuyang <[email protected]>
Co-authored-by: Taoni Bao <[email protected]>
Co-authored-by: Qianrui Liu <[email protected]>
Co-authored-by: LUNASEA <[email protected]>
Co-authored-by: wqzhou <[email protected]>
Co-authored-by: Peng Xingliang <[email protected]>
Co-authored-by: Xinyuan Liang <[email protected]>
Co-authored-by: Liang Sun <[email protected]>
Co-authored-by: Chen Nuo <[email protected]>
Co-authored-by: kirk0830 <[email protected]>
Co-authored-by: dyzheng <[email protected]>
Co-authored-by: Jie Bao <[email protected]>
Fisherd99 pushed a commit to Fisherd99/abacus-BSE that referenced this pull request Mar 31, 2025
* optimize stream strategy

* limit max threads