forked from abacusmodeling/abacus-develop
Perf: optimize the stream strategy in module_gint #5845
Merged
mohanchen approved these changes on Jan 10, 2025
dyzheng pushed a commit to dyzheng/abacus-develop that referenced this pull request on Mar 28, 2025
* optimize stream strategy
* limit max threads
dyzheng added a commit that referenced this pull request on Mar 28, 2025
* Fix: stress error with Dojo pseudopotential and LIBXC
* Fix: nspin2/4 mismatch with nspin1 with PBE
* Fix: add test case to CI
* Fix: delete useless warning of write_dmr
* Fix: DFTU output format
* Fix: error of noncolin and autoset mag
* Fix: reference of noncolin
* Revert "Fix: nspin2/4 mismatch with nspin1 with PBE" This reverts commit ffd91ff.
* Perf: optimize the stream strategy in module_gint (#5845)
* optimize stream strategy
* limit max threads
* Fix: modify orb info manually (#5853)
* Fix: parse_expression for scientific notation (#5882)
* Fix: parse_expression for scientific notation
* modify openmp strategy (#5898)
* Fix document description for ocp and ocp_set (#5896)
* Fix: Resolve compilation issue with Libxc 7.0.0 in ABACUS (#5905)
* Fix: Resolve compilation issue with Libxc 7.0.0 in ABACUS
* Fix: Resolve compilation issue with Libxc 7.0.0 in ABACUS: fix a minor test issue (304_NO_GO_AF_atommag)
* Fix a bug and a magic number in module_exx_symmetry (#5848)
* fix a magic number in get_euler_angle
* do not allow higher symmetry of bvk supercell than the original cell
* Docs: update docs about init_wfc (#5912)
* Fix the wrong symmetry analysis at nspin=2 (#5926)
* analyze magnetic group without time-reversal symmetry
* fix: need to calculate direct coordinates again
* fix a bug about hcontainer in exx nscf (#5927)
* fix cmake bug (#5929)
* inline function of complexarray (#5964)
* modify doc (#5965)
* Fix segmentation fault in integrate test 312_NO_GO_wfc_get_wf (#5970)
* Doc: polish Quick Start part of online doc (#6006)
* polish Quick Start in online doc
* set scf_thr 1e-6
* correct typo
* test: fix Dockerfile.intel (#5999) Co-authored-by: root <pxlxingliang>
* fix the format (#6008)
* Fix : out_mat_dh will lead to different result with MPI-1core with MPI-4core (#6018)
* Fix: Enhance the warning message when the XC name cannot be recognized. (#6025)
* Update latest Intel oneAPI default compiler for cxx (#6035)
* Update latest Intel oneAPI default compiler for cxx
* Update elpa version to newest in demo cmake script
* Fix: Angular momentum quantum number check in reading SOC pseudopot file (#6027)
* Fix the angular momentum quantum number check in reading SOC pseudopot file
* Fix related unit test problem and add an SOC pseudopot file
* Refactor SOC check logic for improved readability
* Feature: support the `default` as the value of `dft_functional` when initialize vdw (#5949)
* Feature: support the `default` as the value of `dft_functional` when initialize vdw
* Refactor a littble bit
* Optimize: Compilation time of vdwd3_autoset_xcparam.cpp (#6042) The compilation time of the vdwd3_autoset_xcparam.cpp file is reduced from 250 seconds to just 5 seconds in my machine. Thanks to the suggestion from DeepSeek: replacing dynamic initialization with a static array for constructing the std::map
* directly enter exx loop when init_wfc=file (#6019)
* Perf: openmp for cal_force_stress (#5956)
* remove wrong timer
* omp for cal_force_stress
* openmp for cal_force_stress in dftu
* openmp for cal_force_stress in dspin
* little change
* fix bug
* fix a bug
* Fix: DFT+U force&stress with of some elements are -1 (#6049) Co-authored-by: dyzheng
* Fix: add the print header for `cusolvermp` in scf info (#6038)
* fix an output for debug (#6066)
* Perf: optimize cal_DMR and folding_HR (#6068)
* modify variable name
* modify variable name
* change pointer to ptr
* modify variable name
* modify some variable names
* move functions from .cpp to .h
* optimize cal_DMR
* add schedule(dynamic)
* optimize func_folding
* add a check before calculating EXX force (#6067)
* fixing issue #5961 (#6071)
* modify warning output (#6074)
* Version: 3.10.0
---------
Co-authored-by: dzzz2001
Co-authored-by: Yu Liu
Co-authored-by: jiyuyang
Co-authored-by: Taoni Bao
Co-authored-by: Qianrui Liu
Co-authored-by: LUNASEA
Co-authored-by: wqzhou
Co-authored-by: Peng Xingliang
Co-authored-by: Xinyuan Liang
Co-authored-by: Liang Sun
Co-authored-by: Chen Nuo
Co-authored-by: kirk0830
Co-authored-by: dyzheng
Co-authored-by: Jie Bao
Fisherd99 pushed a commit to Fisherd99/abacus-BSE that referenced this pull request on Mar 31, 2025
* optimize stream strategy
* limit max threads
Background
While testing the LCAO GPU version of ABACUS on an A800 GPU, I noticed a large performance gap between two launch configurations on a machine with only 16 cores: OMP_NUM_THREADS=4 mpirun -n 4 and OMP_NUM_THREADS=1 mpirun -n 16. With the latter, cal_gint can be approximately 8 times slower than with the former. Below are the runtime statistics I collected (the test case is si256):
After reviewing the code, I found that the gap is likely caused by how the GPU code in module_gint sets the number of OpenMP threads:

From the code, it is evident that the grid integration always launches num_stream parallel threads (num_stream is typically 4), regardless of whether the system has enough cores. With OMP_NUM_THREADS=1 mpirun -n 16 on a 16-core machine, this means up to 4 threads in each of the 16 processes, or 64 threads competing for 16 cores. This oversubscription likely causes the loss in efficiency. I therefore modified the thread settings to limit the maximum number of threads.
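The thread-count adjustment described above could be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `clamp_stream_threads` and `total_threads_on_node` are hypothetical names, and the sketch assumes the per-process thread budget is whatever OpenMP reports (for example `omp_get_max_threads()`, which reflects `OMP_NUM_THREADS`).

```cpp
#include <algorithm>

// Hypothetical helper, not part of ABACUS: choose how many host threads
// should drive the GPU streams in one MPI process.
//   num_streams       - number of CUDA streams the process wants to use
//   available_threads - thread budget of the process, e.g. the value
//                       returned by omp_get_max_threads()
int clamp_stream_threads(int num_streams, int available_threads)
{
    // Never ask for more threads than the budget allows, and always
    // keep at least one thread so the work can still proceed.
    const int limited = std::min(num_streams, available_threads);
    return std::max(1, limited);
}

// Total host threads on a node when every MPI process uses the same count.
int total_threads_on_node(int mpi_processes, int threads_per_process)
{
    return mpi_processes * threads_per_process;
}
```

With 16 processes and a budget of one thread each, the clamped count is 1 per process (16 threads in total) instead of 4 per process (64 in total). The returned value would then be passed to something like `#pragma omp parallel num_threads(...)`; whether the PR uses exactly this mechanism is an assumption here.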
Additionally, the previous stream synchronization strategy in module_gint was rather coarse. I have reworked it to use CUDA events, which yields some further performance gains. After completing all modifications, I re-measured the runtime for the same test case, with the following results: