Fix: MPI communication errors due to inconsistent R-coordinates in sparse matrix generation #6233

Merged: 8 commits, May 29, 2025

AsTonyshment
Collaborator

Linked Issue

Fix #6229

Description

When sparse matrices are generated using get_R_range, different MPI processes could produce inconsistent sets of R-coordinates. This led to MPI_ERR_TRUNCATE errors during MPI_Allreduce operations, because the data sizes did not match across processes.
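
To illustrate the failure mode: MPI_Allreduce expects every rank to pass the same element count. The following is a minimal sketch without MPI, assuming each rank sizes its reduction buffer from its own local R-coordinate set. The names `RCoor`, `reduce_count`, `counts_consistent`, and the `values_per_R` parameter are hypothetical and not taken from the ABACUS sources.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical type: one R-coordinate as (x, y, z).
using RCoor = std::array<int, 3>;

// Hypothetical: number of values a rank would pass to a reduction when it
// sizes its buffer from its local R set (values_per_R is an assumed block size).
std::size_t reduce_count(const std::set<RCoor>& local_R, std::size_t values_per_R)
{
    assert(values_per_R > 0);
    return local_R.size() * values_per_R;
}

// Returns true only if every rank would pass the same count to the reduction.
bool counts_consistent(const std::vector<std::set<RCoor>>& R_per_rank,
                       std::size_t values_per_R)
{
    if (R_per_rank.empty())
    {
        return true;
    }
    const std::size_t first = reduce_count(R_per_rank.front(), values_per_R);
    for (const auto& local_R : R_per_rank)
    {
        if (reduce_count(local_R, values_per_R) != first)
        {
            return false;
        }
    }
    return true;
}
```

If one rank holds two R-coordinates and another holds only one, `counts_consistent` returns false, which corresponds to the size mismatch described above.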

My solution is:

  1. Synchronize R-coordinates globally:
    • Added sync_all_R_coor to aggregate R-coordinates from all processes via MPI_Allgatherv, ensuring a consistent all_R_coor set across all ranks.

  2. Fix MPI buffer size handling:
    • Corrected buffer size calculations in MPI_Allgatherv to account for 3 integers (x, y, z) per R-coordinate, resolving MPI_ERR_TRUNCATE.
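
The two steps above can be sketched as follows. This is not the merged ABACUS code: it is a self-contained illustration in which the gather is simulated by copying each rank's buffer into place, so it runs without MPI. In a real build, MPI_Allgatherv with MPI_INT would fill the `gathered` buffer using the same `recvcounts` and `displs`. All names here (`RCoor`, `kIntsPerR`, `GatherLayout`, `make_layout`, `sync_all_R_coor_sketch`) are hypothetical.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

using RCoor = std::array<int, 3>;   // hypothetical type: (x, y, z)
constexpr int kIntsPerR = 3;        // each R-coordinate occupies 3 ints

// Flatten a rank's local R set into a contiguous buffer: x0,y0,z0,x1,y1,z1,...
std::vector<int> flatten(const std::set<RCoor>& local_R)
{
    std::vector<int> buf;
    buf.reserve(local_R.size() * kIntsPerR);
    for (const auto& R : local_R)
    {
        buf.insert(buf.end(), R.begin(), R.end());
    }
    return buf;
}

// Receive counts and displacements, expressed in ints rather than coordinates.
// Counting coordinates instead of ints would undersize the receive buffer.
struct GatherLayout
{
    std::vector<int> recvcounts;
    std::vector<int> displs;
    int total = 0;
};

GatherLayout make_layout(const std::vector<int>& n_R_per_rank)
{
    GatherLayout g;
    for (int n : n_R_per_rank)
    {
        g.displs.push_back(g.total);
        g.recvcounts.push_back(n * kIntsPerR);
        g.total += n * kIntsPerR;
    }
    return g;
}

// Rebuild a sorted, duplicate-free set of R-coordinates from the flat buffer.
std::set<RCoor> unflatten(const std::vector<int>& gathered)
{
    assert(gathered.size() % kIntsPerR == 0);
    std::set<RCoor> all_R;
    for (std::size_t i = 0; i < gathered.size(); i += kIntsPerR)
    {
        all_R.insert(RCoor{gathered[i], gathered[i + 1], gathered[i + 2]});
    }
    return all_R;
}

// Stand-in for the gather: every rank ends up with the same union of R sets.
std::set<RCoor> sync_all_R_coor_sketch(const std::vector<std::set<RCoor>>& R_per_rank)
{
    std::vector<int> n_R;
    for (const auto& local_R : R_per_rank)
    {
        n_R.push_back(static_cast<int>(local_R.size()));
    }
    const GatherLayout g = make_layout(n_R);

    std::vector<int> gathered(static_cast<std::size_t>(g.total));
    for (std::size_t r = 0; r < R_per_rank.size(); ++r)
    {
        const std::vector<int> buf = flatten(R_per_rank[r]);
        // With MPI, MPI_Allgatherv would place buf at offset displs[r].
        std::copy(buf.begin(), buf.end(), gathered.begin() + g.displs[r]);
    }
    return unflatten(gathered);
}
```

Because the result is a `std::set`, coordinates that appear on several ranks are stored once, and every rank iterates over them in the same order.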

@AsTonyshment AsTonyshment requested a review from mohanchen May 23, 2025 10:13
@mohanchen mohanchen added the Bugs (bugs only solvable with sufficient knowledge of DFT) and Refactor (refactor ABACUS codes) labels May 24, 2025
@dyzheng dyzheng self-requested a review May 27, 2025 06:11
@mohanchen mohanchen merged commit 5daf5d9 into deepmodeling:develop May 29, 2025
14 checks passed
@AsTonyshment AsTonyshment deleted the fix_scf_MPI_ERR_TRUNCATE branch May 29, 2025 02:21
kluophysics pushed a commit to kluophysics/abacus-develop that referenced this pull request Jun 5, 2025
…arse matrix generation​ (deepmodeling#6233)

* Fix MPI_ERR_TRUNCATE error

* Add MPI compilation macro

* Temp debug info print

* Move sync operation into get_R_range
Labels
Bugs (bugs only solvable with sufficient knowledge of DFT), Refactor (refactor ABACUS codes)
Development

Successfully merging this pull request may close these issues.

Bug: Segmentation fault during SCF calculation in v3.9.0.3+ for specific structure (MPI_ERR_TRUNCATE)
3 participants