Skip to content

[BACKEND] Set LLVM_ABI_BREAKING_CHECKS to be able to update the llvm version #4512

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Aug 16, 2024

Conversation

karupayun
Copy link
Collaborator

This PR is setting LLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how we do it inside Google. In this way, we are forcing the structures that relies in Hashing.h to use a deterministic seed.

We were not able to update the llvm version. The culprit llvm commit is llvm/llvm-project@ce80c80, that basically uses a non-deterministic seed if the previous "Flag" is set.

There were some discussions in
#4338, and the other option would be to check/replace code and tests that are overly depending on the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and DenseSet mostly).

I think that the priority is first to be able to update the llvm version, but if OpenAI is interested I could work on replacing them (for MapVector), in order to fix all the failing unit tests, and then we can set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".

  • I am not making a trivial change, such as fixing a typo in a comment.
  • I have written a PR description following these
    rules.
  • I have run pre-commit run --from-ref origin/main --to-ref HEAD.
  • This PR does not need a test because it's just setting a variable for LLVM.
  • I have not added any lit tests.

…version

This PR is setting LLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how
we do it inside Google. In this way, we are forcing the structures
that relies in Hashing.h to use a deterministic seed.

We were not able to update the llvm version. The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
triton-lang#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I could work on replacing them
(for MapVector), in order to fix all the failing unit tests, and then we
can set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".
@karupayun karupayun force-pushed the fix-llvm-hash-update branch from 7d9ccab to 6d4ea00 Compare August 13, 2024 16:32
@karupayun karupayun requested a review from chsigg August 13, 2024 16:32
@ThomasRaoux
Copy link
Collaborator

#4338, and the other option would be to check/replace code and tests that are overly depending on the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and DenseSet mostly).

What cases rely on DenseMap determinism? I thought the breaking problem was coming from a mismatch in the compilation. The failures were happening even running triton-opt without running any code.

but if OpenAI is interested I could work on replacing them (for MapVector), in order to fix all the failing unit tests, and then we can set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".

Again I'm curious what would need to be changed

@karupayun
Copy link
Collaborator Author

karupayun commented Aug 14, 2024

What cases rely on DenseMap determinism? I thought the breaking problem was coming from a mismatch in the compilation. The failures were happening even running triton-opt without running any code.

We don't know which cases rely on DenseMap determinism. The author of the llvm-commit suggested this as a potential issue. There are many cases of DenseMap, and it also could be that the issue is not in Triton but in MLIR.

It's being exposed when running triton-opt because it's missing values in a DenseMap. <<UNKNOWN SSA VALUE>> is being printed here.

All the failing tests can be seen in this job. I started to analyze Conversion/divide-by-0.mlir. I can reproduce in my cloned repository (not inside google) by running:
triton-opt --allocate-shared-memory --convert-triton-gpu-to-llvm test/Conversion/divide-by-0.mlir

The result is strange. It's half broken and half not (references broken until 190 and then they are fine):

    <<UNKNOWN SSA VALUE>> = llvm.mlir.constant(0.000000e+00 : f32) : f32
    ...
    ...
    <<UNKNOWN SSA VALUE>> = llvm.mlir.constant(16 : i32) : i32
    <<UNKNOWN SSA VALUE>> = llvm.mul <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>> : i32
    <<UNKNOWN SSA VALUE>> = llvm.add <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>> : i32
    <<UNKNOWN SSA VALUE>> = llvm.add <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>> : i32
    <<UNKNOWN SSA VALUE>> = llvm.mlir.constant(8 : i32) : i32
    %191 = llvm.mul <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>> : i32
    %192 = llvm.add <<UNKNOWN SSA VALUE>>, %191 : i32
    %193 = llvm.add <<UNKNOWN SSA VALUE>>, %191 : i32
    %194 = llvm.mlir.constant(0 : i32) : i32
    %195 = llvm.add <<UNKNOWN SSA VALUE>>, %194 : i32
    ...
    ...
    %311 = llvm.insertvalue %308, %310[1] : !llvm.struct<(f32, f32)> 
    llvm.return

This seems a resizing issue that invalidates hash-keys (inserting while iterating the container??).

As it was suggested in the commit from llvm, I started to replace some dense_map and dense_set to see if the issue got fixed. But there are many references. My next step is to replace these ones.

But setting the contant LLVM_ABI_BREAKING_CHECKS fixed the issue, so I thought on first doing that and updating llvm and then continue with the investigation.

@ThomasRaoux
Copy link
Collaborator

What cases rely on DenseMap determinism? I thought the breaking problem was coming from a mismatch in the compilation. The failures were happening even running triton-opt without running any code.

We don't know which cases rely on DenseMap determinism. The author of the llvm-commit suggested this as a potential issue. There are many cases of DenseMap, and it also could be that the issue is not in Triton but in MLIR.

It's being exposed when running triton-opt because it's missing values in a DenseMap. <<UNKNOWN SSA VALUE>> is being printed here.

All the failing tests can be seen in this job. I started to analyze Conversion/divide-by-0.mlir. I can reproduce in my cloned repository (not inside google) by running: triton-opt --allocate-shared-memory --convert-triton-gpu-to-llvm test/Conversion/divide-by-0.mlir

The result is strange. It's half broken and half not (references broken until 190 and then they are fine):

    <<UNKNOWN SSA VALUE>> = llvm.mlir.constant(0.000000e+00 : f32) : f32
    ...
    ...
    <<UNKNOWN SSA VALUE>> = llvm.mlir.constant(16 : i32) : i32
    <<UNKNOWN SSA VALUE>> = llvm.mul <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>> : i32
    <<UNKNOWN SSA VALUE>> = llvm.add <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>> : i32
    <<UNKNOWN SSA VALUE>> = llvm.add <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>> : i32
    <<UNKNOWN SSA VALUE>> = llvm.mlir.constant(8 : i32) : i32
    %191 = llvm.mul <<UNKNOWN SSA VALUE>>, <<UNKNOWN SSA VALUE>> : i32
    %192 = llvm.add <<UNKNOWN SSA VALUE>>, %191 : i32
    %193 = llvm.add <<UNKNOWN SSA VALUE>>, %191 : i32
    %194 = llvm.mlir.constant(0 : i32) : i32
    %195 = llvm.add <<UNKNOWN SSA VALUE>>, %194 : i32
    ...
    ...
    %311 = llvm.insertvalue %308, %310[1] : !llvm.struct<(f32, f32)> 
    llvm.return

This seems a resizing issue that invalidates hash-keys (inserting while iterating the container??).

As it was suggested in the commit from llvm, I started to replace some dense_map and dense_set to see if the issue got fixed. But there are many references. My next step is to replace these ones.

But setting the contant LLVM_ABI_BREAKING_CHECKS fixed the issue, so I thought on first doing that and updating llvm and then continue with the investigation.

Well those are definitely not a problem of determinism and I don't think converting those DenseMap is the right thing to do. The examples you pointed out should stay dense_map.

Anyway, is this PR ready to land?

@karupayun karupayun marked this pull request as ready for review August 14, 2024 15:47
@karupayun karupayun requested a review from ptillet as a code owner August 14, 2024 15:47
@karupayun
Copy link
Collaborator Author

Well those are definitely not a problem of determinism and I don't think converting those DenseMap is the right thing to do. The examples you pointed out should stay dense_map.

Anyway, is this PR ready to land?

Yes

@ThomasRaoux ThomasRaoux merged commit 73d09c4 into triton-lang:main Aug 16, 2024
6 checks passed
karupayun added a commit to openxla/triton that referenced this pull request Sep 19, 2024
This commit is fixing the root cause of the issue that appeared during
an llvm update, and was raised in triton-lang#4212 (comment).

The issue was temporary fixed in triton-lang#4512,
where we set LLVM_ABI_BREAKING_CHECKS to FORCE_OFF constant in order to
not use a non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80).

A further investigation (with chsigg@) found that the issue is that LLVM
is built with clang11, while Triton use a newer version. The ABI issue
is brought up here: llvm/llvm-project#96282 (comment),
but the consensus seemed to be that this setup is rare.

Updated the clang version to 12 in the ubuntu build fixed the issue and
therefore we can revert setting LLVM_ABI_BREAKING_CHECKS constant. I am
additionaly erasing LLVM_ABI_BREAKING_CHECKS in the setup of the other
hardwares (it seems it was not needed) and I am splitting ubuntu and
macOS configurations because it seems cleaner than having a variable
that sets the compiler version for each of them.
karupayun added a commit to openxla/triton that referenced this pull request Sep 19, 2024
This commit is fixing the root cause of the issue that appeared during
an llvm update, and was raised in triton-lang#4212 (comment).

The issue was temporary fixed in triton-lang#4512,
where we set LLVM_ABI_BREAKING_CHECKS to FORCE_OFF constant in order to
not use a non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80).

A further investigation (with chsigg@) found that the issue is that LLVM
is built with clang11, while Triton use a newer version. The ABI issue
is brought up here: llvm/llvm-project#96282 (comment),
but the consensus seemed to be that this setup is rare.

Updated the clang version to 12 in the ubuntu build fixed the issue and
therefore we can revert setting LLVM_ABI_BREAKING_CHECKS constant. I am
additionaly erasing LLVM_ABI_BREAKING_CHECKS in the setup of the other
hardwares (it seems it was not needed) and I am splitting ubuntu and
macOS configurations because it seems cleaner than having a variable
that sets the compiler version for each of them.
karupayun added a commit that referenced this pull request Sep 26, 2024
This commit is fixing the root cause of the issue that appeared during
an llvm update, and was raised in #4212 (comment).

The issue was temporary fixed in #4512,
where we set LLVM_ABI_BREAKING_CHECKS to FORCE_OFF constant in order to
not use a non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80).

A further investigation (with chsigg@) found that the issue is that LLVM
is built with clang11, while Triton use a newer version. The ABI issue
is brought up here: llvm/llvm-project#96282 (comment),
but the consensus seemed to be that this setup is rare.

Updated the clang version to 12 in the ubuntu build fixed the issue and
therefore we can revert setting LLVM_ABI_BREAKING_CHECKS constant. I am
additionaly erasing LLVM_ABI_BREAKING_CHECKS in the setup of the other
hardwares (it seems it was not needed) and I am splitting ubuntu and
macOS configurations because it seems cleaner than having a variable
that sets the compiler version for each of them.
chsigg pushed a commit to openxla/triton that referenced this pull request Oct 7, 2024
The issue raised in triton-lang#4212 (comment) and temporary fixed in triton-lang#4512 has now been fixed upstream.

The non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80) was disabled for clang <= 11 for non-pic builds.

This was removed again in xgupta/llvm-project@5255b81 and we can enable ABI-breaking checks again.

We build LLVM with PIC. If anyone ever builds with non-PIC, they will get a compilation error and need to disable LLVM_ABI_BREAKING_CHECKS manually.
chsigg pushed a commit to openxla/triton that referenced this pull request Oct 7, 2024
The issue raised in triton-lang#4212 (comment) and temporary fixed in triton-lang#4512 has now been fixed upstream.

The non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80) was disabled for clang <= 11 for non-pic builds.

This was removed again in xgupta/llvm-project@5255b81 and we can enable ABI-breaking checks again.

We build LLVM with PIC. If anyone ever builds with non-PIC, they will get a compilation error and need to disable LLVM_ABI_BREAKING_CHECKS manually.
chsigg pushed a commit that referenced this pull request Oct 10, 2024
The issue raised in #4212 (comment) and temporary fixed in #4512 has now been fixed upstream.

The non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80) was disabled for clang <= 11 for non-pic builds.

This was removed again in llvm/llvm-project@def9550 and we can enable ABI-breaking checks again.

We build LLVM with PIC. If anyone ever builds with non-PIC, they will get a compilation error and need to disable LLVM_ABI_BREAKING_CHECKS manually.
sfzhu93 pushed a commit to sfzhu93/triton that referenced this pull request Oct 11, 2024
The issue raised in triton-lang#4212 (comment) and temporary fixed in triton-lang#4512 has now been fixed upstream.

The non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80) was disabled for clang <= 11 for non-pic builds.

This was removed again in llvm/llvm-project@def9550 and we can enable ABI-breaking checks again.

We build LLVM with PIC. If anyone ever builds with non-PIC, they will get a compilation error and need to disable LLVM_ABI_BREAKING_CHECKS manually.
Luosuu pushed a commit to Luosuu/triton that referenced this pull request Nov 13, 2024
The issue raised in triton-lang#4212 (comment) and temporary fixed in triton-lang#4512 has now been fixed upstream.

The non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80) was disabled for clang <= 11 for non-pic builds.

This was removed again in llvm/llvm-project@def9550 and we can enable ABI-breaking checks again.

We build LLVM with PIC. If anyone ever builds with non-PIC, they will get a compilation error and need to disable LLVM_ABI_BREAKING_CHECKS manually.
guacamoleo pushed a commit to guacamoleo/triton that referenced this pull request Nov 14, 2024
The issue raised in triton-lang#4212 (comment) and temporary fixed in triton-lang#4512 has now been fixed upstream.

The non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80) was disabled for clang <= 11 for non-pic builds.

This was removed again in llvm/llvm-project@def9550 and we can enable ABI-breaking checks again.

We build LLVM with PIC. If anyone ever builds with non-PIC, they will get a compilation error and need to disable LLVM_ABI_BREAKING_CHECKS manually.
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
…version (triton-lang#4512)

This PR is setting LLVM_ABI_BREAKING_CHECKS=FORCE_OFF, similar of how we
do it inside Google. In this way, we are forcing the structures that
relies in Hashing.h to use a deterministic seed.

We were not able to update the llvm version. The culprit llvm commit is
llvm/llvm-project@ce80c80,
that basically uses a non-deterministic seed if the previous "Flag" is
set.

There were some discussions in
triton-lang#4338, and the other option
would be to check/replace code and tests that are overly depending on
the determinism of llvm/include/llvm/ADT/Hashing.h (DenseMap and
DenseSet mostly).

I think that the priority is first to be able to update the llvm
version, but if OpenAI is interested I could work on replacing them (for
MapVector), in order to fix all the failing unit tests, and then we can
set the LLVM_ABI_BREAKING_CHECKS again to "WITH_ASSERTS".

- [x] I am not making a trivial change, such as fixing a typo in a
comment.
- [x] I have written a PR description following these
  [rules](https://cbea.ms/git-commit/#why-not-how).
- [x] I have run `pre-commit run --from-ref origin/main --to-ref HEAD`.
- [x] This PR does not need a test because it's just setting a variable
for LLVM.
- [x] I have not added any `lit` tests.

Co-authored-by: Tori Baker <[email protected]>
bertmaher pushed a commit to bertmaher/triton that referenced this pull request Dec 10, 2024
The issue raised in triton-lang#4212 (comment) and temporary fixed in triton-lang#4512 has now been fixed upstream.

The non-deterministic seed inside the hashing function
(included in this llvm commit: llvm/llvm-project@ce80c80) was disabled for clang <= 11 for non-pic builds.

This was removed again in llvm/llvm-project@def9550 and we can enable ABI-breaking checks again.

We build LLVM with PIC. If anyone ever builds with non-PIC, they will get a compilation error and need to disable LLVM_ABI_BREAKING_CHECKS manually.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants