Description
Godbolt link: https://godbolt.org/z/KvoMf5GjY
Given this very short example:
#include <math.h>

inline int SQRT(int arg) { return sqrtf(static_cast<float>(arg)); }

template<typename T>
T foo( T a )
{
    return SQRT( (T)a );
}

int main( int argc, char** argv )
{
    return foo( argc );
}
clang generates interesting assembly code when XRay is involved and the flags -O3 -fno-inline -fxray-instrument -fxray-instruction-threshold=1 are used:
main:
        nop word ptr [rax + rax + 512]
        nop word ptr [rax + rax + 512]
        jmp int foo<int>(int)
int foo<int>(int):
        nop word ptr [rax + rax + 512]
        nop word ptr [rax + rax + 512]
        jmp SQRT(int)
SQRT(int):
        nop word ptr [rax + rax + 512]
        [...]
        ret
        nop word ptr cs:[rax + rax + 512]
Both main and int foo<int>(int) have proper sleds for XRay instrumentation. However, both the enter and exit sleds can be found before the actual function content (i.e. the jmp instruction).
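This causes an issue for tools that want to represent a proper tree structure of the functions being called, e.g. performance tools. Because the exit sled of each function fires before the jmp to its callee, the callee appears to start only after its caller has already returned. One would see something like this: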
- ./a.out
  - main
  - int foo<int>(int)
  - SQRT(int)
Instead of
- ./a.out
  - main
    - int foo<int>(int)
      - SQRT(int)
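To illustrate why the event order alone produces the flat shape, here is a minimal sketch of how a tool might rebuild a call tree from enter/exit events using a shadow stack. This is not XRay's or Score-P's actual code: the `Event` type and the functions `render_tree`, `observed_events` and `expected_events` are hypothetical names made up for this example, and the program root (./a.out) is left out for brevity.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical event record for illustration only. A real XRay handler
// receives a function ID and an entry type rather than a name.
enum class Kind { Enter, Exit };

struct Event {
    Kind kind;
    std::string name;
};

// Replays the events against a shadow-stack depth counter and renders one
// indented "- name" line per entered function.
std::string render_tree(const std::vector<Event>& events) {
    std::string out;
    std::size_t depth = 0;
    for (const Event& e : events) {
        if (e.kind == Kind::Enter) {
            out += std::string(2 * depth, ' ') + "- " + e.name + "\n";
            ++depth;
        } else {
            assert(depth > 0 && "exit event without a matching enter");
            --depth;
        }
    }
    return out;
}

// Order seen when the exit sled sits before the jmp: every function
// "exits" before its callee is entered.
std::vector<Event> observed_events() {
    return {
        {Kind::Enter, "main"},              {Kind::Exit, "main"},
        {Kind::Enter, "int foo<int>(int)"}, {Kind::Exit, "int foo<int>(int)"},
        {Kind::Enter, "SQRT(int)"},         {Kind::Exit, "SQRT(int)"},
    };
}

// Order a tool would need to see in order to reconstruct the nested tree.
std::vector<Event> expected_events() {
    return {
        {Kind::Enter, "main"},
        {Kind::Enter, "int foo<int>(int)"},
        {Kind::Enter, "SQRT(int)"},
        {Kind::Exit, "SQRT(int)"},
        {Kind::Exit, "int foo<int>(int)"},
        {Kind::Exit, "main"},
    };
}
```

With the observed order, `render_tree` puts all three functions at the same depth. With the expected order, each callee is indented below its caller. The consumer logic is identical in both cases, so the flat tree comes purely from where the exit events are emitted.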
In the case of LULESH with our current (in-development) XRay instrumentation adapter in Score-P, this even caused an inconsistent profile, probably due to similar reasons.
Given that this is a very constructed case, I don't see this as being a huge issue. However, I think this may be a limitation that should be documented somewhere. I can't immediately think of a solution for this, and I think most people will not encounter this issue. Why would someone prevent inlining with -O3 in the first place? (well, me, because I wanted to test the overhead when filtering functions).