-
Notifications
You must be signed in to change notification settings - Fork 0
Xccl process group for Pytorch #1
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
You missed xccl register in https://github.com/Chao1Han/pytorch/blob/master/torch/distributed/distributed_c10d.py |
There is some record function call in ProcessGroupNCCL. Could you check whether those needed on XPU? We need to make sure basic profiler can work on XPU distributed. |
@gujinghui @guangyey @zhangxiaoli73 pls help review again. |
TORCH_CHECK(false, "ProcessGroupXCCL::broadcast not implemented"); | ||
} | ||
|
||
c10::intrusive_ptr<Work> allreduce_sparse( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't have plan to support sparse. Then could we skip this op in your ops register for XPU?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
We don't have plan to support sparse. Then could we skip this op in your ops register for XPU?
LGTM. Next step is to add UTs? |
Yes, will create simple ut in test/distributed/test_c10d_xccl.py since just enable allreduce. |
target_compile_definitions(torch_xpu PUBLIC USE_C10D_XCCL) | ||
set_source_files_properties( | ||
${TORCH_SRC_DIR}/csrc/distributed/c10d/ProcessGroupXCCL.cpp | ||
PROPERTIES COMPILE_DEFINITIONS "CCL_ENABLE_ZE;CCL_ENABLE_SYCL") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those flags are just workaround and will be removed after oneCCL gcc build issue is fixed (target to basekit 2025.0).
@Chao1Han Please make sure your PR include: 1) non-MPI launcher, please pass correct env options to oneCCL 2) Error handling. Abort the work(may need to call some oneCCL finalize API) for timeout. |
Add UT and also some implementation details refine (see my comments). @Chao1Han |
blockingWait_ = getCvarBool(TORCH_XCCL_BLOCKING_WAIT, false); | ||
init(); | ||
|
||
{ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You need to check its launcher is IMPI or not. Only set CCL_LOCAL_RANK/CCL_LOCAL_SIZE for non-MPI launcher
std::lock_guard<std::mutex> lock(kvs_mutex); | ||
if (kvs) | ||
return kvs; | ||
std::string storeKey = "ccl_kvs"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xccl_kvs
[reverseSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reversesquareroot(with:name:)?changes=__8&language=objc) were deprecated in favor of [reciprocalSquareRootWithTensor:](https://developer.apple.com/documentation/metalperformanceshadersgraph/mpsgraph/reciprocalsquareroot(_:name:)?changes=__8&language=objc) Without it, following warnings are generated if compiled on recently released MacOS Sequoia: ``` /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:720:35: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations] 720 | rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil]; | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ | reciprocalSquareRootWithTensor /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:341:10: note: in instantiation of function template specialization 'at::native::batch_norm_backward_mps(const Tensor &, const Tensor &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, const std::optional<Tensor> &, bool, double, std::array<bool, 3>)::(anonymous class)::operator()<MPSGraph *, CachedGraph *>' requested here 341 | decltype(std::declval<_Fp>()(std::declval<_Args>()...)) | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:351:19: note: while substituting deduced template arguments into function template '__invoke' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _Args = <MPSGraph *, CachedGraph *>] 351 | static decltype(std::__invoke(std::declval<_XFp>(), std::declval<_XArgs>()...)) __try_call(int); | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/invoke.h:357:28: note: while substituting deduced template arguments into function template '__try_call' [with _XFp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, _XArgs = (no value)] 357 | using _Result = decltype(__try_call<_Fp, _Args...>(0)); | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:27:32: note: in instantiation of template class 'std::__invokable_r<void, (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &, MPSGraph *, CachedGraph *>' requested here 27 | __expand_to_true<__enable_if_t<_Pred::value>...> __and_helper(int); | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__type_traits/conjunction.h:38:39: note: while substituting explicitly-specified template arguments into function template '__and_helper' 38 | using _And _LIBCPP_NODEBUG = decltype(std::__and_helper<_Pred...>(0)); | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:828:20: note: (skipping 1 context in backtrace; use -ftemplate-backtrace-limit=0 to see all) 828 | bool = _And< _IsNotSame<__remove_cvref_t<_Fp>, function>, __invokable<_Fp, _ArgTypes...> >::value> | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:841:49: note: in instantiation of default argument for '__callable<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68) &>' required here 841 | using _EnableIfLValueCallable = __enable_if_t<__callable<_Fp&>::value>; | ^~~~~~~~~~~~~~~~ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:851:32: note: in instantiation of template type alias '_EnableIfLValueCallable' requested here 851 | template <class _Fp, class = _EnableIfLValueCallable<_Fp>> | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/usr/include/c++/v1/__functional/function.h:852:25: note: in instantiation of default argument for 'function<(lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68)>' required here 852 | _LIBCPP_HIDE_FROM_ABI function(_Fp); | ^~~~~~~~~~~~~ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68: note: while substituting deduced template arguments into function template 'function' [with _Fp = (lambda at /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:68), $1 = (no value)] 623 | auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) { | ^ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:623:24: note: while substituting deduced template arguments into function template 'LookUpOrCreateCachedGraph' [with T = CachedGraph] 623 | auto cachedGraph = LookUpOrCreateCachedGraph<CachedGraph>(key, [&](auto mpsGraph, auto newCachedGraph) { | ^ /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here 123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor | ^ /Users/malfet/git/pytorch/pytorch/aten/src/ATen/native/mps/operations/Normalization.mm:745:37: warning: 'reverseSquareRootWithTensor:name:' is deprecated: first deprecated in macOS 15.0 [-Wdeprecated-declarations] 745 | rsqrtTensor = [mpsGraph reverseSquareRootWithTensor:varianceEpsTensor name:nil]; | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ | reciprocalSquareRootWithTensor /Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX15.0.sdk/System/Library/Frameworks/MetalPerformanceShadersGraph.framework/Headers/MPSGraphArithmeticOps.h:123:1: note: 'reverseSquareRootWithTensor:name:' has been explicitly marked deprecated here 123 | -(MPSGraphTensor *) reverseSquareRootWithTensor:(MPSGraphTensor *) tensor | ^ 2 warnings generated. ``` Pull Request resolved: pytorch#136292 Approved by: https://github.com/kit1980
Summary: att Test Plan: python test/test_quantization.py -k TestObserver Reviewers: Subscribers: Tasks: Tags: Differential Revision: [D62909821](https://our.internmc.facebook.com/intern/diff/D62909821) Pull Request resolved: pytorch#136238 Approved by: https://github.com/tarun292
Fix crash that was happening in pytorch#128095, because we were trying to extract a constant incorrectly. Pull Request resolved: pytorch#136010 Approved by: https://github.com/yanboliang, https://github.com/anijain2305, https://github.com/jansel
Fixes pytorch#127049 There's already a meta func in `meta_registrations.py` for `add_` and `sub_` methods. I added a second meta function for error checking, i.e `int.add/sub_(float)` and `bool.add/sub_(other types)` . Also the corresponding test with Dynamo passes, removed `@xfailIfTorchDynamo`. Pull Request resolved: pytorch#135864 Approved by: https://github.com/williamwen42
Fixes pytorch#128551 As this has been failing in trunk for a while and there is no owner yet to fix it properly. Pull Request resolved: pytorch#136252 Approved by: https://github.com/andrewkho
…ytorch#134568) When tensor folding occurs during matmul operation returned tensor is a view. This can cause issues when matmul is used inside a custom function and such view is then returned as output. Then it cannot be modified inplace and causes errors. It can be especially problematic when after such function inplace allreduce is performed. Issue is resolved when unsafe_view is returned from matmul instead. This solution aligns matmul decomposition with eager implementation in such a way that a non view tensor is returned. Test included in this PR reproduces the issue. Pull Request resolved: pytorch#134568 Approved by: https://github.com/zou3519
…ations (pytorch#136288) Summary: To facilitate PSS-2 upgrade, this uses `ndt.NDArray` instead of `nd.ndarray` in type annotations. In Numpy-1.19 (PSS-1) it's an alias to `nd.ndarray` -- a noop. In Numpy-1.24, `ndt.NDArray` a proper generic type, and without this change uses of `nd.ndarray` generate this Pyre type error: ```counterexample Invalid type parameters [24]: Generic type `np.ndarray` expects 2 type parameters. ``` Test Plan: Sandcastle plus visual inspection Differential Revision: D62977370 Pull Request resolved: pytorch#136288 Approved by: https://github.com/kit1980
This reverts commit 083c914. Reverted pytorch#136213 on behalf of https://github.com/jeanschmidt due to Seems to have introduced regressions in rocm signals ([comment](pytorch#136213 (comment)))
``` CUDA_VISIBLE_DEVICES=2,3,6,7 pytest test/distributed/_composable/test_composability/test_2d_composability.py -k test_train_parity_2d_transformer ``` Differential Revision: [D62964658](https://our.internmc.facebook.com/intern/diff/D62964658) Pull Request resolved: pytorch#136237 Approved by: https://github.com/weifengpy
…6142) Summary: X-link: pytorch/benchmark#2454 This adds structured logging overhead at a per compile basis to compilation metrics. To do so, we track the frame_id_frame_compile_id that trace_structured uses to categorize compiles, and use that as the key in our timing table. Implementation notes: - If there's times we call trace_structured without a compile id, the time won't be measured. Not really a good way around that today given the compile id framework of compilation metrics. Strobelight is still the best way to measure on a per job basis. - We don't actually measure the time it takes to log the compilation metrics itself. Fundamentally, it's not possible to log this properly if we're storing the logging number *in* compilation metrics, since there's no way to measure it before we do it(unless we want discrepancies between dynamo_compile and tlparse, which seems suboptimal). Hopefully for a large job, the cost of structured_logging compilation metrics itself is small. - I wanted to use frame_phase_timing here, but there's a bunch of ids to iron out, and I don't really want to deal with that headache. compilation_time_metrics is sort of what I want, but that isn't by frame/compile id, so it's also a bit off. Putting it into torch.logging as a separate thing so logging tracks its own overhead seems fine, though. Test Plan: Run benchmarks/nanogpt and staging logger. See that the new compilation metric is logged to the staged dynamo_compile table: https://fburl.com/scuba/logger_staging_jjwu_30582a48f1ff9cf5f4ac50a4c40af/xazjg5xq Note that the sum(structured_logging_overhead_s) / sum(entire_frame_compile_time) = 8.387 / 124.278 = 6%, which seems reasonable as the overhead for a small compilation like this. You can also look at samples for a more detailed log of this. Reviewed By: oulgen Differential Revision: D62643611 Pull Request resolved: pytorch#136142 Approved by: https://github.com/bobrenjc93
Signed-off-by: Bob Ren <[email protected]> I was chatting with @jamesjwu about strategies to learn the code and he suggested adding types to some files. This stack of PRs adds types to _sympy/functions.py Pull Request resolved: pytorch#136205 Approved by: https://github.com/Skylion007, https://github.com/jamesjwu
Pull Request resolved: pytorch#136245 Approved by: https://github.com/malfet
…s.h (pytorch#134546) When size value equal to one, tensor strides value need be skipped to compare. @ezyang Pull Request resolved: pytorch#134546 Approved by: https://github.com/janeyx99
…nter (pytorch#136182) Summary: Add a third mode where we only print kernel names without dumping any intermediate actual tensor value info. It can be helpful in quickly identifying the troublesome kernels in CUDA IMA issues. thanks ColinPeppler and henrylhtsang for this "feature request". Test Plan: The output can look like this if set the `AOT_INDUCTOR_DEBUG_INTERMEDIATE_VALUE_PRINTER=3`: {F1871629091} Differential Revision: D62791371 Pull Request resolved: pytorch#136182 Approved by: https://github.com/henrylhtsang
…#136175) Pull Request resolved: pytorch#136175 Approved by: https://github.com/ezyang
Follows pytorch#134537 Pull Request resolved: pytorch#134829 Approved by: https://github.com/ezyang
…rch#136266) This fixes a subset of issues for dynamic shapes + DTensor. It's pretty easy to run into other issues - it's likely that we need pytorch#125941 to land for DTensor + dynamic shapes to work more generally. I ended up writing a test that had dynamic shape inputs but not dynamic shape outputs in order to properly test this fix Pull Request resolved: pytorch#136266 Approved by: https://github.com/ezyang, https://github.com/yf225
…a 12.6 (pytorch#136321) To prepare for future cuda updates. Pull Request resolved: pytorch#136321 Approved by: https://github.com/Skylion007, https://github.com/eqy
Summary: This logs all operations when tracing log level is enabled for the `TCPStoreLibUvBackend`. This is very useful for debugging collective operations when issues occur as it logs all hosts and the keys that they're modifying. To minimize total data we only log the keys and not the values This changes the C10D_* macros to be much more efficient -- previously we would always format the log string even if they would never be printed which is very wasteful for detailed tracing. This now gates them with an if statement to achieve the same behavior with no overhead Test Plan: ``` TORCH_DISTRIBUTED_DEBUG=DETAIL torchrun --nnodes 1 --nproc_per_node 1 --no-python /bin/bash -c "echo foo" ``` ``` I0919 09:26:52.352013 34271 TCPStore.cpp:285] [c10d - debug] The server has started on port = 29500. I0919 09:26:52.352246 34271 socket.cpp:783] [c10d - debug] The client socket will attempt to connect to an IPv6 address of (127.0.0.1, 29500). I0919 09:26:52.352241 36903 TCPStoreLibUvBackend.cpp:1173] [c10d - debug] Uv main loop running I0919 09:26:52.352308 34271 socket.cpp:854] [c10d - trace] The client socket is attempting to connect to [localhost]:29500. I0919 09:26:52.353633 34271 socket.cpp:945] [c10d] The client socket has connected to [localhost]:29500 on SocketImpl(fd=41, addr=[localhost]:45646, remote=[localhost]:29500). I0919 09:26:52.354422 34271 TCPStore.cpp:321] [c10d - debug] TCP client connected to host 127.0.0.1:29500 I0919 09:26:52.354558 36903 TCPStoreLibUvBackend.cpp:774] [c10d - trace] validate magic:1015412686 address:[localhost]:45646 I0919 09:26:52.354638 36903 TCPStoreLibUvBackend.cpp:789] [c10d - trace] ping nonce:34271 address:[localhost]:45646 I0919 09:26:52.356122 36903 TCPStoreLibUvBackend.cpp:866] [c10d - trace] add key:init/ val:1 address:[localhost]:45646 I0919 09:26:52.356308 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.356410 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:init/ address:[localhost]:45646 I0919 09:26:52.358688 36903 TCPStoreLibUvBackend.cpp:808] [c10d - trace] set key:/none/torchelastic/role_info/0 address:[localhost]:45646 I0919 09:26:52.360177 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.360296 36903 TCPStoreLibUvBackend.cpp:1004] [c10d - trace] multi_get key_count:1 address:[localhost]:45646 I0919 09:26:52.362076 36903 TCPStoreLibUvBackend.cpp:1036] [c10d - trace] multi_set key_count:1 address:[localhost]:45646 I0919 09:26:52.364001 36903 TCPStoreLibUvBackend.cpp:930] [c10d - trace] wait key_count:1 address:[localhost]:45646 I0919 09:26:52.364091 36903 TCPStoreLibUvBackend.cpp:846] [c10d - trace] get key:/none/torchelastic/assigned_ranks/0 address:[localhost]:45646 ``` Differential Revision: D62924454 Pull Request resolved: pytorch#136320 Approved by: https://github.com/c-p-i-o, https://github.com/XilunWu
…ers for a better debugging experience (pytorch#135913) Summary: The change involves passing the expired timers to the log_debug_info_for_expired_timers function after to_json() has been applied . This change is made to provide a better debugging experience for the user. Test Plan: unit tests Reviewed By: gag1jain Differential Revision: D62408767 Pull Request resolved: pytorch#135913 Approved by: https://github.com/gag1jain
…ytorch#139659) ### Motivation Today, watchdog only reports that it found a collective timeout: ``` [rank1]:[E1104 14:02:18.767594328 ProcessGroupNCCL.cpp:688] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=200, NumelOut=200, Timeout(ms)=5000) ran for 5096 milliseconds before timing out. ``` While this is nice, it is hard to associate the error with user's program or library stack. ### This PR This PR gives watchdog the ability to report the call-time stack of the collective, so that it would be easier to track the error back to the program's behavior. The call-time stack was recorded by Flight Recorder with minimal overhead (for details, please read this [doc](https://dev-discuss.pytorch.org/t/fast-combined-c-python-torchscript-inductor-tracebacks/1158) written by @zdevito ). In `ProcessGroupNCCL`, we are only tracking / reporting the python part so that it fits most PyTorch users. ### Demo [stack_demo.py](https://gist.github.com/kwen2501/6758e18d305d67fc6f3f926217825c09). ``` TORCH_NCCL_TRACE_BUFFER_SIZE=100 torchrun --nproc-per-node 2 stack_demo.py ``` `TORCH_NCCL_TRACE_BUFFER_SIZE` is for turning on the Flight Recorder. Output: ``` [rank0]:[E1104 14:19:27.591610653 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_reduce from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:2696 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 #2 bar from /data/users/kw2501/sync_async/repro.py:15 #3 foo from /data/users/kw2501/sync_async/repro.py:24 #4 main from /data/users/kw2501/sync_async/repro.py:34 #5 <module> from /data/users/kw2501/sync_async/repro.py:40 [rank1]:[E1104 14:19:27.771430164 ProcessGroupNCCL.cpp:695] Stack trace of the timedout collective operation: #0 all_gather_into_tensor from /data/users/kw2501/pytorch/torch/distributed/distributed_c10d.py:3630 #1 wrapper from /data/users/kw2501/pytorch/torch/distributed/c10d_logger.py:83 #2 baz from /data/users/kw2501/sync_async/repro.py:20 #3 foo from /data/users/kw2501/sync_async/repro.py:26 #4 main from /data/users/kw2501/sync_async/repro.py:34 #5 <module> from /data/users/kw2501/sync_async/repro.py:40 ``` From the log above, we can tell that `bar()` and `baz()` are the places where the two ranks divert. Pull Request resolved: pytorch#139659 Approved by: https://github.com/wconstab, https://github.com/fduwjj
See pytorch#140725 (comment) Running `torch.mps.synchronize()` after metal kernel resulted in infinite wait inside `[_MTLCommandBuffer waitUntilCompleted]` ``` (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP * frame #0: 0x00000001aa919084 Metal`pthread_cond_wait + 12 frame #1: 0x00000001aa78b1b4 Metal`-[_MTLCommandBuffer waitUntilCompleted] + 84 frame #2: 0x00000001032bf358 libtorch_python.dylib`torch::mps::MPSModule_deviceSynchronize(_object*, _object*) + 40 frame #3: 0x0000000100e94c20 Python`cfunction_vectorcall_NOARGS + 100 frame #4: 0x0000000100e389b8 Python`PyObject_Vectorcall + 92 frame #5: 0x0000000100f61e38 Python`_PyEval_EvalFrameDefault + 19040 frame #6: 0x0000000100f5d180 Python`PyEval_EvalCode + 200 frame #7: 0x0000000100fcd1a4 Python`run_eval_code_obj + 104 frame #8: 0x0000000100fccbe4 Python`run_mod + 168 frame #9: 0x0000000100fcb518 Python`pyrun_file + 164 frame #10: 0x0000000100fca854 Python`_PyRun_SimpleFileObject + 256 frame #11: 0x0000000100fca4e8 Python`_PyRun_AnyFileObject + 80 frame #12: 0x0000000100ff2028 Python`pymain_run_file_obj + 164 frame #13: 0x0000000100ff1ce4 Python`pymain_run_file + 72 frame #14: 0x0000000100ff0f74 Python`Py_RunMain + 988 frame #15: 0x0000000100ff1564 Python`pymain_main + 304 frame #16: 0x0000000100ff1604 Python`Py_BytesMain + 40 frame #17: 0x000000019f630274 dyld`start + 2840 ``` Pull Request resolved: pytorch#141296 Approved by: https://github.com/huydhn
…143550) # Motivation Fix pytorch#143543 # Solution We should raise python exception instead of aborting... # Additional Context without this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) terminate called after throwing an instance of 'c10::Error' what(): device is out of range, device is 2, total number of device is 2. Exception raised from check_device_index at /home/dvrogozh/git/pytorch/pytorch/c10/xpu/XPUFunctions.h:36 (most recent call first): frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) + 0xac (0x7f30707eb95c in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0xf3 (0x7f307078fc57 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10.so) frame #2: <unknown function> + 0x19a3e (0x7f3070c2ba3e in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #3: c10::xpu::getCurrentXPUStream(signed char) + 0x2f (0x7f3070c2c83f in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #4: <unknown function> + 0x1ca35 (0x7f3070c2ea35 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libc10_xpu.so) frame #5: <unknown function> + 0x653f15 (0x7f3083391f15 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) frame #6: <unknown function> + 0x39e5f2 (0x7f30830dc5f2 in /home/dvrogozh/git/pytorch/pytorch/torch/lib/libtorch_python.so) <omitting python frames> frame pytorch#20: <unknown function> + 0x29d90 (0x7f308b19bd90 in /lib/x86_64-linux-gnu/libc.so.6) frame pytorch#21: __libc_start_main + 0x80 (0x7f308b19be40 in /lib/x86_64-linux-gnu/libc.so.6) Aborted (core dumped) ``` with this PR: ```python >>> import torch >>> torch.accelerator.current_stream(torch.accelerator.device_count()) Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/home/pt-gpu/4T-4652/guangyey/stock-pytorch/torch/accelerator/__init__.py", line 123, in current_stream return torch._C._accelerator_getStream(device_index) RuntimeError: The device index is out of range. It must be in [0, 2), but got 2. ``` Pull Request resolved: pytorch#143550 Approved by: https://github.com/EikanWang, https://github.com/dvrogozh, https://github.com/albanD
…pytorch#144120) (pytorch#146372) Summary: # Summary ### Sticky points Cuda-graph rng handling has changed / deviated from original implementation. We will be left with a dangling 'offset' val and confusing naming due to BC ## Dependencies - Flash PR: Dao-AILab/flash-attention#1419 ### Other Points - The BC linter is complaining about losing generate.py and its functions which is not real BC surface cc albanD imported-using-ghimport Test Plan: Imported from OSS Building in dev `buck build @//mode/dev-nosan -c fbcode.nvcc_arch=h100a //caffe2:ATen-cu --show-full-output ` I and Nming the .so I do see that the flash symbols are correctly named: ``` 0000000001c3dfb0 t pytorch_flash::run_mha_bwd(pytorch_flash::Flash_bwd_params&, CUstream_st*)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const 0000000001c36080 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const 0000000001c360e0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#2}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const 0000000001c35fc0 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#6}::operator()() const 0000000001c36020 t pytorch_flash::run_mha_fwd(pytorch_flash::Flash_fwd_params&, CUstream_st*, bool)::$_0::operator()() const::{lambda()#1}::operator()() const::{lambda()#1}::operator()() const::{lambda()#7}::operator()() const ``` Reviewed By: vkuzo Differential Revision: D68502879 Pulled By: drisspg Pull Request resolved: pytorch#146372 Approved by: https://github.com/jbschlosser
Which inherits from `RuntimeError` and contains `error_code`, which in case of CUDA should contain error returned by `cudaGetLastError` `torch::detail::_new_accelerator_error_object(c10::AcceleratorError&)` follows the pattern of CPython's [`PyErr_SetString`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L282), namely - Convert cstr into Python string with `PyUnicode_FromString` - Create new exception object using `PyObject_CallOneArg` just like it's done in [`_PyErr_CreateException`](https://github.com/python/cpython/blob/cb8a72b301f47e76d93a7fe5b259e9a5758792e1/Python/errors.c#L32) - Set `error_code` property using `PyObject_SetAttrString` - decref all temporary references Test that it works and captures CPP backtrace (in addition to CI) by running ```python import os os.environ['TORCH_SHOW_CPP_STACKTRACES'] = '1' import torch x = torch.rand(10, device="cuda") y = torch.arange(20, device="cuda") try: x[y] = 2 print(x) except torch.AcceleratorError as e: print("Exception was raised", e.args[0]) print("Captured error code is ", e.error_code) ``` which produces following output ``` Exception was raised CUDA error: device-side assert triggered CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect. For debugging consider passing CUDA_LAUNCH_BLOCKING=1 Compile with `TORCH_USE_CUDA_DSA` to enable device-side assertions. Exception raised from c10_cuda_check_implementation at /home/ubuntu/pytorch/c10/cuda/CUDAException.cpp:41 (most recent call first): C++ CapturedTraceback: #4 std::_Function_handler<std::shared_ptr<c10::LazyValue<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > > const> (), c10::SetStackTraceFetcher(std::function<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > ()>)::{lambda()#1}>::_M_invoke(std::_Any_data const&) from Logging.cpp:0 #5 c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >) from ??:0 #6 c10::cuda::c10_cuda_check_implementation(int, char const*, char const*, int, bool) [clone .cold] from CUDAException.cpp:0 #7 void at::native::gpu_kernel_impl<at::native::AbsFunctor<float> >(at::TensorIteratorBase&, at::native::AbsFunctor<float> const&) [clone .isra.0] from tmpxft_000191fc_00000000-6_AbsKernel.cudafe1.cpp:0 #8 at::native::abs_kernel_cuda(at::TensorIteratorBase&) from ??:0 #9 at::Tensor& at::native::unary_op_impl_with_complex_to_float_out<at::native::abs_stub_DECLARE_DISPATCH_type>(at::Tensor&, at::Tensor const&, at::native::abs_stub_DECLARE_DISPATCH_type&, bool) [clone .constprop.0] from UnaryOps.cpp:0 #10 at::(anonymous namespace)::(anonymous namespace)::wrapper_CUDA_out_abs_out(at::Tensor const&, at::Tensor&) from RegisterCUDA_0.cpp:0 #11 at::_ops::abs_out::call(at::Tensor const&, at::Tensor&) from ??:0 #12 at::native::abs(at::Tensor const&) from ??:0 #13 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeExplicitAutograd__abs>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeExplicitAutograd_0.cpp:0 #14 at::_ops::abs::redispatch(c10::DispatchKeySet, at::Tensor const&) from ??:0 #15 torch::autograd::VariableType::(anonymous namespace)::abs(c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 #16 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (c10::DispatchKeySet, at::Tensor const&), &torch::autograd::VariableType::(anonymous namespace)::abs>, at::Tensor, c10::guts::typelist::typelist<c10::DispatchKeySet, at::Tensor const&> >, at::Tensor (c10::DispatchKeySet, at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from VariableType_1.cpp:0 #17 at::_ops::abs::call(at::Tensor const&) from ??:0 pytorch#18 at::native::isfinite(at::Tensor const&) from ??:0 pytorch#19 c10::impl::wrap_kernel_functor_unboxed_<c10::impl::detail::WrapFunctionIntoFunctor_<c10::CompileTimeFunctionPointer<at::Tensor (at::Tensor const&), &at::(anonymous namespace)::(anonymous namespace)::wrapper_CompositeImplicitAutograd__isfinite>, at::Tensor, c10::guts::typelist::typelist<at::Tensor const&> >, at::Tensor (at::Tensor const&)>::call(c10::OperatorKernel*, c10::DispatchKeySet, at::Tensor const&) from RegisterCompositeImplicitAutograd_0.cpp:0 pytorch#20 at::_ops::isfinite::call(at::Tensor const&) from ??:0 pytorch#21 torch::autograd::THPVariable_isfinite(_object*, _object*, _object*) from python_torch_functions_2.cpp:0 pytorch#22 PyObject_CallFunctionObjArgs from ??:0 pytorch#23 _PyObject_MakeTpCall from ??:0 pytorch#24 _PyEval_EvalFrameDefault from ??:0 pytorch#25 _PyObject_FastCallDictTstate from ??:0 pytorch#26 _PyStack_AsDict from ??:0 pytorch#27 _PyObject_MakeTpCall from ??:0 pytorch#28 _PyEval_EvalFrameDefault from ??:0 pytorch#29 _PyFunction_Vectorcall from ??:0 pytorch#30 _PyEval_EvalFrameDefault from ??:0 pytorch#31 _PyFunction_Vectorcall from ??:0 pytorch#32 _PyEval_EvalFrameDefault from ??:0 pytorch#33 _PyFunction_Vectorcall from ??:0 pytorch#34 _PyEval_EvalFrameDefault from ??:0 pytorch#35 PyFrame_GetCode from ??:0 pytorch#36 PyNumber_Xor from ??:0 pytorch#37 PyObject_Str from ??:0 pytorch#38 PyFile_WriteObject from ??:0 pytorch#39 _PyWideStringList_AsList from ??:0 pytorch#40 _PyDict_NewPresized from ??:0 pytorch#41 _PyEval_EvalFrameDefault from ??:0 pytorch#42 PyEval_EvalCode from ??:0 pytorch#43 PyEval_EvalCode from ??:0 pytorch#44 PyUnicode_Tailmatch from ??:0 pytorch#45 PyInit__collections from ??:0 pytorch#46 PyUnicode_Tailmatch from ??:0 pytorch#47 _PyRun_SimpleFileObject from ??:0 pytorch#48 _PyRun_AnyFileObject from ??:0 pytorch#49 Py_RunMain from ??:0 pytorch#50 Py_BytesMain from ??:0 pytorch#51 __libc_init_first from ??:0 pytorch#52 __libc_start_main from ??:0 pytorch#53 _start from ??:0 Captured error code is 710 ``` Pull Request resolved: pytorch#152023 Approved by: https://github.com/eqy, https://github.com/mradmila, https://github.com/ngimel ghstack dependencies: pytorch#154436
Use uint64_t index types to avoid ``` torch_np/numpy_tests/core/test_einsum.py::TestEinsum::test_einsum_broadcast /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24: runtime error: signed integer overflow: 9223365439786057728 + 13194139533312 cannot be represented in type 'long' #0 0x7f30d26166ba in std::enable_if<std::is_same_v<long, long>, void>::type at::native::cpublas::(anonymous namespace)::gemm_notrans_<long, long, long>(long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:132:24 #1 0x7f30d26166ba in void at::native::cpublas::(anonymous namespace)::gemm_core_<long, long, long>(at::native::TransposeType, at::native::TransposeType, long, long, long, long, long const*, long, long const*, long, long, long*, long) /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:451:12 #2 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const::'lambda2'()::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3 #3 0x7f30d25fba1b in at::native::cpublas::(anonymous namespace)::cpublas_gemm_impl(c10::ScalarType, at::native::TransposeType, at::native::TransposeType, long, long, long, c10::Scalar const&, void const*, long, void const*, long, c10::Scalar const&, void*, long)::$_2::operator()() const /var/lib/jenkins/workspace/aten/src/ATen/native/cpu/BlasKernel.cpp:485:3 ``` Pull Request resolved: pytorch#154809 Approved by: https://github.com/soulitzer
Vibe-coded with Codex, after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b Even though, check for empty tensor list exists in `at::cat` crash might happens while resolving named dimension to position, by calling `dimname_to_position(tensors[0], dim)`, see backtrace below ``` (lldb) up frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10 556 bool has_names() const { 557 // If a user is using unnamed tensors, then we can short-circuit right here. 558 // Otherwise, impl::has_names attempts to retrieve names. -> 559 if (!impl_->has_named_tensor_meta()) { 560 return false; 561 } 562 return impl::has_names(unsafeGetTensorImpl()); (lldb) up frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3 20 int64_t dimname_to_position(const Tensor& tensor, Dimname dim) { 21 TORCH_CHECK(dim.type() != NameType::WILDCARD, 22 "Please look up dimensions by name, got: name = None."); -> 23 TORCH_CHECK(tensor.has_names(), 24 "Name ", dim, " not found in ", toDimnameRepr(tensor), "."); 25 const auto names = tensor.names(); 26 ``` TODOs: - May be move test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable) - Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests Fixes pytorch#155306 Pull Request resolved: pytorch#155383 Approved by: https://github.com/cyyever, https://github.com/ezyang ghstack dependencies: pytorch#155382
No description provided.