forked from pytorch/pytorch
Enable USE_XCCL=ON by default when building PyTorch XPU binary #14
Open · Chao1Han wants to merge 395 commits into main from xccl-on (base: main)
Conversation
Enable USE_XCCL=ON by default when building PyTorch XPU binary
…mplex (pytorch#149692)

Fixes pytorch#149625. For the case mentioned in the issue, the following error is now raised:

```
RuntimeError: Only supports floating-point dtypes, but found: ComplexDouble
```

Pull Request resolved: pytorch#149692
Approved by: https://github.com/malfet
…ytorch#154885)

This mainly follows how it is done for torch_key. The error was:

```
UnicodeDecodeError: 'utf-8' codec can't decode bytes in position 0-1: invalid continuation byte
```

Pull Request resolved: pytorch#154885
Approved by: https://github.com/jingsh, https://github.com/mlazos
… overhead (pytorch#154764)" This reverts commit 7dee899. Reverted pytorch#154764 on behalf of https://github.com/seemethere because it fails internal tests; see [fburl.com/diff/67gyp7gp](https://fburl.com/diff/67gyp7gp) ([comment](pytorch#154769 (comment)))
…tion event (pytorch#154769)" This reverts commit 409c396. Reverted pytorch#154769 on behalf of https://github.com/seemethere because it fails internal tests; see [fburl.com/diff/67gyp7gp](https://fburl.com/diff/67gyp7gp) ([comment](pytorch#154769 (comment)))
Summary: The goal of this PR and future follow-up PRs is to group a set of header files required by AOTInductor Standalone in a separate directory, ensuring they are implemented in a header-only manner. Test Plan: CI. Differential Revision: D75756619 Pull Request resolved: pytorch#154850 Approved by: https://github.com/janeyx99
…h#154929) While integrating the flight recorder (FR) with gloo, I found that putting all the logic inside one .cpp file guarded by both build macros does not work with the current linkage setup in the Bazel file. If we put the .cpp in libtorch_cpu, the CUDA-side build fails; if we put it in both, we get `ld.lld: error: duplicate symbol: typeinfo for c10d::DebugInfoWriter`. To fix this, the common logic is moved into a separate header file, and different .cpp files are used for CPU and CUDA so that FR can be used in both cases. Pull Request resolved: pytorch#154929 Approved by: https://github.com/kwen2501
For graph partition, `write_get_raw_stream_header_once` is done once so the autotune code may not have the header. This PR additionally calls `write_get_raw_stream_header` in `codegen_device_guard_enter` before `get_raw_stream` is used. Pull Request resolved: pytorch#154698 Approved by: https://github.com/oulgen
Pull Request resolved: pytorch#154863 Approved by: https://github.com/Mingming-Ding
…52819) As per comment in pytorch#111471 (comment) the tests are failing due to hypothesis. This PR adds a skip to those tests. Pull Request resolved: pytorch#152819 Approved by: https://github.com/eqy
pytorch#154764) We observed that the guard overhead measured at runtime from profiler traces was higher than what this profiling function reported at compile time. After investigation, we found that `f_locals` were already in the cache, which made the guard overhead look much smaller when profiling during compilation. To make the measurement more realistic, we flush the cache here. Profiling the guard overhead during compilation (in addition to at runtime) allows faster iteration time and logging in tlparse and internal databases. Pull Request resolved: pytorch#154764 Approved by: https://github.com/zou3519, https://github.com/jansel, https://github.com/StrongerXi
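The effect of flushing the cache before timing can be illustrated with a standalone sketch; `guard_fn`, `f_locals`, and the 64 MB scratch size below are illustrative assumptions, not the PR's actual code:

```python
import time

import torch


def time_guard_check(guard_fn, f_locals, flush_cache=True):
    # Evicting f_locals from the CPU caches before timing gives a number
    # closer to what the runtime profiler reports; without the flush, the
    # warm-cache measurement understates the real guard overhead.
    if flush_cache:
        scratch = torch.empty(64 * 1024 * 1024, dtype=torch.uint8)
        scratch.zero_()  # touch ~64 MB to push f_locals out of the caches
    start = time.perf_counter()
    guard_fn(f_locals)
    return time.perf_counter() - start
```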
…ytorch#153723) In order to take the globally best tiling, we need to normalize all the node reads and writes to a common iteration space. This first PR finds a common split among nodes in a fused scheduler node, and then normalizes reads and writes to the common split. Pull Request resolved: pytorch#153723 Approved by: https://github.com/jansel
This should make it faster than the MPSGraph implementation, and it also improves accuracy for small inputs by using the algorithm described in [What Every Computer Scientist Should Know About Floating-Point Arithmetic](https://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html#1202), i.e. $\log(1+x) = \frac{x \cdot \log(1+x)}{(1 + x) - 1}$ if $1 + x \neq 1$, else just $x$. Also tried using the first 3 terms of the Taylor series in Horner's form, which also seems to work fine, i.e. $\log(1+x) \approx x (1 - x(\frac{1}{2} - \frac{x}{3}))$. Replaced the less accurate log1p implementation in `c10/metal/special_math.h` with the generic one. Parametrized and modified the regression test to check the accuracy of small values.

TODOs:
- Do a proper implementation for complex values as well, perhaps using https://github.com/ml-explore/mlx/blob/0408ba0a768a3493fc3e12262162eca2e55346f0/mlx/backend/metal/kernels/utils.h#L339
- Maybe implement it using the Remez-like algorithm documented at https://github.com/freebsd/freebsd-src/blob/207f3b2b25eaa0f9d32699e664b139e5e40e5450/lib/msun/src/s_log1pf.c#L37
- Or use LLVM's implementation from https://github.com/llvm/llvm-project/blob/f393986b53b108457529213c1559346fdb8120ae/libclc/clc/lib/generic/math/clc_log1p.inc#L22
- Benchmark which algorithm is faster and delivers better accuracy

Pull Request resolved: pytorch#154936
Approved by: https://github.com/dcci, https://github.com/Skylion007
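For reference, a minimal Python sketch of the compensated formula above (the PR itself implements this in a Metal kernel; the test values and tolerance below are arbitrary assumptions):

```python
import math


def log1p_compensated(x: float) -> float:
    # Goldberg's trick: the rounding error made when computing 1 + x is
    # cancelled by the ratio x / ((1 + x) - 1).
    u = 1.0 + x
    if u == 1.0:
        return x  # 1 + x rounded to exactly 1, so log(1 + x) ~ x
    return x * math.log(u) / (u - 1.0)


# Small-value accuracy check against the reference math.log1p.
for x in (1e-16, 1e-10, 1e-5, 0.5):
    assert math.isclose(log1p_compensated(x), math.log1p(x), rel_tol=1e-14)
```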
As the rest of torch uses it, the test should rely on it as well. Pull Request resolved: pytorch#154946 Approved by: https://github.com/cyyever, https://github.com/Skylion007
Analyze memory expressions to see if they contain a coalescing symbol. Pull Request resolved: pytorch#153730 Approved by: https://github.com/jansel ghstack dependencies: pytorch#153723
…arch64 (pytorch#150106) Adds an optional variable OPENBLAS_VERSION to `.ci/docker/common/install_openblas.sh`, used to define which version of OpenBLAS to install. Adds the corresponding argument to the `Dockerfile_2_28_aarch64` image. Pull Request resolved: pytorch#150106 Approved by: https://github.com/aditew01, https://github.com/fadara01, https://github.com/malfet Co-authored-by: Fadi Arafeh <[email protected]>
Find variables that coalesce the reads and writes and score the total size. If uncoalesced memory expressions are found, look for additional tilings of variables that will coalesce the memory accesses. For instance, for the expression `(32*p0) // 2048`, tiling `p0` by 64 makes the expression coalesced, as shown in the sketch below. Pull Request resolved: pytorch#153748 Approved by: https://github.com/jansel ghstack dependencies: pytorch#153723, pytorch#153730
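A concrete illustration of that example (a standalone sketch, not Inductor's tiling code): with `p0 = 64*q1 + q0` and `0 <= q0 < 64`, the floor-division collapses and the index becomes the unit-stride expression `q1`.

```python
def index(p0: int) -> int:
    # The uncoalesced memory expression from the example above.
    return (32 * p0) // 2048


# After tiling p0 by 64, the inner offset q0 no longer affects the index and
# the access is coalesced along the outer tile variable q1.
for q1 in range(4):
    for q0 in range(64):
        assert index(64 * q1 + q0) == q1
```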
This is a follow-up to the previous DTensor redistribute PR, pytorch#150740, which enables SimpleFSDP's mixed-precision training. In the most recent integration in TorchTitan (pytorch/torchtitan#1250), we found some discrepancies between SimpleFSDP's `fully_shard` and `replicate` modes when mixed-precision training is enabled. After debugging, I found the problem is in DTensor redistribute: `local_tensor` is taken out again from the original `input`, so the DTensor used for communication keeps its original precision instead of using `forward_dtype`. This PR fixes the issue and corrects previously added test cases. After the fix, the loss curves of `fully_shard` and `replicate` mode match perfectly.  Pull Request resolved: pytorch#154975 Approved by: https://github.com/tianyu-l
Before: `USE_NVSHMEM=1` needed to be set explicitly in the build environment. After: `USE_NVSHMEM=1` is the default for CUDA/ROCm on Linux. Pull Request resolved: pytorch#154538 Approved by: https://github.com/ngimel
Removing https://pytorch.org/docs/stable/torch.compiler_best_practices_for_backends.html per torch.compile audit Pull Request resolved: pytorch#154572 Approved by: https://github.com/williamwen42, https://github.com/svekars
Fix pytorch#154373, pytorch#154391, pytorch#154408, pytorch#154443, pytorch#154481 Because MultiProcContinousTest [now executes the tests with 8 GPUs instead of 2](pytorch#153653), our PP tests comparing gradients have become flakier due to the longer pipeline. The gradients are still close but we need to relax the tolerance. Pull Request resolved: pytorch#154856 Approved by: https://github.com/Skylion007
This is a first quick prototype of FR integration for gloo. A few feature gaps remain:
- Input/output numels for each collective
- Whether to use c10::Event, and where to use it
- Where to dump the FR traces (the dump API is provided in this PR)

Differential Revision: [D75803601](https://our.internmc.facebook.com/intern/diff/D75803601) Pull Request resolved: pytorch#152614 Approved by: https://github.com/d4l3k ghstack dependencies: pytorch#154929
Update issue template for binary data and numerical notes. Fixes #ISSUE_NUMBER Pull Request resolved: pytorch#154857 Approved by: https://github.com/Skylion007, https://github.com/malfet
Trivial change, but I want pyright to stop yelling at me. Pull Request resolved: pytorch#154927 Approved by: https://github.com/cyyever, https://github.com/Skylion007
…h#154504)

## Summary
Adds missing type annotations to `torch.nn.init` and removes `# mypy: allow-untyped-defs` since all functions are now properly typed.

## Changes
- Added missing type annotations to initialization functions in the module.
- Added missing typing imports: `Any`, `Callable`, `Union`
- Removed the `# mypy: allow-untyped-defs` comment
- Created Literal types for the kaiming initialization mode and nonlinearity.
- Created `__all__`

## Why
Better IDE support, catches type errors earlier, and brings the module up to PyTorch's typing standards. No runtime changes - purely additive typing improvements. Tested with the existing test suite and lintrunner.

Pull Request resolved: pytorch#154504
Approved by: https://github.com/Skylion007
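To illustrate what Literal-typed string arguments buy here, a hedged sketch follows; the wrapper name and exact annotations are illustrative and not the PR's code:

```python
from typing import Literal

import torch
from torch import Tensor

# Literal type constraining the kaiming "mode" argument to its two valid values.
_FanMode = Literal["fan_in", "fan_out"]


def kaiming_uniform_example(
    tensor: Tensor,
    a: float = 0.0,
    mode: _FanMode = "fan_in",
    nonlinearity: str = "leaky_relu",
) -> Tensor:
    # Delegates to the real initializer; a type checker now rejects typos
    # such as mode="fan_inn" at call sites of this wrapper.
    return torch.nn.init.kaiming_uniform_(tensor, a=a, mode=mode, nonlinearity=nonlinearity)


w = kaiming_uniform_example(torch.empty(64, 128), mode="fan_out")
```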
By changing the functor to look as follows

```metal
struct xlog1py_functor {
  template <typename T, enable_if_t<is_floating_point_v<T>, bool> = true>
  inline T operator()(const T a, const T b) {
    return static_cast<T>(c10::metal::xlog1py(a, b));
  }
  template <typename T, enable_if_t<is_integral_v<T>, bool> = true>
  inline float operator()(const T a, const T b) {
    return c10::metal::xlog1py(float(a), float(b));
  }
};
```

Repeat the same for `zeta`, `chebyshev_polynomial_[tuvw]_functor` and `hermite_polynomial_h[e]_functor`. Pull Request resolved: pytorch#155002 Approved by: https://github.com/Skylion007, https://github.com/dcci ghstack dependencies: pytorch#154936
Add comprehensive module docstring explaining the tracing rules and policies that govern TorchDynamo's compilation decisions, including skip rules, inlining policies, and library-specific handling. Originally generated by claude but reviewed and edited by me. Pull Request resolved: pytorch#155401 Approved by: https://github.com/williamwen42
Add comprehensive module docstring explaining side effect tracking and management, including mutation tracking, context changes, aliasing, and state preservation during symbolic execution. Originally generated by claude but reviewed and edited by me. Pull Request resolved: pytorch#155403 Approved by: https://github.com/williamwen42
1. Enable strided inputs
2. Implement "2d/2d", "3d/2d" and "3d/3d" combinations of inputs
3. Fix non-TMA load variant
4. Replace experimental_device_tensormap_create2d with _experimental_make_tensor_descriptor
5. Fix cases when group size along K dimension is not a multiple of block size along K
6. Update meta registration
7. Update synthetic offsets creation

Pull Request resolved: pytorch#150944
Approved by: https://github.com/ngimel
Add comprehensive module docstring explaining built-in function and type variable tracking, including handling of Python built-ins, type constructors, operators, and special constructs during symbolic execution. Originally generated by claude but reviewed and edited by me. Pull Request resolved: pytorch#155402 Approved by: https://github.com/Skylion007 ghstack dependencies: pytorch#155403
Pull Request resolved: pytorch#155425 Approved by: https://github.com/ezyang
This reverts commit 2596e3d. Reverted pytorch#154575 on behalf of https://github.com/clee2000 due to broke inductor/test_op_dtype_prop.py::TestCaseCUDA::test_op_dtype_propagation_add_cuda_int32 [GH job link](https://github.com/pytorch/pytorch/actions/runs/15510656657/job/43673763835) [HUD commit link](https://hud.pytorch.org/pytorch/pytorch/commit/2596e3d0617852469241be8777cf46db5c83928c), note for self: bad TD ([comment](pytorch#154575 (comment)))
…tion (pytorch#154821) Fixes pytorch#154674. Addresses an issue where `torch.export` does not correctly preserve Python `Enum` types during the save/load round-trip. Previously, Enum inputs were serialized by value only, causing their type to be lost after deserialization. Pull Request resolved: pytorch#154821 Approved by: https://github.com/XuehaiPan, https://github.com/Skylion007, https://github.com/yushangdi, https://github.com/angelayi
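A hedged sketch of the round-trip the description refers to; the module and enum names are made up for illustration, and it assumes Enum example inputs are accepted as in the linked issue:

```python
import io
from enum import Enum

import torch


class Mode(Enum):
    FAST = 0
    SLOW = 1


class M(torch.nn.Module):
    def forward(self, x: torch.Tensor, mode: Mode) -> torch.Tensor:
        return x + 1 if mode == Mode.FAST else x - 1


ep = torch.export.export(M(), (torch.randn(4), Mode.FAST))

buf = io.BytesIO()
torch.export.save(ep, buf)
buf.seek(0)
loaded = torch.export.load(buf)
# With the fix, the example input's Enum type survives the round-trip
# instead of being stored as a bare value.
```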
…I C-shim dispatching (pytorch#154371)" This reverts commit 65b1aed. Reverted pytorch#154371 on behalf of https://github.com/clee2000 due to see henry's comment above. This was reverted internally because it causes a memory leak and OOMs on AMD? ([comment](pytorch#154371 (comment)))
Vibe-coded with Codex after collecting a backtrace, see https://chatgpt.com/s/cd_68438be8a1248191adbfa0a5f000e60b. Even though a check for an empty tensor list exists in `at::cat`, a crash might still happen while resolving a named dimension to a position via `dimname_to_position(tensors[0], dim)`; see the backtrace below.

```
(lldb) up
frame #1: 0x00000001101146dc libtorch_cpu.dylib`at::TensorBase::has_names(this=0x0000000000000000) const at TensorBase.h:559:10
   556    bool has_names() const {
   557      // If a user is using unnamed tensors, then we can short-circuit right here.
   558      // Otherwise, impl::has_names attempts to retrieve names.
-> 559      if (!impl_->has_named_tensor_meta()) {
   560        return false;
   561      }
   562      return impl::has_names(unsafeGetTensorImpl());
(lldb) up
frame #2: 0x00000001101144c4 libtorch_cpu.dylib`at::dimname_to_position(tensor=0x0000000000000000, dim=Dimname @ 0x000000016fdfe348) at NamedTensorUtils.cpp:23:3
   20   int64_t dimname_to_position(const Tensor& tensor, Dimname dim) {
   21     TORCH_CHECK(dim.type() != NameType::WILDCARD,
   22         "Please look up dimensions by name, got: name = None.");
-> 23     TORCH_CHECK(tensor.has_names(),
   24         "Name ", dim, " not found in ", toDimnameRepr(tensor), ".");
   25     const auto names = tensor.names();
   26
```

TODOs:
- Maybe move the test from `test_tensor_creation.py` to OpInfo (not sure which one is more readable)
- Replace `TORCH_CHECK` with `TORCH_CHECK_VALUE` and adjust unit tests

Fixes pytorch#155306
Pull Request resolved: pytorch#155383
Approved by: https://github.com/cyyever, https://github.com/ezyang
ghstack dependencies: pytorch#155382
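A hypothetical repro sketch based on the description above (the dimension name is invented; before the fix this path dereferenced `tensors[0]` from an empty list, after it an error is raised instead):

```python
import torch

# cat with an empty tensor list and a *named* dimension used to reach
# dimname_to_position(tensors[0], dim) and dereference a non-existent tensor.
try:
    torch.cat([], dim="C")
except (RuntimeError, ValueError) as e:
    # Exact exception type may differ (see the TORCH_CHECK_VALUE TODO above).
    print("now raises instead of crashing:", e)
```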
Fixes pytorch#155027 Converted RST files to Markdown Pull Request resolved: pytorch#155252 Approved by: https://github.com/svekars Co-authored-by: Svetlana Karslioglu <[email protected]>
…he organization of cuda docs (pytorch#155341) Fixes pytorch#150917. As mentioned in the issue, I've updated the documentation of `garbage_collection_threshold` and improved the organization. Could you please review? Pull Request resolved: pytorch#155341 Approved by: https://github.com/AlannaBurke, https://github.com/ngimel
…rch#155251) On this line, we see that the bw_compiler that dynamo uses for AOTAutograd automatically disables the backward runnable: https://github.com/pytorch/pytorch/blob/05dd638ee98b36254c84095894c36fd0e7d95544/torch/_dynamo/backends/common.py#L76 This disables dynamo in the bw_compiler, but it also disables the runnable the compiler returns. On an AOTAutogradCache hit, however, we never call the bw_compiler, so we don't disable dynamo properly. This only has an effect in certain cases of CPU tensors' backwards, where the backward runs in Python land and dynamo unnecessarily tries to trace through the inductor-generated code. It also only matters if the backward is accessed outside of dynamo itself (say, in a graph break in eager mode), since dynamo already disables the forward function properly.

```
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] TorchDynamo attempted to trace the following frames: [
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * fn /home/jjwu/test.py:9
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * cast /data/users/jjwu/a/pytorch-env/lib/python3.10/typing.py:1737
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] * call /tmp/torchinductor_jjwu/rq/crq327nhoyjzog5n3qlchauucdrunrtutwmmoh7ipoe2ngnson5s.py:35
I0605 09:58:40.135000 3981970 torch/_dynamo/eval_frame.py:517] ]
```

This PR fixes the issue and adds a unit test showing that, with or without a cache hit, the frames dynamo traces are identical. Fixes pytorch#154536 Pull Request resolved: pytorch#155251 Approved by: https://github.com/bdhirsh, https://github.com/anijain2305
…rch#137846)" This reverts commit c6b4f98. Reverted pytorch#137846 on behalf of https://github.com/etaf due to This is breaking tests on xpu, detail log: https://hud.pytorch.org/pr/pytorch/pytorch/154962#43700962849 ([comment](pytorch#137846 (comment)))
Summary: Moves the Weights class to PyTorch core.
Torch Native Runtime RFC: pytorch/rfcs#72
README: https://github.com/pytorch/pytorch/blob/main/torch/nativert/OVERVIEW.md
Test Plan: buck2 run mode/dev-nosan caffe2/test/cpp/nativert:weights_test
Differential Revision: D75973156
Pull Request resolved: pytorch#155156
Approved by: https://github.com/zhxchen17
…ynamo/variables/tensor.py` (pytorch#153146) Part of pytorch#147913. Replaces `unimplemented` with `unimplemented_v2` in `torch/_dynamo/variables/tensor.py`. Pull Request resolved: pytorch#153146 Approved by: https://github.com/williamwen42 Co-authored-by: William Wen <[email protected]>
…vent logging (pytorch#154644) (pytorch#155268) Summary:

**Problem Statement**
Currently, torch distributed elastic does not offer an option to specify the destination for event logging from torch.distributed.run. Events are recorded to the default destination: https://fburl.com/code/7f9b0993. The default destination is "null".

**Solution**
Add an option in torch.distributed.run to specify event_logging_destination. The default value will be "null", which is the current default, so it won't affect users unless they specify it via the command line.

Test Plan: https://www.internalfb.com/mlhub/pipelines/runs/mast/f738408681-TrainingApplication_torch_distributed_run_3?job_attempt=0&version=0&tab=execution_details&env=PRODUCTION
Rollback Plan:
Reviewed By: kiukchung
Differential Revision: D75183591
Pull Request resolved: pytorch#155268
Approved by: https://github.com/d4l3k
…ch#155413) When comparing two graphs exported using different opset versions, even though the value names were the same in both graphs, the node names did not match, so model-explorer could not sync the two graphs. This change updates the names of the nodes that directly produce the output values, for better correspondence across exported graphs.  Pull Request resolved: pytorch#155413 Approved by: https://github.com/cyyever, https://github.com/xadupre
…inter of tensor storage object (pytorch#154859) Summary: The PyTorch execution trace records tensor storage data in the trace. The tensor storage data includes the storage id, offset, number of elements, and number of bytes per element. PARAM et-replay uses this information to allocate/free the tensors. However, the current implementation of generating the tensor storage id does not guarantee that it is unique. ExecutionTraceObserver maintains a lookup table mapping the memory address of the tensor storage object to a unique id. If a new memory address is found, it is put into the hash table and associated with a new id. This does not guarantee the storage object is unique, since the memory the address points to may be released and then re-allocated to a different tensor storage object. Test Plan: buck2 run mode/opt caffe2/test:test_profiler_cuda -- profiler.test_execution_trace.TestExecutionTraceCUDA Differential Revision: D75749065 Pull Request resolved: pytorch#154859 Approved by: https://github.com/eellison, https://github.com/ngimel
Update the torch-xpu-ops commit to [intel/torch-xpu-ops@`a3a196`](intel/torch-xpu-ops@a3a196c), which includes:
- Enhanced Adaptive Average Pooling 2D Backward Kernel for performance and code simplification
- Group Norm Backward Optimization with vectorization and parallel reduction
- Support for the CL path for MaxUnpooling2d and MaxUnpooling3d
- Rename USE_ONEMKL to USE_ONEMKL_XPU and set it to ON by default
- Refactor the USE_XCCL & USE_C10D_XCCL options

Pull Request resolved: pytorch#154962
Approved by: https://github.com/EikanWang
…shape behavior (pytorch#155257) Summary: test to unblock shampoo, needs cleanup. Test Plan: CI. Rollback Plan:
steps:
- jk.update: jk: pytorch/compiler:aliased_inputs_with_mutation_and_dyn_shapes_killswitch, constant_bool: null, consistent_pass_rate: null, fractional_host_rollout: null, sampling_rate: null
- manual.note: content: Set it to false.
Reviewed By: c00w
Differential Revision: D76051868
Pull Request resolved: pytorch#155257
Approved by: https://github.com/c00w
…rch#155353) Original issue: pytorch#154820. Dedicated sub-issue: pytorch#155242. The backward graph is reordered by `reordering_to_mimic_autograd_engine` in partitioners.py, which only records in the backward graph the compute that starts from the tangents. A mutation of the primals (inputs) in backward can therefore be disconnected from the backward graph. We handle this `copy_` specifically, as the framework adds this mutation and it is the only mutation that exists. Pull Request resolved: pytorch#155353 Approved by: https://github.com/bdhirsh, https://github.com/zou3519
…54841)" This reverts commit e694280. Reverted pytorch#154841 on behalf of https://github.com/clee2000 due to failing some tests internally D76135706 ([comment](pytorch#154841 (comment)))
Needed to support sparse operations on Blackwell; also implements new features for the library and optimizes library sizes vs 0.7. Pull Request resolved: pytorch#155232 Approved by: https://github.com/nWEIdia, https://github.com/malfet
Co-authored-by: Yu, Guangye <[email protected]>
…or XPU
Fixes #ISSUE_NUMBER