[WIP] CUDA backend #1983
Conversation
I want to add ROCm support based on your CUDA pull request. Would that be ok with you?
Awesome progress so far @zcbenz!! I'm wondering what the best way is to get this incorporated into MLX. I can think of a couple of options:
I kind of prefer the latter... but I'm open to suggestions.
@radudiaconu0 Of course I'm ok with it! Before you begin, you might want to decide how the ROCm backend will live together with the CUDA backend. I'm not familiar with ROCm, but I have seen two patterns in projects with both backends:
Another thing to note is that this PR is bound to change heavily in the following weeks; I'm still experimenting with what the best interface for integration is.
Awesome progress indeed! Just chiming in regarding the best way to incorporate this. Imho merging often is the way to go (option 2, basically). Combined with running CUDA tests in CI, it will be the easiest to live with (since we'll know when we break it even if we don't use it). Otherwise the CUDA branch would have to be constantly rebased on top of main, which could be annoying.
I would try to make a separate HIP folder, or use hipify on your CUDA code to make it use ROCm/HIP.
I find myself constantly refactoring the code when porting new kernels; I think I still need to implement a few more primitives before the backbone code gets stable, probably a few more weeks of experimenting. Once the code is ready for review, I can split this PR into a backbone PR and a few small PRs for each primitive. Future work would then be submitted in incremental PRs.
In CUDA the kernel parameters' size must be known at compile time, i.e. we can't pass dynamically sized shapes/strides via constant memory like the Metal kernels do. I'm currently passing shapes/strides to kernels via fixed-size arrays with a hard-coded maximum number of dimensions.
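For illustration, here is a minimal sketch of this approach (not the PR's actual code; MAX_NDIM, KernelDims, and copy_general are hypothetical names): the dynamic shape/strides travel to the kernel inside a struct whose size is fixed at compile time.

#include <cstdint>

// Hypothetical sketch: kernel parameters must have a compile-time size,
// so dynamic shapes/strides are packed into fixed-size arrays.
constexpr int MAX_NDIM = 8;  // the single compile-time number bounding dimensionality

struct KernelDims {
  int ndim;
  int shape[MAX_NDIM];
  int64_t strides[MAX_NDIM];
};

// Map a linear element index to a strided memory offset.
__device__ int64_t elem_to_offset(int elem, const KernelDims& dims) {
  int64_t offset = 0;
  for (int i = dims.ndim - 1; i >= 0; --i) {
    offset += (elem % dims.shape[i]) * dims.strides[i];
    elem /= dims.shape[i];
  }
  return offset;
}

// Copy a possibly non-contiguous input into a contiguous output.
__global__ void copy_general(const float* in, float* out, KernelDims dims, int size) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < size) {
    out[idx] = in[elem_to_offset(idx, dims)];
  }
}

The struct is passed by value, so it is copied into the kernel's parameter space at launch; raising the dimension limit only means changing MAX_NDIM.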
Sounds great! As long as we can change it by setting one number somewhere, I think that's perfectly fine.
The C++ logistic regression example works now:

$ ./build/examples/cpp/logistic_regression
Loss array(0.0344943, dtype=float32), Accuracy, array(1, dtype=float32), Throughput 518.05 (it/s).
Each step takes about 2 ms (i.e. 500 it/s), and each step consists of:
We can see that we spend as much time launching kernels as waiting for the results. Looking closer at the kernel-launching part, there are gaps between each launch. The "CUDA HW" panel indicates that the kernel running time is the same as PyTorch's; however, there are very large paddings between the kernels. Some of them belong to ops that do not need to launch kernels (like broadcast), and some come from slow host-side calls. Finally, there are long paddings between the steps. Overall the overhead does not look very big: it is only about 2 ms per step, and it should be a fixed cost since it is not related to the size of the arrays. However, for a small workload like this example, where a whole step also takes about 2 ms, the overhead dominates the running time.
After a closer look, I think the "NVTX" row under "CUDA HW" only marks the event that started the kernel; it does not indicate that the event started at the same time as the kernel. The kernels are started asynchronously.
For PyTorch the time between two simple ops is about 5 µs, while for us it is at least 41 µs. There are a few things I can do to reduce the overhead:
Lastly, there is still good news: the time spent running the kernels is the same as PyTorch, which means we don't need to improve the kernel implementations.
Where are those calls coming from? Is it here? We might be able to reduce the number of times we call that if needed... I'm not actually certain it needs to be there. I think it's just a mechanism to eagerly clean up unused events.
Yes, that is where the calls come from. Removing it would be great; I think it is an expensive op on all platforms.
I tried the ideas above, including switching the implementation mentioned earlier. The next optimization is tricky: after evaluating each op, the operands and temporaries have to be kept alive until the kernel finishes running. In Metal this is done in mlx/mlx/backend/metal/metal.cpp, lines 55 to 71 (at 60c4154).
In CUDA there is a similar mechanism, but it adds noticeable latency. To get rid of this latency, I changed the CUDA backend to save the operands and temporaries of each op until the whole graph eval finishes. The downside is that the arrays take longer to be destroyed, which could increase memory usage. The code also no longer waits if there are more tasks than MAX_ACTIVE_TASKS. After this optimization the speed increased from 1100 it/s to 1600.

There were still many kernels that got unusually delayed. What did the delayed kernels have in common? Before launching the kernel they all called the same API. In the CUDA backend we use the unified memory APIs, which automatically transfer data between host and device; since I know the data is going to be used on the GPU, I prefetched it to the device ahead of time. It turns out the prefetching heavily delayed the kernel executions. Removing the prefetching increased the speed from 1600 it/s to 2100, and we now have a really beautiful timeline in the profiler.

One optimization I haven't done yet is the buffer cache: I will add it when most ops are implemented and there are no more third-party libraries to be integrated.

Can we do better? The remaining work is mostly hard work: optimizing the kernels and making the CPU code run faster, which I think should be visited after we have implemented all ops.
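To make the lifetime handling above concrete, here is a minimal sketch with hypothetical names (RetainedBuffers, retain, flush); it is not the PR's implementation, just one way to retain operands/temporaries and release them in a single batch through a host callback on the stream rather than paying per-op callback latency.

#include <cuda_runtime.h>
#include <memory>
#include <vector>

// Host callback: runs on a CUDA-internal thread once prior stream work is done.
// It must not call CUDA APIs, so the retained buffers' destructors should only
// touch host state when released here.
static void release_batch(void* data) {
  delete static_cast<std::vector<std::shared_ptr<void>>*>(data);
}

// Hypothetical helper: keep arrays' buffers alive until the kernels that use
// them have finished, then drop them in one batch.
struct RetainedBuffers {
  std::vector<std::shared_ptr<void>> buffers;

  void retain(std::shared_ptr<void> buf) {
    buffers.push_back(std::move(buf));
  }

  // Enqueue a host callback that frees the whole batch once every kernel
  // launched so far on `stream` has completed.
  void flush(cudaStream_t stream) {
    if (buffers.empty()) return;
    auto* batch = new std::vector<std::shared_ptr<void>>(std::move(buffers));
    buffers.clear();
    cudaLaunchHostFunc(stream, release_batch, batch);
  }
};

In this sketch, retain would run as each op is evaluated and flush at the end of an eval; deferring the flush even further is exactly the memory-versus-latency trade-off raised in the replies below.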
Very nice @zcbenz !
That one is a bit concerning. For large graphs it can really blow up memory use if you hold the temporaries until the end of the graph eval. I don't think it's worth doing that. We might want to do something in between, like saving them once every ~10 calls to …
I agree, this feels like a premature optimization for a special case. I'll make the behavior easy to configure so we can optimize for more cases in the future.
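Building on the RetainedBuffers sketch above (again hypothetical, not the PR's code), the in-between behavior could be as simple as a configurable flush interval:

// Hypothetical knob: release retained arrays every N evals instead of holding
// them for the whole graph or flushing after every op.
constexpr int kFlushInterval = 10;

void maybe_flush(RetainedBuffers& retained, cudaStream_t stream) {
  static int evals_since_flush = 0;  // not thread-safe; sketch only
  if (++evals_since_flush >= kFlushInterval) {
    retained.flush(stream);  // bounds memory growth to roughly N evals of temporaries
    evals_since_flush = 0;
  }
}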
Hi @zcbenz,
I'm willing to collaborate. If you're interested, I can run builds on Jetson devices, report back, and try tinkering to some degree (I'm not an expert in CUDA).
@corupta Can you remove the …? As for the slow compilation, you can pass … .
tl;dr: it gives a segfault.
Some info about environments I use:
Thanks for testing the build. Unfortunately this kind of error requires me to work in the actual environment to debug. Currently I'm only testing on a few cloud environments, but I will look into making it work on Jetson once most development is done; devices with hardware unified memory definitely need first-class support.
I'm closing this as it has been split into smaller pieces (check the linked PR above); future changes to the CUDA backend will be submitted as incremental PRs.
This PR is an ongoing effort to add a CUDA backend to MLX. Very little works now, but you can already run the tutorial example.
To build and test:
For development I usually use:
$ cmake . -Bbuild -DMLX_BUILD_CUDA=ON -DMLX_BUILD_EXAMPLES=ON -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache -DCMAKE_CUDA_COMPILER_LAUNCHER=ccache -DCMAKE_BUILD_TYPE=Debug -GNinja
Only tested on Ubuntu 22.04 with CUDA 11.6; in theory other environments should also work, but they have not been tested.
This PR is not updated frequently; if anyone is interested in the real-time development, please check my forked repo.
There are mainly 2 reasons for a CUDA backend:
This work is sponsored by Apple.