
Add option to build ggml-cuda as JIT using nvrtc #1154

Open
@WilliamTambellini

Description


Add option to build ggml-cuda as JIT using nvrtc.

As of today, ggml-cuda is AOT (Ahead Of Time):
all CUDA kernels are compiled ahead of time with the local nvcc for a limited set of NVIDIA archs. The embedded kernels are therefore only runnable on those archs, and ggml-cuda.so can become huge.

Another way to run device/NPU/GPU kernels is JIT (Just In Time):
ggml-cuda would embed the source code of (some) kernels in the lib, link against nvrtc, and compile only the kernels that are actually needed to PTX at runtime, just before the very first kernel execution.
Advantages: speeds up building ggml-cuda, limits the size of the lib, targets more archs dynamically, and allows tuning the kernels for the local hardware, which can yield better performance.
Drawbacks: the very first kernel execution is slower because it requires a runtime compilation.
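To make the proposal concrete, here is a minimal, illustrative sketch of the JIT flow using the NVRTC and CUDA driver APIs. This is not the actual ggml-cuda code: the kernel source, its name, and the hard-coded `compute_70` arch option are placeholders (in practice the arch would be queried from the device), and error handling is reduced to asserts.

```cpp
// Sketch: compile an embedded kernel string with NVRTC at runtime,
// load the resulting PTX with the CUDA driver API, and launch it.
#include <nvrtc.h>
#include <cuda.h>
#include <cassert>
#include <vector>

// Placeholder kernel source; in ggml-cuda this would be the embedded kernel code.
static const char *k_src = R"(
extern "C" __global__ void scale_f32(float *x, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= s;
}
)";

int main() {
    // 1. Compile the embedded source to PTX for the target arch.
    nvrtcProgram prog;
    assert(nvrtcCreateProgram(&prog, k_src, "scale_f32.cu", 0, nullptr, nullptr) == NVRTC_SUCCESS);
    const char *opts[] = { "--gpu-architecture=compute_70" }; // placeholder; query from the device
    assert(nvrtcCompileProgram(prog, 1, opts) == NVRTC_SUCCESS);
    size_t ptx_size;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // 2. Load the PTX and get the kernel handle via the driver API.
    cuInit(0);
    CUdevice dev;  cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);
    CUmodule mod;  assert(cuModuleLoadDataEx(&mod, ptx.data(), 0, nullptr, nullptr) == CUDA_SUCCESS);
    CUfunction fn; assert(cuModuleGetFunction(&fn, mod, "scale_f32") == CUDA_SUCCESS);

    // 3. Launch: scale a small buffer and copy it back.
    const int n = 1024;
    CUdeviceptr d_x; cuMemAlloc(&d_x, n * sizeof(float));
    std::vector<float> h_x(n, 1.0f);
    cuMemcpyHtoD(d_x, h_x.data(), n * sizeof(float));
    float s = 2.0f; int n_arg = n;
    void *args[] = { &d_x, &s, &n_arg };
    assert(cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr) == CUDA_SUCCESS);
    cuMemcpyDtoH(h_x.data(), d_x, n * sizeof(float));

    cuMemFree(d_x);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```

Link against `-lnvrtc -lcuda`. In ggml-cuda the compile step would presumably sit behind a cache keyed by kernel + arch (and possibly persisted to disk), so the runtime compilation cost is paid only once.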
Refs:
https://docs.nvidia.com/cuda/nvrtc/
https://github.com/pytorch/pytorch/blob/b0a5d55c584792a504ec18600180e3d1200dfea6/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L1262
https://github.com/arrayfire/arrayfire/blob/360fefb3551a7c9f91250b0ec894aad76ec6a022/src/backend/cuda/compile_module.cpp#L153

@ggerganov
would you consider some PRs adding that option (no change to the default behavior, which would remain AOT/nvcc)?
