Description
Add an option to build ggml-cuda with JIT kernel compilation using NVRTC.
As of today ggml-cuda is AOT (Ahead Of Time):
all CUDA kernels are compiled ahead of time with the local nvcc for a limited set of NVIDIA archs. The embedded kernels are therefore only runnable on those archs, and ggml-cuda.so can potentially become huge.
Another way to run device/NPU/GPU kernels is JIT (Just In Time):
ggml-cuda would embed the source code of (some) kernels in the lib, link with NVRTC, and compile only the kernels that are actually needed to PTX at runtime, just before their very first execution.
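To make the idea concrete, here is a minimal sketch (not actual ggml code) of the NVRTC flow: the kernel source lives in the binary as a string, gets compiled to PTX at runtime for the local arch, and is then loaded and launched through the CUDA driver API. The `scale_f32` kernel and all names are illustrative only; error checking is mostly omitted.

```cpp
// Build with: g++ jit_sketch.cpp -lcuda -lnvrtc
#include <cstdio>
#include <string>
#include <vector>
#include <cuda.h>
#include <nvrtc.h>

// Hypothetical kernel kept as source in the lib instead of an AOT-compiled cubin.
// extern "C" avoids C++ name mangling so cuModuleGetFunction can find it by name.
static const char *k_scale_src = R"(
extern "C" __global__ void scale_f32(float *x, float v, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= v;
}
)";

int main() {
    cuInit(0);
    CUdevice dev; cuDeviceGet(&dev, 0);
    CUcontext ctx; cuCtxCreate(&ctx, 0, dev);

    // Target the *local* arch instead of a fixed AOT list.
    int cc_major = 0, cc_minor = 0;
    cuDeviceGetAttribute(&cc_major, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MAJOR, dev);
    cuDeviceGetAttribute(&cc_minor, CU_DEVICE_ATTRIBUTE_COMPUTE_CAPABILITY_MINOR, dev);
    std::string arch = "--gpu-architecture=compute_"
                       + std::to_string(cc_major) + std::to_string(cc_minor);

    // Compile the kernel source to PTX at runtime.
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, k_scale_src, "scale.cu", 0, nullptr, nullptr);
    const char *opts[] = { arch.c_str() };
    if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        size_t log_size; nvrtcGetProgramLogSize(prog, &log_size);
        std::vector<char> log(log_size); nvrtcGetProgramLog(prog, log.data());
        fprintf(stderr, "nvrtc: %s\n", log.data());
        return 1;
    }
    size_t ptx_size; nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size); nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    // Load the PTX and launch, as would happen just before the first kernel execution.
    CUmodule   mod; cuModuleLoadData(&mod, ptx.data());
    CUfunction fn;  cuModuleGetFunction(&fn, mod, "scale_f32");

    int n = 1024;
    CUdeviceptr d_x; cuMemAlloc(&d_x, n * sizeof(float));   // data init omitted
    float v = 2.0f;
    void *args[] = { &d_x, &v, &n };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1, 256, 1, 1, 0, nullptr, args, nullptr);
    cuCtxSynchronize();

    cuMemFree(d_x);
    cuModuleUnload(mod);
    cuCtxDestroy(ctx);
    return 0;
}
```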
Advantages: faster ggml-cuda builds, a smaller lib, more archs targeted dynamically, and the ability to parameterize the kernels for the local hardware, which can give better perf.
Drawbacks: the very first execution of a kernel is slower because it requires a runtime compilation; later launches can reuse the compiled kernel from a cache, as sketched below.
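As a sketch of why only the first launch pays the compilation cost, a compiled-kernel cache keyed by kernel name could look like this. `get_kernel` and the `compile` callback are hypothetical helpers (the callback would wrap the NVRTC flow from the sketch above); this is not an existing ggml API.

```cpp
#include <functional>
#include <string>
#include <unordered_map>
#include <cuda.h>

// Return the cached CUfunction for `name`, or invoke `compile` (e.g. the NVRTC
// flow shown earlier) exactly once and cache the result for later launches.
CUfunction get_kernel(const std::string &name,
                      const std::function<CUfunction()> &compile) {
    static std::unordered_map<std::string, CUfunction> cache;
    auto it = cache.find(name);
    if (it != cache.end()) {
        return it->second;            // fast path: already JIT-compiled
    }
    CUfunction fn = compile();        // slow path: only on the very first execution
    cache.emplace(name, fn);
    return fn;
}
```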
Refs:
https://docs.nvidia.com/cuda/nvrtc/
https://github.com/pytorch/pytorch/blob/b0a5d55c584792a504ec18600180e3d1200dfea6/torch/csrc/jit/tensorexpr/cuda_codegen.cpp#L1262
https://github.com/arrayfire/arrayfire/blob/360fefb3551a7c9f91250b0ec894aad76ec6a022/src/backend/cuda/compile_module.cpp#L153
@ggerganov
would you consider some PRs adding this option (no change to the default behavior, which would remain AOT/nvcc)?