Add option not to abort on cuda malloc errors #1083

Open
@WilliamTambellini

As of today, ggml forcibly aborts the process whenever a CUDA malloc fails, e.g.:

#2  0x00007f99c75ca66e in ggml_abort.cold () 
#3  0x00007f99c7b57882 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () 
#4  0x00007f99c7b5ae80 in ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#5  0x00007f99c7b648a6 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#6  0x00007f99c7b6a2e8 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#7  0x00007f99c7b21715 in ggml_backend_sched_graph_compute_async () from /home/nbuild/pub/xmt/latest/lib/libsdl-xnn-ggml.so

This is not ideal in some production contexts, where we need a controlled way to return an OOM error and then exit, reload, resume, or skip gracefully.
Would you mind if I:

  • add an option (e.g. GGML_NO_ABORT_ON_OOM) to skip the abort on malloc failures, and
  • return GGML_STATUS_ALLOC_FAILED to the callers up the stack (ggml_cuda_mul_mat, ...) when cuda_malloc fails?

Note:

  • by default, ggml would keep the same behavior as today: abort in all cases
  • the option would apply only to malloc failures: ggml would still abort in all other cases.
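
To make the proposal concrete, here is a minimal, self-contained sketch of the control flow I have in mind. It does not use CUDA or the real ggml code: every `sketch_*` / `SKETCH_*` name is hypothetical, the allocator is a host-side stand-in that fails above a fixed budget, and the runtime flag only stands in for the proposed GGML_NO_ABORT_ON_OOM option (which could just as well be a build-time option). The point is only to show the failure status travelling from the allocation up to the graph-compute caller instead of ending in an abort.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Hypothetical status codes for this sketch (not ggml's actual enum).
enum sketch_status {
    SKETCH_STATUS_ALLOC_FAILED = -2,
    SKETCH_STATUS_SUCCESS      = 0,
};

// Hypothetical runtime switch standing in for the proposed GGML_NO_ABORT_ON_OOM.
// Default is false, i.e. same behavior as today: abort.
static bool sketch_no_abort_on_oom = false;

// Stand-in for a device allocator: fails above a fixed budget to simulate OOM.
static const size_t SKETCH_BUDGET = 1024;
static void * sketch_device_malloc(size_t size) {
    return size > SKETCH_BUDGET ? nullptr : std::malloc(size);
}

// Allocation wrapper: aborts by default, returns a status when the option is on.
static sketch_status sketch_alloc(void ** out, size_t size) {
    assert(out != nullptr);
    *out = sketch_device_malloc(size);
    if (*out != nullptr) {
        return SKETCH_STATUS_SUCCESS;
    }
    if (!sketch_no_abort_on_oom) {
        std::fprintf(stderr, "allocation of %zu bytes failed\n", size);
        std::abort();
    }
    return SKETCH_STATUS_ALLOC_FAILED;
}

// Stand-in for an op (e.g. a matrix multiply) that needs scratch memory.
static sketch_status sketch_mul_mat(size_t scratch_bytes) {
    void * scratch = nullptr;
    sketch_status st = sketch_alloc(&scratch, scratch_bytes);
    if (st != SKETCH_STATUS_SUCCESS) {
        return st; // propagate instead of aborting
    }
    // ... the actual computation would go here ...
    std::free(scratch);
    return SKETCH_STATUS_SUCCESS;
}

// Stand-in for graph compute: stops at the first failing op and reports it.
static sketch_status sketch_graph_compute(const size_t * scratch_sizes, size_t n_ops) {
    for (size_t i = 0; i < n_ops; ++i) {
        sketch_status st = sketch_mul_mat(scratch_sizes[i]);
        if (st != SKETCH_STATUS_SUCCESS) {
            return st;
        }
    }
    return SKETCH_STATUS_SUCCESS;
}
```

With the flag enabled, a caller could check the returned status and decide to exit, reload, resume, or skip the request; with the flag left at its default, the sketch aborts exactly as before.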

Best
W.
