Add option not to abort on cuda malloc errors

Today ggml force-aborts the process whenever a CUDA malloc fails, e.g.:

```
#2  0x00007f99c75ca66e in ggml_abort.cold () 
#3  0x00007f99c7b57882 in ggml_cuda_error(char const*, char const*, char const*, int, char const*) () 
#4  0x00007f99c7b5ae80 in ggml_cuda_mul_mat_batched_cublas(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#5  0x00007f99c7b648a6 in ggml_cuda_mul_mat(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*) ()
#6  0x00007f99c7b6a2e8 in ggml_backend_cuda_graph_compute(ggml_backend*, ggml_cgraph*)
#7  0x00007f99c7b21715 in ggml_backend_sched_graph_compute_async () from /home/nbuild/pub/xmt/latest/lib/libsdl-xnn-ggml.so

```
This is not ideal in some production contexts, where we need a controlled way to return an OOM error and then exit, reload, resume, or skip gracefully.
Would you mind if I:
- add an option (e.g. GGML_NO_ABORT_ON_OOM) to skip the abort on malloc failures
- return GGML_STATUS_ALLOC_FAILED to the callers up the stack (ggml_cuda_mul_mat, ...) if the CUDA malloc failed

?


Note:
- by default, ggml would keep the same behavior as today: abort in all cases
- the option would apply only to malloc failures: ggml would still abort in all other cases.
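
To make the proposal above concrete, here is a rough sketch of the control flow I have in mind. It is not ggml code: every `sketch_*` name is hypothetical, the device allocator is a plain counter standing in for the CUDA malloc, and `g_no_abort_on_oom` is an assumed runtime switch standing in for the proposed GGML_NO_ABORT_ON_OOM option (which could just as well be a compile-time flag).

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>

// Local stand-in for the status codes; not the real ggml enum.
enum sketch_status {
    SKETCH_STATUS_ALLOC_FAILED = -2,
    SKETCH_STATUS_SUCCESS      = 0,
};

// Hypothetical switch standing in for the proposed GGML_NO_ABORT_ON_OOM option.
// Off by default, so the default behavior stays "abort".
bool g_no_abort_on_oom = false;

// Stand-in for device memory: a fixed capacity and a usage counter.
struct sketch_device {
    size_t capacity;
    size_t used;
};

// Stand-in for the CUDA malloc: fails when the request does not fit.
bool sketch_device_malloc(sketch_device & dev, size_t size) {
    if (size > dev.capacity - dev.used) {
        return false;
    }
    dev.used += size;
    return true;
}

// The only place where the new option is consulted.
sketch_status sketch_alloc_checked(sketch_device & dev, size_t size) {
    if (sketch_device_malloc(dev, size)) {
        return SKETCH_STATUS_SUCCESS;
    }
    if (!g_no_abort_on_oom) {
        fprintf(stderr, "device malloc of %zu bytes failed\n", size);
        abort(); // today's behavior
    }
    return SKETCH_STATUS_ALLOC_FAILED; // proposed behavior with the option on
}

// Stand-in for an op such as ggml_cuda_mul_mat: forwards the allocation status.
sketch_status sketch_mul_mat(sketch_device & dev, size_t scratch_bytes) {
    sketch_status st = sketch_alloc_checked(dev, scratch_bytes);
    if (st != SKETCH_STATUS_SUCCESS) {
        return st;
    }
    // ... the actual computation would run here ...
    dev.used -= scratch_bytes; // release the scratch buffer
    return SKETCH_STATUS_SUCCESS;
}

// Stand-in for the graph compute loop: stops at the first failing node.
sketch_status sketch_graph_compute(sketch_device & dev, const size_t * node_scratch, int n_nodes) {
    for (int i = 0; i < n_nodes; i++) {
        sketch_status st = sketch_mul_mat(dev, node_scratch[i]);
        if (st != SKETCH_STATUS_SUCCESS) {
            return st;
        }
    }
    return SKETCH_STATUS_SUCCESS;
}
```

With the switch on, a caller would check the status returned by the graph compute call and decide whether to exit, reload, or skip. With the switch off, nothing changes from today.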

Best
W.


Add option not to abort on cuda malloc errors #1083

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Add option not to abort on cuda malloc errors #1083

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions