Closed
Name and Version
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA vGPU-32GB, compute capability 8.9, VMM: yes
version: 4954 (3cd3a39)
built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
Operating systems
Linux
GGML backends
CUDA
Hardware
4080S 32G
Models
No response
Problem description & steps to reproduce
If llama.cpp is compiled with CUDA, llama-llava-clip-quantize-cli crashes when quantizing the vision part of the CLIP model. After some debugging, I traced the error to the area shown below.
This is most likely caused by the quantization code being unable to access tensor memory that lives in the GPU backend: it only runs correctly when llama.cpp is compiled with the CPU backend. Has anyone else encountered this problem?
./build/bin/llama-llava-clip-quantize-cli ~/autodl-tmp/llava-v1.5-7b/mmproj-model-f16.gguf ~/autodl-tmp/llava-v1.5-7b/mmproj-model-Q4_0.gguf 2
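As a workaround until this is fixed, rebuilding with the CUDA backend disabled makes the tool run, since all tensors then stay in host memory. A sketch of that rebuild (the `-DGGML_CUDA=OFF` flag follows the current llama.cpp CMake options; the target name is assumed to match the binary name and may differ in your checkout):

```shell
# Rebuild into a separate directory with the CUDA backend disabled.
cmake -B build-cpu -DGGML_CUDA=OFF
cmake --build build-cpu --target llama-llava-clip-quantize-cli -j

# Run the quantizer from the CPU-only build.
./build-cpu/bin/llama-llava-clip-quantize-cli \
    ~/autodl-tmp/llava-v1.5-7b/mmproj-model-f16.gguf \
    ~/autodl-tmp/llava-v1.5-7b/mmproj-model-Q4_0.gguf 2
```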
First Bad Commit
No response
Relevant log output
(llamacpp) root@autodl-container-1a0b499d52-72782394:~/llama.cpp# ./build/bin/llama-llava-clip-quantize-cli ~/autodl-tmp/llava-v1.5-7b/mmproj-model-f16.gguf ~/autodl-tmp/llava-v1.5-7b/mmproj-model-Q4_0.gguf 2
clip_init: model name: BGE-VL-large
clip_init: description: image encoder for LLaVA
clip_init: GGUF version: 3
clip_init: alignment: 32
clip_init: n_tensors: 377
clip_init: n_kv: 19
clip_init: ftype: f16
clip_init: loaded meta data with 19 key-value pairs and 377 tensors from /root/autodl-tmp/llava-v1.5-7b/mmproj-model-f16.gguf
clip_init: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
clip_init: - kv 0: general.architecture str = clip
clip_init: - kv 1: clip.has_text_encoder bool = false
clip_init: - kv 2: clip.has_vision_encoder bool = true
clip_init: - kv 3: clip.has_llava_projector bool = true
clip_init: - kv 4: general.file_type u32 = 1
clip_init: - kv 5: general.name str = BGE-VL-large
clip_init: - kv 6: general.description str = image encoder for LLaVA
clip_init: - kv 7: clip.projector_type str = mlp
clip_init: - kv 8: clip.vision.image_size u32 = 224
clip_init: - kv 9: clip.vision.patch_size u32 = 14
clip_init: - kv 10: clip.vision.embedding_length u32 = 1024
clip_init: - kv 11: clip.vision.feed_forward_length u32 = 4096
clip_init: - kv 12: clip.vision.projection_dim u32 = 768
clip_init: - kv 13: clip.vision.attention.head_count u32 = 16
clip_init: - kv 14: clip.vision.attention.layer_norm_epsilon f32 = 0.000010
clip_init: - kv 15: clip.vision.block_count u32 = 23
clip_init: - kv 16: clip.vision.image_mean arr[f32,3] = [0.481455, 0.457828, 0.408211]
clip_init: - kv 17: clip.vision.image_std arr[f32,3] = [0.268630, 0.261303, 0.275777]
clip_init: - kv 18: clip.use_gelu bool = false
clip_init: - type f32: 235 tensors
clip_init: - type f16: 142 tensors
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA vGPU-32GB, compute capability 8.9, VMM: yes
clip_ctx: CLIP using CUDA0 backend
key clip.use_silu not found in file
clip_init: text_encoder: 0
clip_init: vision_encoder: 1
clip_init: llava_projector: 1
clip_init: minicpmv_projector: 0
clip_init: minicpmv_version: 2
clip_init: glm_projector: 0
clip_init: model size: 594.86 MB
clip_init: metadata size: 0.13 MB
clip_init: params backend buffer size = 594.86 MB (377 tensors)
key clip.vision.image_grid_pinpoints not found in file
key clip.vision.feature_layer not found in file
key clip.vision.mm_patch_merge_type not found in file
key clip.vision.image_crop_resolution not found in file
clip_init: vision model hparams
image_size 224
patch_size 14
v_hidden_size 1024
v_n_intermediate 4096
v_projection_dim 768
v_n_head 16
v_n_layer 23
v_eps 0.000010
v_image_mean 0.481455 0.457828 0.408211
v_image_std 0.268630 0.261303 0.275777
v_image_grid_pinpoints:
v_vision_feature_layer:
v_mm_patch_merge_type: flat
clip_init: CUDA0 compute buffer size = 9.63 MiB
clip_init: CPU compute buffer size = 1.58 MiB
Segmentation fault (core dumped)