Description
Expected Behavior
I built a docker image (with #4211 applied) and wanted to run a finetune inside it. llama.cpp otherwise works for me in docker.
Current Behavior
I ended up with CUDA error 700 at ggml-cuda.cu:6963: an illegal memory access was encountered
Environment and Context
I use multiple GPUs (seven RTX 3090s with 24 GB VRAM each). The model does not fit on a single one, so I could not check whether the problem persists with only one device.
I built it like this:
- edit .devops/full-cuda.Dockerfile and change ARG CUDA_VERSION=11.8.0 to match the machine's CUDA version
- apply #4211 (Add finetune option to the docker image) to make finetune accessible from the docker image
docker build -t local/llama.cpp:full-cuda -f .devops/full-cuda.Dockerfile .
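For reference, the version edit can be scripted like this (a sketch; the 12.2.0 value is only a placeholder and should match whatever nvidia-smi reports on the host):

# the nvidia-smi header shows the driver's CUDA version
nvidia-smi | head -n 4
# replace the hardcoded CUDA version in the Dockerfile (12.2.0 is an example value)
sed -i 's/^ARG CUDA_VERSION=.*/ARG CUDA_VERSION=12.2.0/' .devops/full-cuda.Dockerfile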
Then run the finetune:
docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e CUDA_VISIBLE_DEVICES=1,2,3,4,5,6,7 -v /home/user/llama.cpp/models:/var/model -t local/llama.cpp:full-cuda --finetune \
--model-base /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf \
--checkpoint-in /var/model/chk-in-noushermes-13b-LATEST.gguf \
--checkpoint-out /var/model/chk-in-noushermes-13b-ITERATION.gguf \
--lora-out /var/model/lora-noushermes-13b-ITERATION.bin \
--train-data "/var/model/dataset.txt" \
--save-every 10 \
--threads 10 --adam-iter 30 --epochs 1 --batch 8 --ctx 256 \
--sample-start '<s>' \
--n-gpu-layers 999 \
--use-checkpointing
(I tried different CUDA_VISIBLE_DEVICES setups, such as 0,1,2.) The same setup works for inference with main.
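To check whether a single device reproduces it, a reduced-offload variant could look something like this (untested sketch; the --n-gpu-layers value of 20 is only a placeholder, chosen so part of the f16 model stays on the CPU since the whole model does not fit in 24 GB):

# hypothetical single-GPU run: one visible device, partial offload, other flags as above
docker run --gpus=all --cap-add SYS_RESOURCE -e USE_MLOCK=0 -e CUDA_VISIBLE_DEVICES=0 -v /home/user/llama.cpp/models:/var/model -t local/llama.cpp:full-cuda --finetune \
  --model-base /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf \
  --train-data "/var/model/dataset.txt" \
  --threads 10 --adam-iter 30 --epochs 1 --batch 8 --ctx 256 \
  --sample-start '<s>' \
  --n-gpu-layers 20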
llama.cpp$ git log | head -1
commit e9c13ff78114af6fc6a4f27cc8dcdda0f3d389fb
The run looks like this:
main: seed: 1700876892
main: model base = '/var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf'
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6
Device 2: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 19 key-value pairs and 363 tensors from /var/model/NousResearch--Nous-Hermes-Llama2-13b.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight f16 [ 5120, 32032, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight f16 [ 5120, 5120, 1, 1 ]
[...]
llama_model_loader: - kv 18: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - type f32: 81 tensors
llama_model_loader: - type f16: 282 tensors
llm_load_vocab: special tokens definition check successful ( 291/32032 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32032
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 4096
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_head = 40
llm_load_print_meta: n_head_kv = 40
llm_load_print_meta: n_layer = 40
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 13824
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 4096
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 13B
llm_load_print_meta: model ftype = mostly F16
llm_load_print_meta: model params = 13.02 B
llm_load_print_meta: model size = 24.25 GiB (16.00 BPW)
llm_load_print_meta: general.name = LLaMA v2
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_tensors: ggml ctx size = 0.13 MiB
llm_load_tensors: using CUDA for GPU acceleration
ggml_cuda_set_main_device: using device 0 (NVIDIA GeForce RTX 3090) as main device
llm_load_tensors: mem required = 312.95 MiB
llm_load_tensors: offloading 40 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 43/43 layers to GPU
llm_load_tensors: VRAM used: 24514.39 MiB
...................................................................................................
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: offloading v cache to GPU
llama_kv_cache_init: offloading k cache to GPU
llama_kv_cache_init: VRAM kv self = 400.00 MiB
llama_new_context_with_model: kv self size = 400.00 MiB
llama_build_graph: non-view tensors processed: 924/924
llama_new_context_with_model: compute buffer total size = 78.07 MiB
llama_new_context_with_model: VRAM scratch buffer: 75.00 MiB
llama_new_context_with_model: total VRAM used: 24989.40 MiB (model: 24514.39 MiB, context: 475.00 MiB)
main: init model
print_params: n_vocab: 32032
print_params: n_ctx: 256
print_params: n_embd: 5120
print_params: n_ff: 13824
print_params: n_head: 40
print_params: n_head_kv: 40
print_params: n_layer: 40
print_params: norm_rms_eps : 0.000010
print_params: rope_freq_base : 10000.000000
print_params: rope_freq_scale : 1.000000
print_lora_params: n_rank_attention_norm : 1
print_lora_params: n_rank_wq : 4
print_lora_params: n_rank_wk : 4
print_lora_params: n_rank_wv : 4
print_lora_params: n_rank_wo : 4
print_lora_params: n_rank_ffn_norm : 1
print_lora_params: n_rank_w1 : 4
print_lora_params: n_rank_w2 : 4
print_lora_params: n_rank_w3 : 4
print_lora_params: n_rank_tok_embeddings : 4
print_lora_params: n_rank_norm : 1
print_lora_params: n_rank_output : 4
main: total train_iterations 0
main: seen train_samples 0
main: seen train_tokens 0
main: completed train_epochs 0
main: lora_size = 131432032 bytes (125.3 MB)
main: opt_size = 196306048 bytes (187.2 MB)
main: opt iter 0
main: input_size = 262414368 bytes (250.3 MB)
main: compute_size = 37813245024 bytes (36061.5 MB)
main: evaluation order = LEFT_TO_RIGHT
main: tokenize training data
tokenize_file: warning: found 144 samples (max length 567) that exceed context length of 256. samples will be cut off.
tokenize_file: warning: found 4691 samples (min length 35) that are shorter than context length of 256.
tokenize_file: total number of samples: 4836
main: number of training tokens: 634519
main: number of unique tokens: 10054
main: train data seems to have changed. restarting shuffled epoch.
main: begin training
main: work_size = 1281928 bytes (1.2 MB)
train_opt_callback: iter= 0 sample=1/4836 sched=0.000000 loss=0.000000 |->
CUDA error 700 at ggml-cuda.cu:6963: an illegal memory access was encountered
current device: 0