# Description

# Prerequisites
Please answer the following questions for yourself before submitting an issue.

- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the README.md.
- [x] I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed). Related issues I could find: #3018 (Trying to build a model of PY007/TinyLlama-1.1B-step-50K-105b) and #3697 (Cannot load Bloom-7b1 ggml model in GPU).
- [x] I reviewed the Discussions, and have a new bug or useful enhancement to share.
# Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.

Hi, I just started working with llama.cpp and stumbled on this issue. It may be something on my side, but I can't get it working. I want to use the full 32-bit GGUF model converted from a PyTorch model, ideally without any further quantization. Can you help me, or is this a bug? If any more information is required, let me know.
1. Convert the model to a 32-bit GGUF with:
   ```sh
   python3 convert.py ./models/tinyllama-1.1b-chat-v0.3
   ```
2. Run this model using the llama.cpp Docker image.
# Current Behavior

Please provide a detailed written description of what llama.cpp did, instead.
1. Convert the model to a 32-bit GGUF (succeeds):
   ```sh
   python3 convert.py ./models/tinyllama-1.1b-chat-v0.3
   ```
2. Run this model using the llama.cpp Docker image (fails).

Other output types do work:

1. Convert the model to a 16-bit GGUF (succeeds):
   ```sh
   python3 convert.py ./models/tinyllama-1.1b-chat-v0.3 --outtype f16
   ```
2. Run this model using the llama.cpp Docker image (succeeds).
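To rule out a broken conversion on my side (an assumption, not something the logs confirm), the header of each output file can be sanity-checked. This sketch assumes the GGUF layout of a 4-byte `GGUF` magic followed by a little-endian uint32 version field:

```python
import struct

def gguf_version(path):
    """Return the GGUF format version, or raise if the magic bytes are wrong."""
    with open(path, "rb") as f:
        magic = f.read(4)
        if magic != b"GGUF":
            raise ValueError(f"not a GGUF file, magic = {magic!r}")
        # the magic is followed by a little-endian uint32 version field
        (version,) = struct.unpack("<I", f.read(4))
    return version
```

Running `gguf_version(...)` on both the f32 and f16 outputs reports the same version for me, so the f32 file does not look truncated or mis-written.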
# Environment and Context
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
- Physical (or virtual) hardware you are using, e.g. for Linux:

```
$ lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7900X 12-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU max MHz: 5732,7139
CPU min MHz: 3000,0000
BogoMIPS: 9381.89
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization features:
Virtualization: AMD-V
Caches (sum of all):
L1d: 384 KiB (12 instances)
L1i: 384 KiB (12 instances)
L2: 12 MiB (12 instances)
L3: 64 MiB (2 instances)
NUMA:
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
```

GPU: NVIDIA GeForce RTX 4090
- Operating System, e.g. for Linux:

```
$ uname -a
Linux GreenServer 6.2.0-34-generic #34~22.04.1-Ubuntu SMP PREEMPT_DYNAMIC Thu Sep 7 13:12:03 UTC 2 x86_64 x86_64 x86_64 GNU/Linux
```
- SDK version, e.g. for Linux:

```
$ python3 --version
Python 3.10.12
$ make --version
GNU Make 4.3
$ g++ --version
g++ (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
```
Containerfile:

```dockerfile
FROM ghcr.io/ggerganov/llama.cpp:full-cuda
ENTRYPOINT ["./main"]
```
Docker compose file:

```yaml
services:
  llama.cpp:
    image: llama.cpp
    container_name: llama.cpp-gpu
    build:
      context: '${PWD}/'
      dockerfile: '${PWD}/Containerfile.orig'
    volumes:
      - '${PWD}/models:/models'
    command: -m /models/TinyLLama/original/ggml-model-f32.gguf -p "Building a website can be done in 10 simple steps:" -n 1024 --seed 12345678 -t 2 --n-gpu-layers 99
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
```
# Failure Information (for bugs)

Please help provide information about the failure / bug.

```
.GGML_ASSERT: ggml-cuda.cu:6115: false
```

It's unclear to me what this error means.
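For context, this is a simplified sketch (not the actual ggml source) of what an assert macro like `GGML_ASSERT` does: if the condition is false, it prints the file, line, and stringified condition, then aborts. An assertion on the literal constant `false` therefore marks a code path the authors treat as unsupported or unreachable:

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical stand-in for a GGML_ASSERT-style macro: on a false
 * condition it prints "GGML_ASSERT: <file>:<line>: <condition>"
 * to stderr and calls abort(), crashing the process. */
#define SKETCH_ASSERT(x)                                        \
    do {                                                        \
        if (!(x)) {                                             \
            fprintf(stderr, "GGML_ASSERT: %s:%d: %s\n",         \
                    __FILE__, __LINE__, #x);                    \
            abort();                                            \
        }                                                       \
    } while (0)
```

If that reading is right, the log line above means the CUDA backend reached a case at ggml-cuda.cu:6115 that it does not handle for this model; which case exactly would require looking at that line in the source.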
# Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

1. Download pytorch_model.bin from https://huggingface.co/PY007/TinyLlama-1.1B-Chat-v0.3 into ./models/tinyllama-1.1b-chat-v0.3
2. Convert it:
   ```sh
   python3 convert.py ./models/tinyllama-1.1b-chat-v0.3
   ```
3. Run `docker compose up`
# Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Log of the 32-bit container:
[+] Running 1/0
✔ Container llama.cpp-gpu Created 0.0s
Attaching to llama.cpp-gpu
llama.cpp-gpu | Log start
llama.cpp-gpu | main: build = 0 (unknown)
llama.cpp-gpu | main: built with cc (Ubuntu 11.3.0-1ubuntu1~22.04.1) 11.3.0 for x86_64-linux-gnu
llama.cpp-gpu | main: seed = 12345678
llama.cpp-gpu | ggml_init_cublas: found 1 CUDA devices:
llama.cpp-gpu | Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9
llama.cpp-gpu | llama_model_loader: loaded meta data with 20 key-value pairs and 201 tensors from /models/TinyLLama/original/TinyLlama-1.1B-Chat-v0.3/ggml-model-f32.gguf (version unknown)
llama.cpp-gpu | llama_model_loader: - tensor 0: token_embd.weight f32 [ 2048, 32003, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 1: blk.0.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 2: blk.0.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 3: blk.0.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 4: blk.0.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 5: blk.0.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 6: blk.0.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 7: blk.0.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 8: blk.0.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 9: blk.0.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 10: blk.1.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 11: blk.1.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 12: blk.1.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 13: blk.1.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 14: blk.1.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 15: blk.1.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 16: blk.1.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 17: blk.1.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 18: blk.1.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 19: blk.2.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 20: blk.2.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 21: blk.2.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 22: blk.2.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 23: blk.2.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 24: blk.2.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 25: blk.2.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 26: blk.2.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 27: blk.2.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 28: blk.3.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 29: blk.3.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 30: blk.3.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 31: blk.3.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 32: blk.3.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 33: blk.3.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 34: blk.3.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 35: blk.3.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 36: blk.3.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 37: blk.4.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 38: blk.4.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 39: blk.4.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 40: blk.4.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 41: blk.4.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 42: blk.4.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 43: blk.4.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 44: blk.4.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 45: blk.4.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 46: blk.5.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 47: blk.5.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 48: blk.5.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 49: blk.5.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 50: blk.5.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 51: blk.5.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 52: blk.5.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 53: blk.5.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 54: blk.5.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 55: blk.6.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 56: blk.6.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 57: blk.6.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 58: blk.6.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 59: blk.6.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 60: blk.6.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 61: blk.6.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 62: blk.6.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 63: blk.6.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 64: blk.7.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 65: blk.7.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 66: blk.7.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 67: blk.7.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 68: blk.7.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 69: blk.7.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 70: blk.7.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 71: blk.7.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 72: blk.7.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 73: blk.8.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 74: blk.8.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 75: blk.8.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 76: blk.8.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 77: blk.8.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 78: blk.8.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 79: blk.8.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 80: blk.8.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 81: blk.8.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 82: blk.9.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 83: blk.9.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 84: blk.9.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 85: blk.9.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 86: blk.9.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 87: blk.9.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 88: blk.9.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 89: blk.9.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 90: blk.9.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 91: blk.10.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 92: blk.10.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 93: blk.10.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 94: blk.10.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 95: blk.10.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 96: blk.10.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 97: blk.10.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 98: blk.10.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 99: blk.10.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 100: blk.11.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 101: blk.11.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 102: blk.11.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 103: blk.11.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 104: blk.11.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 105: blk.11.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 106: blk.11.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 107: blk.11.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 108: blk.11.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 109: blk.12.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 110: blk.12.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 111: blk.12.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 112: blk.12.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 113: blk.12.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 114: blk.12.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 115: blk.12.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 116: blk.12.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 117: blk.12.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 118: blk.13.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 119: blk.13.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 120: blk.13.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 121: blk.13.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 122: blk.13.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 123: blk.13.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 124: blk.13.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 125: blk.13.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 126: blk.13.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 127: blk.14.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 128: blk.14.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 129: blk.14.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 130: blk.14.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 131: blk.14.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 132: blk.14.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 133: blk.14.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 134: blk.14.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 135: blk.14.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 136: blk.15.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 137: blk.15.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 138: blk.15.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 139: blk.15.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 140: blk.15.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 141: blk.15.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 142: blk.15.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 143: blk.15.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 144: blk.15.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 145: blk.16.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 146: blk.16.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 147: blk.16.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 148: blk.16.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 149: blk.16.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 150: blk.16.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 151: blk.16.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 152: blk.16.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 153: blk.16.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 154: blk.17.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 155: blk.17.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 156: blk.17.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 157: blk.17.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 158: blk.17.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 159: blk.17.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 160: blk.17.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 161: blk.17.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 162: blk.17.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 163: blk.18.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 164: blk.18.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 165: blk.18.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 166: blk.18.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 167: blk.18.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 168: blk.18.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 169: blk.18.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 170: blk.18.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 171: blk.18.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 172: blk.19.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 173: blk.19.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 174: blk.19.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 175: blk.19.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 176: blk.19.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 177: blk.19.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 178: blk.19.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 179: blk.19.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 180: blk.19.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 181: blk.20.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 182: blk.20.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 183: blk.20.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 184: blk.20.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 185: blk.20.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 186: blk.20.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 187: blk.20.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 188: blk.20.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 189: blk.20.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 190: blk.21.attn_q.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 191: blk.21.attn_k.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 192: blk.21.attn_v.weight f32 [ 2048, 256, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 193: blk.21.attn_output.weight f32 [ 2048, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 194: blk.21.ffn_gate.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 195: blk.21.ffn_up.weight f32 [ 2048, 5632, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 196: blk.21.ffn_down.weight f32 [ 5632, 2048, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 197: blk.21.attn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 198: blk.21.ffn_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 199: output_norm.weight f32 [ 2048, 1, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - tensor 200: output.weight f32 [ 2048, 32003, 1, 1 ]
llama.cpp-gpu | llama_model_loader: - kv 0: general.architecture str
llama.cpp-gpu | llama_model_loader: - kv 1: general.name str
llama.cpp-gpu | llama_model_loader: - kv 2: llama.context_length u32
llama.cpp-gpu | llama_model_loader: - kv 3: llama.embedding_length u32
llama.cpp-gpu | llama_model_loader: - kv 4: llama.block_count u32
llama.cpp-gpu | llama_model_loader: - kv 5: llama.feed_forward_length u32
llama.cpp-gpu | llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama.cpp-gpu | llama_model_loader: - kv 7: llama.attention.head_count u32
llama.cpp-gpu | llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama.cpp-gpu | llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama.cpp-gpu | llama_model_loader: - kv 10: llama.rope.freq_base f32
llama.cpp-gpu | llama_model_loader: - kv 11: general.file_type u32
llama.cpp-gpu | llama_model_loader: - kv 12: tokenizer.ggml.model str
llama.cpp-gpu | llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama.cpp-gpu | llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama.cpp-gpu | llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama.cpp-gpu | llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama.cpp-gpu | llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama.cpp-gpu | llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama.cpp-gpu | llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32
llama.cpp-gpu | llama_model_loader: - type f32: 201 tensors
llama.cpp-gpu | llm_load_print_meta: format = unknown
llama.cpp-gpu | llm_load_print_meta: arch = llama
llama.cpp-gpu | llm_load_print_meta: vocab type = SPM
llama.cpp-gpu | llm_load_print_meta: n_vocab = 32003
llama.cpp-gpu | llm_load_print_meta: n_merges = 0
llama.cpp-gpu | llm_load_print_meta: n_ctx_train = 2048
llama.cpp-gpu | llm_load_print_meta: n_embd = 2048
llama.cpp-gpu | llm_load_print_meta: n_head = 32
llama.cpp-gpu | llm_load_print_meta: n_head_kv = 4
llama.cpp-gpu | llm_load_print_meta: n_layer = 22
llama.cpp-gpu | llm_load_print_meta: n_rot = 64
llama.cpp-gpu | llm_load_print_meta: n_gqa = 8
llama.cpp-gpu | llm_load_print_meta: f_norm_eps = 0.0e+00
llama.cpp-gpu | llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llama.cpp-gpu | llm_load_print_meta: f_clamp_kqv = 0.0e+00
llama.cpp-gpu | llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llama.cpp-gpu | llm_load_print_meta: n_ff = 5632
llama.cpp-gpu | llm_load_print_meta: freq_base_train = 10000.0
llama.cpp-gpu | llm_load_print_meta: freq_scale_train = 1
llama.cpp-gpu | llm_load_print_meta: model type = ?B
llama.cpp-gpu | llm_load_print_meta: model ftype = all F32
llama.cpp-gpu | llm_load_print_meta: model params = 1.10 B
llama.cpp-gpu | llm_load_print_meta: model size = 4.10 GiB (32.00 BPW)
llama.cpp-gpu | llm_load_print_meta: general.name = models
llama.cpp-gpu | llm_load_print_meta: BOS token = 1 '<s>'
llama.cpp-gpu | llm_load_print_meta: EOS token = 2 '</s>'
llama.cpp-gpu | llm_load_print_meta: UNK token = 0 '<unk>'
llama.cpp-gpu | llm_load_print_meta: PAD token = 32000 '[PAD]'
llama.cpp-gpu | llm_load_print_meta: LF token = 13 '<0x0A>'
llama.cpp-gpu | llm_load_tensors: ggml ctx size = 0.07 MB
llama.cpp-gpu | llm_load_tensors: using CUDA for GPU acceleration
llama.cpp-gpu | llm_load_tensors: mem required = 250.09 MB
llama.cpp-gpu | llm_load_tensors: offloading 22 repeating layers to GPU
llama.cpp-gpu | llm_load_tensors: offloading non-repeating layers to GPU
llama.cpp-gpu | llm_load_tensors: offloaded 25/25 layers to GPU
llama.cpp-gpu | llm_load_tensors: VRAM used: 3946.38 MB
llama.cpp-gpu | .GGML_ASSERT: ggml-cuda.cu:6115: false
llama.cpp-gpu exited with code 139