Description
Currently the number of threads used for prompt processing and inference is defined by `n_threads`, unless CPU-based BLAS is used. In that case it is locked to 1 for prompt processing only, since OpenBLAS and friends are already multithreaded to begin with.
Meanwhile, GPU BLAS implementations spawn `n_threads` threads for prompt processing. Here's a CLBlast run with the default 4 threads on my 4-core/8-thread CPU:
llama_print_timings: load time = 391.66 ms
llama_print_timings: sample time = 78.83 ms / 100 runs ( 0.79 ms per token, 1268.50 tokens per second)
llama_print_timings: prompt eval time = 11181.22 ms / 401 tokens ( 27.88 ms per token, 35.86 tokens per second)
llama_print_timings: eval time = 21230.25 ms / 99 runs ( 214.45 ms per token, 4.66 tokens per second)
llama_print_timings: total time = 32514.96 ms
I get around 28 ms/token during prompt processing. Now let's try running with a different thread count by modifying line 1819:
`n_threads = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas()) ? <NUMBER OF THREADS> : n_threads;`
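To make the experiment a bit easier to read, here is the same selection pulled out into a small helper; `pick_n_threads` and `n_threads_pp` are names I made up for this sketch, not anything that exists in the code:

```cpp
#include "ggml.h"   // ggml_cpu_has_blas() / ggml_cpu_has_gpublas()

// Sketch only: choose the thread count for a batch of N tokens.
// Large batches go through BLAS, so they get their own count (n_threads_pp)
// instead of the number hard-coded on line 1819.
static int pick_n_threads(int N, int n_threads, int n_threads_pp) {
    const bool blas_batch = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas());
    return blas_batch ? n_threads_pp   // prompt processing handled by BLAS
                      : n_threads;     // regular token-by-token eval
}
```

The table below is simply that prompt-processing value swept over 1, 2, 3, 4 and 8: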
| Thread count | ms/token (averaged over multiple runs) |
|---|---|
| 1 | 44 |
| 2 | 29 |
| 3 | 28 |
| 4 | 28 |
| 8 | 30 |
On the prompt-processing side I'm able to get the same results with only 2 or 3 threads, which saves power and puts less load on the CPU. Meanwhile, I get optimal inference speed with 4 threads (I use the CPU for inference since, for some reason, it's faster than the GPU), so there's a discrepancy there.
Is anyone else seeing this? I'm thinking of adding an additional command-line option to control the prompt-processing thread count.
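For concreteness, roughly what I have in mind is sketched below; the flag name `--pp-threads` and the field `n_pp_threads` are placeholders, not an existing option:

```cpp
#include <cstdint>
#include <cstdlib>
#include <cstring>

// Hypothetical params extension: keep the existing inference thread count
// and add a separate one for BLAS-accelerated prompt processing.
struct gpt_params {
    int32_t n_threads    = 4;  // existing: threads for token-by-token inference
    int32_t n_pp_threads = 2;  // new: threads for prompt processing (placeholder name)
};

// Minimal argument handling, alongside the existing -t style thread flag.
static void parse_thread_args(int argc, char ** argv, gpt_params & params) {
    for (int i = 1; i + 1 < argc; ++i) {
        if (strcmp(argv[i], "-t") == 0) {
            params.n_threads = atoi(argv[++i]);
        } else if (strcmp(argv[i], "--pp-threads") == 0) {
            params.n_pp_threads = atoi(argv[++i]);
        }
    }
}
```

The eval path would then pass `params.n_pp_threads` into the selection instead of a constant, so prompt processing and inference could be tuned independently.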