Controlling the number of threads for CLBlast/cuBLAS prompt processing #2498

Closed
@netrunnereve

Description

Currently, the number of threads used for both prompt processing and inference is set by n_threads, unless a CPU-based BLAS is used. In that case the thread count is locked to 1 for prompt processing only, since OpenBLAS and similar libraries are already multithreaded.

https://github.com/ggerganov/llama.cpp/blob/8183159cf3def112f6d1fe94815fce70e1bffa12/llama.cpp#L1817-L1819

Meanwhile, GPU BLAS implementations spawn n_threads threads for prompt processing. Here's a CLBlast run with the default 4 threads on my 4-core/8-thread CPU:

llama_print_timings:        load time =   391.66 ms
llama_print_timings:      sample time =    78.83 ms /   100 runs   (    0.79 ms per token,  1268.50 tokens per second)
llama_print_timings: prompt eval time = 11181.22 ms /   401 tokens (   27.88 ms per token,    35.86 tokens per second)
llama_print_timings:        eval time = 21230.25 ms /    99 runs   (  214.45 ms per token,     4.66 tokens per second)
llama_print_timings:       total time = 32514.96 ms

I get around 28 ms/token during prompt processing. Now let's try running with a different thread count by modifying line 1819:

n_threads = N >= 32 && (ggml_cpu_has_blas() || ggml_cpu_has_gpublas()) ? <NUMBER OF THREADS> : n_threads;

| Thread count | ms/token (averaged over multiple runs) |
|--------------|----------------------------------------|
| 1            | 44                                     |
| 2            | 29                                     |
| 3            | 28                                     |
| 4            | 28                                     |
| 8            | 30                                     |
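
One way to read the measurements above is to ask for the smallest thread count that stays within some tolerance of the best result. The following is only a sketch of that arithmetic: the `Sample` struct, the helper names, and the tolerance parameter are made up for illustration and are not part of llama.cpp. The numbers are copied from the table.

```cpp
#include <cassert>
#include <vector>

// Hypothetical record for one row of the table: thread count and measured ms/token.
struct Sample {
    int    threads;
    double ms_per_token;
};

// Measurements copied from the table in this issue (CLBlast prompt processing).
std::vector<Sample> issue_measurements() {
    return {{1, 44.0}, {2, 29.0}, {3, 28.0}, {4, 28.0}, {8, 30.0}};
}

// Returns the smallest thread count whose ms/token is within `tolerance`
// (a fraction, e.g. 0.05 for 5%) of the best measured value.
// Returns 0 if `samples` is empty.
int smallest_good_enough(const std::vector<Sample>& samples, double tolerance) {
    assert(tolerance >= 0.0);
    if (samples.empty()) {
        return 0;
    }

    // Find the best (lowest) ms/token across all samples.
    double best = samples[0].ms_per_token;
    for (const Sample& s : samples) {
        if (s.ms_per_token < best) {
            best = s.ms_per_token;
        }
    }

    // Among samples close enough to the best, keep the lowest thread count.
    int pick = 0;
    for (const Sample& s : samples) {
        const bool close_enough = s.ms_per_token <= best * (1.0 + tolerance);
        if (close_enough && (pick == 0 || s.threads < pick)) {
            pick = s.threads;
        }
    }
    return pick;
}
```

With a 5% tolerance this picks 2 threads for the numbers above; with no tolerance at all it picks 3, the lowest count that matches the best 28 ms/token.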

On the prompt processing side I get the same results with only 2 or 3 threads, which saves power and puts less load on the CPU. Meanwhile, I get optimal inference speed with 4 threads (I use the CPU for inference because, for some reason, it's faster than the GPU), so there's a discrepancy between the two.

Is anyone else seeing this as well? I'm thinking of adding an additional command line option to control the prompt processing thread count.
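
A rough sketch of what I have in mind for the selection logic, written as a standalone function so it can be tried outside llama.cpp. Everything here is an assumption for illustration: `ThreadConfig`, `n_threads_prompt`, `pick_n_threads`, and the two boolean flags are hypothetical names and do not correspond to the real ggml functions. Only the `N >= 32` threshold comes from the line quoted above.

```cpp
#include <cassert>

// Hypothetical settings: the existing thread count plus a separate one
// that would be filled in by the proposed command line option.
struct ThreadConfig {
    int n_threads;         // used for token-by-token inference
    int n_threads_prompt;  // hypothetical: used for GPU BLAS prompt processing
};

// Picks the thread count for one evaluation of `n_tokens` tokens.
// `uses_cpu_blas` / `uses_gpu_blas` are illustrative flags describing which
// kind of BLAS backend handles the batch; they are not the ggml API.
int pick_n_threads(int n_tokens, const ThreadConfig& cfg,
                   bool uses_cpu_blas, bool uses_gpu_blas) {
    assert(cfg.n_threads >= 1 && cfg.n_threads_prompt >= 1);

    // Same batch-size threshold as the quoted line: small batches are
    // treated as inference and keep the regular thread count.
    const bool is_prompt_batch = n_tokens >= 32;
    if (!is_prompt_batch) {
        return cfg.n_threads;
    }
    if (uses_gpu_blas) {
        // Proposed behavior: GPU BLAS prompt processing gets its own count.
        return cfg.n_threads_prompt;
    }
    if (uses_cpu_blas) {
        // Behavior described above: CPU BLAS is already multithreaded.
        return 1;
    }
    // No BLAS at all: unchanged, use the regular thread count.
    return cfg.n_threads;
}
```

On my machine this would let me run with `n_threads = 4` for inference and `n_threads_prompt = 2` for the 401-token prompt above. Whether the new option should also apply when no BLAS is used is an open question; the sketch leaves that case unchanged.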
