Replies: 1 comment
-
Hi there. Both llama-mtmd-cli and llama-server support audio for some models. Supported models can be found at https://huggingface.co/collections/ggml-org/multimodal-ggufs and the documentation is at https://github.com/ggml-org/llama.cpp/blob/master/docs/multimodal.md. See also #13759, #13760, and #13784 for more details.
As you can see, both Qwen2.5-Omni-3B and Qwen2.5-Omni-7B are supported for audio+vision, and the audio file can be WAV or MP3.

For llama-mtmd-cli you can do:

```
./llama.cpp/bin/llama-mtmd-cli \
  -m ./whatever/Qwen2.5-Omni-7B/Qwen2.5-Omni-7B-Q4_K_M.gguf \
  --mmproj ./whatever/Qwen2.5-Omni-7B/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf \
  --ctx-size 8192 \
  --threads 8 \
  -ngl 999 \
  --temp 0.1 \
  --top-p 0.8 \
  --top-k 100 \
  --repeat-penalty 1.05 \
  --audio "/path/to/audio.wav" \
  --prompt "transcribe audio"
```

For llama-server you can do:

```
./llama.cpp/bin/llama-server \
  -m ./whatever/Qwen2.5-Omni-7B/Qwen2.5-Omni-7B-Q4_K_M.gguf \
  --mmproj ./whatever/Qwen2.5-Omni-7B/mmproj-Qwen2.5-Omni-7B-Q8_0.gguf \
  --threads 8 \
  -ngl 999 \
  --host 0.0.0.0 \
  --port 5000 \
  --temp 0.1 \
  --top-p 0.8 \
  --top-k 100 \
  --repeat-penalty 1.05
```

That said, you'd better try Meta-Llama-3.1-8B-Instruct (with the ultravox audio projector), as the results are much better.

llama-mtmd-cli:

```
./llama.cpp/bin/llama-mtmd-cli \
  -m ./whatever/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  --mmproj ./whatever/Meta-Llama-3.1-8B-Instruct/mmproj-ultravox-v0_5-llama-3_1-8b-f16.gguf \
  --ctx-size 8192 \
  --threads 8 \
  -ngl 999 \
  --temp 0.1 \
  --top-p 0.8 \
  --top-k 100 \
  --repeat-penalty 1.05 \
  --audio "/path/to/audio.wav" \
  --prompt "transcribe audio"
```

llama-server:

```
./llama.cpp/bin/llama-server \
  -m ./whatever/Meta-Llama-3.1-8B-Instruct/Meta-Llama-3.1-8B-Instruct-Q5_K_M.gguf \
  --mmproj ./whatever/Meta-Llama-3.1-8B-Instruct/mmproj-ultravox-v0_5-llama-3_1-8b-f16.gguf \
  --ctx-size 8192 \
  --threads 8 \
  -ngl 999 \
  --host 0.0.0.0 \
  --port 5000 \
  --temp 0.1 \
  --top-p 0.8 \
  --top-k 100 \
  --repeat-penalty 1.05
```
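Once llama-server is up, you can query it over its OpenAI-compatible `/v1/chat/completions` endpoint. A minimal Python sketch (not from the thread; it assumes your llama.cpp build accepts the OpenAI-style `input_audio` content part for audio-capable models, as described in docs/multimodal.md, and reuses the placeholder path and port 5000 from the commands above):

```python
# Sketch: send base64-encoded audio to llama-server's OpenAI-compatible
# chat endpoint. The "input_audio" content part is an assumption based on
# the OpenAI audio-input schema; check docs/multimodal.md for your version.
import base64
import json
import os
import urllib.request


def build_transcription_request(audio_bytes: bytes, fmt: str = "wav") -> dict:
    """Build an OpenAI-style chat payload carrying base64-encoded audio."""
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "transcribe audio"},
                    {
                        "type": "input_audio",
                        "input_audio": {
                            "data": base64.b64encode(audio_bytes).decode("ascii"),
                            "format": fmt,  # "wav" or "mp3"
                        },
                    },
                ],
            }
        ],
        # Match the sampling settings used in the server commands above.
        "temperature": 0.1,
        "top_p": 0.8,
    }


AUDIO_PATH = "/path/to/audio.wav"  # replace with a real file

if os.path.exists(AUDIO_PATH):
    with open(AUDIO_PATH, "rb") as f:
        payload = build_transcription_request(f.read())
    req = urllib.request.Request(
        "http://localhost:5000/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        print(json.load(resp)["choices"][0]["message"]["content"])
```

This only shows the request shape; the model, host, and port all come from whichever llama-server command you launched.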
-
I saw that Qwen2.5-Omni now supports both image and audio inputs, which is great! I downloaded the following GGUF files:
Qwen2.5-Omni-3B-Q8_0.gguf (language model)
mmproj-Qwen2.5-Omni-3B-Q8_0.gguf (multimodal projector)
I'm trying to run local inference using llama-mtmd-cli from llama.cpp.
What is the correct command to run inference locally with an image or an audio input? Or is there a README that describes how to use the audio model?