
mtmd : add support for Qwen2-Audio and SeaLLM-Audio #13760


Merged
7 commits merged into ggml-org:master on May 25, 2025

Conversation

@ngxson (Collaborator) commented on May 24, 2025

This PR adds support for Qwen2-Audio and SeaLLM-Audio.

Important

This model produces very poor results even on text-only tasks.

It is not practically usable and often hallucinates after a short amount of generated text (even with F16 precision). However, I'm attempting to get it supported so I can try out Qwen2.5-Omni next (audio+image input only, no audio output).

Due to the low quality, I think I'll skip uploading quants this time. Maybe I'll need to put up a notice in multimodal.md (edit: added).

Also ref the newly created discussion: #13759


Some findings that led me to conclude that the text model has a problem:

It consistently switches between languages mid-response (tested with both Qwen and SeaLLM, F16 models, same problem)

[screenshots of the bi-language output]

The official demo (non-llama.cpp) hallucinates the content of the audio. I uploaded Martin Luther King Jr.'s "I Have a Dream" speech, but it replied with nonsense music theory.

Official demo: https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo

Why did I say "nonsense music theory"? Try playing it on the piano; it doesn't sound right 😂

I checked the encoder's logits from transformers against my implementation and they all match up, so it's very likely that the problem is in the text model.
The good thing is that SeaLLM performs well on Asian languages, but Ultravox does an even better job.

Input audio: a recording about Hue

SeaLLM (responds correctly, but misses a lot of information):

[screenshot of SeaLLM's response]

Ultravox (very detailed and accurate response):

[screenshot of Ultravox's response]

@ngxson requested a review from ggerganov on May 24, 2025
@github-actions bot added the examples and python (python script changes) labels on May 24, 2025
@github-actions bot added the documentation (Improvements or additions to documentation) label on May 24, 2025
@@ -3450,6 +3485,10 @@ int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * im
const int proj_stack_factor = ctx->vision_model.hparams.proj_stack_factor;
const int n_len = CLIP_ALIGN(img->nx, proj_stack_factor);
n_patches = n_len / proj_stack_factor / 2;
} else if (ctx->proj_type == PROJECTOR_TYPE_QWEN2A) {

Member commented:
nit: this should become a switch (no need to change in this PR)
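
To illustrate the nit, here is a minimal sketch of how the switch form might look. Only the stacked-projector branch is taken from the hunk above (assuming that branch belongs to PROJECTOR_TYPE_ULTRAVOX); the QWEN2A body is a placeholder, since this hunk does not show it:

switch (ctx->proj_type) {
    case PROJECTOR_TYPE_ULTRAVOX: {
        // number of output tokens depends on the projector's stacking factor
        const int proj_stack_factor = ctx->vision_model.hparams.proj_stack_factor;
        const int n_len = CLIP_ALIGN(img->nx, proj_stack_factor);
        n_patches = n_len / proj_stack_factor / 2;
    } break;
    case PROJECTOR_TYPE_QWEN2A: {
        // ... Qwen2-Audio projector (body omitted here; see the actual diff)
    } break;
    default:
        // other projectors keep their existing computation
        break;
}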

if (ctx->proj_type == PROJECTOR_TYPE_QWEN2A) {
ggml_tensor * cur = inpL;
cur = ggml_transpose(ctx0, cur);
cur = ggml_cast(ctx0, cur, GGML_TYPE_F32);

Member commented:
Any reason to prefer ggml_cast here over ggml_cont?


@ngxson (Collaborator, Author) replied on May 25, 2025:
I had some problems with ggml_compute_forward_pool_1d, and when looking into the source code I mistakenly thought that it only supports F32. Changed to ggml_cont in e53a0dc.

Note: I got this error without the cast or cont; maybe we should assert that the input is contiguous:

==83831==ERROR: AddressSanitizer: BUS on unknown address (pc 0x000104b899ec bp 0x00016bbc1010 sp 0x00016bbc0fd0 T0)
==83831==The signal is caused by a WRITE memory access.
==83831==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x104b899ec in ggml_compute_forward_pool_1d ops.cpp:6395
    #1 0x104ae9920 in ggml_graph_compute_thread ggml-cpu.c:2847
    #2 0x104ae7e78 in ggml_graph_compute ggml-cpu.c:3138
    #3 0x104aef350 in ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:172
    #4 0x10542513c in ggml_backend_sched_graph_compute_async ggml-backend.cpp:1594
    #5 0x105424500 in ggml_backend_sched_graph_compute ggml-backend.cpp:1578
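
For reference, the assert I have in mind would be a guard near the top of ggml_compute_forward_pool_1d in ops.cpp, along these lines (just a sketch; the src0/dst naming follows the usual ggml convention and the exact placement is an assumption, not the actual implementation):

const ggml_tensor * src0 = dst->src[0];

// pool_1d reads and writes assuming a densely packed row layout, so a
// transposed (non-contiguous) input leads to wild accesses like the
// BUS error above; fail with a clear assert instead of crashing
GGML_ASSERT(ggml_is_contiguous(src0));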

@zhouwg (Contributor) commented on May 25, 2025

ngxson, sorry to bother you:

  • Do you have any plan to add support for ByteDance's latest multimodal model BAGEL (https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) in the MTMD subsystem? As some tech media have reported, DeepSeek, Alibaba Qwen, ByteDance Seed, and Tencent Hunyuan are the four most powerful/creative AI labs/teams in China.
  • Do you have any plan to add support for Google's state-of-the-art Gemma 3n in the MTMD subsystem?

@ngxson (Collaborator, Author) commented on May 25, 2025

Re: BAGEL, we don't have any plan for image-generation support. Re: Gemma 3n, sorry, I can't tell you anything about this.

@zhouwg (Contributor) commented on May 25, 2025

Thanks for your answer, and thanks for your valuable time.

@ngxson merged commit 40aaa8a into ggml-org:master on May 25, 2025
49 checks passed
@foldl (Contributor) commented on Jun 24, 2025

Does Qwen2-Audio perform poorly?

Audio

I came across this issue from the mtmd docs. I am curious about what's going on.

Since @ngxson had not shared the 30-second audio clip, I extracted part of the recording from here (see attachment) and tested it with chatllm.cpp. I think the result is not poor at all. The model is Q8_0 quantized.

main.exe -m quantized\qwen2-audio-7b.bin -ngl all --multimedia-file-tags {{ }} -i
    ________          __  __    __    __  ___ 
   / ____/ /_  ____ _/ /_/ /   / /   /  |/  /_________  ____  
  / /   / __ \/ __ `/ __/ /   / /   / /|_/ // ___/ __ \/ __ \ 
 / /___/ / / / /_/ / /_/ /___/ /___/ /  / // /__/ /_/ / /_/ / 
 \____/_/ /_/\__,_/\__/_____/_____/_/  /_(_)___/ .___/ .___/  
You are served by Qwen2-Audio,                /_/   /_/       
with 7754878976 (7.8B) parameters.

You  > what is this? {{audio:.../MLKDream_64kb.wav}}
A.I. > This is the transcript of a speech given by an individual at an event.

You  > transcribe what he said
A.I. > The speech said was: 'Fivescore years ago a great American in whose symbolic shadow we stand today signed the Emancipation Proclamation.'

P.S. I think this well-known recording is probably in the training set.

MLKDream_64kb.zip

Text-only

I failed to reproduce the bi-language issue. I tested twice: once with random sampling and once with greedy sampling. Q8_0 quantization.

You  > write a very technical blog post about LLM
A.I. > Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture that is particularly well-suited for modeling sequential data. It was first introduced in 2009 by researchers at the University of Toronto, and has since become one of the most widely-used and successful architectures for a variety of natural language processing tasks.
One of the key advantages of LLMs is that they are able to effectively model long-term dependencies in sequential data. This is in contrast to traditional RNNs, which struggle with this task due to the vanishing gradient problem. The vanishing gradient problem occurs when the gradients of the loss function with respect to the model parameters become very small, making it difficult for the model to learn long-term dependencies.
LLMs address this problem by using a technique called "gated" units, which allow the model to selectively remember or forget information from previous time steps. This is achieved through the use of two gates: the "input gate" and the "output gate", which control the flow of information into and out of the cell state of the LSTM.
In addition to the gated units, LLMs also use a mechanism called "cell state" to maintain information over longer periods of time. The cell state is a hidden state that is passed from one time step to the next, and it is updated based on the input at the current time step and the previous cell state.
LLMs have been highly successful in a wide range of natural language processing tasks, including language modeling, machine translation, sentiment analysis, and question answering. They have also been used in a variety of other applications, such as speech recognition and computer vision.
Overall, LLMs are a powerful and versatile architecture for modeling sequential data, and they have had a significant impact on the field of natural language processing.

Greedy:

You  > write a very technical blog post about LLM
A.I. > Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture that is widely used for natural language processing tasks such as language modeling, machine translation, and sentiment analysis. It is particularly effective at handling long-term dependencies in sequential data, which is a common problem in natural language processing.
One of the key features of LLMs is their ability to maintain long-term dependencies by using a mechanism called "gates" that control the flow of information through the network. This allows the network to selectively remember or forget information from previous time steps, which is particularly useful for tasks that require the network to maintain information over long periods of time.
Another important feature of LLMs is their ability to handle variable-length sequences, which is a common problem in natural language processing tasks. This is achieved through the use of a dynamic memory cell that can store information of different lengths, allowing the network to handle variable-length sequences without the need for explicit padding or truncation.
In addition to these key features, LLMs also have a number of other advantages over other RNN architectures. For example, they are more computationally efficient, as they do not require the network to recompute the same values multiple times. They are also more robust to vanishing gradients, which can be a problem in traditional RNNs when training on long sequences.
Overall, LLMs are a powerful and versatile architecture that have proven to be highly effective for a wide range of natural language processing tasks. They are a key component of many state-of-the-art models in this field, and their ability to handle long-term dependencies and variable-length sequences makes them an essential tool for researchers and practitioners alike.
