
mtmd : add support for Qwen2-Audio and SeaLLM-Audio #13760


Merged
7 commits merged into ggml-org:master on May 25, 2025

Conversation

@ngxson (Collaborator) commented on May 24, 2025

This PR adds support for Qwen2-Audio and SeaLLM-Audio.

Important

This model produces very poor results even on text-only tasks.

It is not practically usable and often hallucinates after a short amount of generated text (even with F16 precision). However, I'm attempting to get it supported so I can try out Qwen2.5-Omni next (audio+image input only, no audio output).

Due to the low quality, I think I'll skip uploading quants this time. Maybe I'll need to put up a notice in multimodal.md (edit: added).

Also ref the newly created discussion: #13759


Some findings that led me to conclude that the text model has a problem:

It consistently switches between languages mid-response (tested with both Qwen and SeaLLM, F16 models, same problem)

[screenshots of the bi-language output]

The official demo (non-llama.cpp) hallucinates the content of the audio. I uploaded Martin Luther King Jr.'s "I Have a Dream" speech, but it replied with nonsense music theory.

Official demo: https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo

Why did I say "nonsense music theory"? Try playing it on the piano; it doesn't sound right 😂

I checked the encoder's logits from transformers against my implementation and they all match up, so it's very likely that the problem is in the text model.
The good thing is that SeaLLM performs well on Asian languages, but Ultravox does an even better job.

Input audio: a recording about Hue

SeaLLM (responds correctly, but misses a lot of information):

[screenshot of SeaLLM's response]

Ultravox (very detailed and accurate response):

[screenshot of Ultravox's response]

@ngxson requested a review from ggerganov on May 24, 2025
@github-actions bot added the examples and python (python script changes) labels on May 24, 2025
@github-actions bot added the documentation (Improvements or additions to documentation) label on May 24, 2025
@@ -3450,6 +3485,10 @@ int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * im
const int proj_stack_factor = ctx->vision_model.hparams.proj_stack_factor;
const int n_len = CLIP_ALIGN(img->nx, proj_stack_factor);
n_patches = n_len / proj_stack_factor / 2;
} else if (ctx->proj_type == PROJECTOR_TYPE_QWEN2A) {

Member commented:
nit: this should become a switch (no need to change in this PR)
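
To illustrate the nit, here is a minimal sketch of how the switch form might look. Only the stacked-projector branch is taken from the hunk above (assuming that branch belongs to PROJECTOR_TYPE_ULTRAVOX); the QWEN2A body is a placeholder, since this hunk does not show it:

switch (ctx->proj_type) {
    case PROJECTOR_TYPE_ULTRAVOX: {
        // number of output tokens depends on the projector's stacking factor
        const int proj_stack_factor = ctx->vision_model.hparams.proj_stack_factor;
        const int n_len = CLIP_ALIGN(img->nx, proj_stack_factor);
        n_patches = n_len / proj_stack_factor / 2;
    } break;
    case PROJECTOR_TYPE_QWEN2A: {
        // ... Qwen2-Audio projector (body omitted here; see the actual diff)
    } break;
    default:
        // other projectors keep their existing computation
        break;
}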

if (ctx->proj_type == PROJECTOR_TYPE_QWEN2A) {
ggml_tensor * cur = inpL;
cur = ggml_transpose(ctx0, cur);
cur = ggml_cast(ctx0, cur, GGML_TYPE_F32);

Member commented:
Any reason to prefer ggml_cast here over ggml_cont?


@ngxson (Collaborator, Author) replied on May 25, 2025:
I had some problems with ggml_compute_forward_pool_1d, and when looking into the source code I mistakenly thought that it only supports F32. Changed to ggml_cont in e53a0dc.

Note: I got this error without the cast or cont; maybe we should assert that the input is contiguous:

==83831==ERROR: AddressSanitizer: BUS on unknown address (pc 0x000104b899ec bp 0x00016bbc1010 sp 0x00016bbc0fd0 T0)
==83831==The signal is caused by a WRITE memory access.
==83831==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x104b899ec in ggml_compute_forward_pool_1d ops.cpp:6395
    #1 0x104ae9920 in ggml_graph_compute_thread ggml-cpu.c:2847
    #2 0x104ae7e78 in ggml_graph_compute ggml-cpu.c:3138
    #3 0x104aef350 in ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:172
    #4 0x10542513c in ggml_backend_sched_graph_compute_async ggml-backend.cpp:1594
    #5 0x105424500 in ggml_backend_sched_graph_compute ggml-backend.cpp:1578
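
For reference, the assert I have in mind would be a guard near the top of ggml_compute_forward_pool_1d in ops.cpp, along these lines (just a sketch; the src0/dst naming follows the usual ggml convention and the exact placement is an assumption, not the actual implementation):

const ggml_tensor * src0 = dst->src[0];

// pool_1d reads and writes assuming a densely packed row layout, so a
// transposed (non-contiguous) input leads to wild accesses like the
// BUS error above; fail with a clear assert instead of crashing
GGML_ASSERT(ggml_is_contiguous(src0));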

@zhouwg (Contributor) commented on May 25, 2025

ngxson, sorry to bother you:

  • Do you have any plan to add support for ByteDance's latest multimodal model BAGEL (https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) in the MTMD subsystem? As some tech media have reported, DeepSeek, Alibaba Qwen, ByteDance Seed, and Tencent Hunyuan are the four most powerful/creative AI labs/teams in China.
  • Do you have any plan to add support for Google's state-of-the-art Gemma 3n in the MTMD subsystem?

@ngxson (Collaborator, Author) commented on May 25, 2025

Re: BAGEL, we don't have any plan for image-generation support. Re: Gemma 3n, sorry, I can't tell you anything about this.

@zhouwg (Contributor) commented on May 25, 2025

Thanks for your answer, and thanks for your valuable time.

@ngxson merged commit 40aaa8a into ggml-org:master on May 25, 2025
49 checks passed
@foldl (Contributor) commented on Jun 24, 2025

Does Qwen2-Audio perform poorly?

Audio

I came across this issue from the mtmd docs. I am curious about what's going on.

Since @ngxson had not shared the 30-second audio clip, I extracted part of the recording from here (see attachment) and tested it with chatllm.cpp. I think the result is not poor at all. The model is Q8_0 quantized.

main.exe -m quantized\qwen2-audio-7b.bin -ngl all --multimedia-file-tags {{ }} -i
    ________          __  __    __    __  ___ 
   / ____/ /_  ____ _/ /_/ /   / /   /  |/  /_________  ____  
  / /   / __ \/ __ `/ __/ /   / /   / /|_/ // ___/ __ \/ __ \ 
 / /___/ / / / /_/ / /_/ /___/ /___/ /  / // /__/ /_/ / /_/ / 
 \____/_/ /_/\__,_/\__/_____/_____/_/  /_(_)___/ .___/ .___/  
You are served by Qwen2-Audio,                /_/   /_/       
with 7754878976 (7.8B) parameters.

You  > what is this? {{audio:.../MLKDream_64kb.wav}}
A.I. > This is the transcript of a speech given by an individual at an event.

You  > transcribe what he said
A.I. > The speech said was: 'Fivescore years ago a great American in whose symbolic shadow we stand today signed the Emancipation Proclamation.'

P.S. I think this well-known recording is probably in the training set.

MLKDream_64kb.zip

Text-only

I failed to reproduce the bi-language issue. I tested twice: once with random sampling and once with greedy sampling. Q8_0 quantization.

You  > write a very technical blog post about LLM
A.I. > Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture that is particularly well-suited for modeling sequential data. It was first introduced in 2009 by researchers at the University of Toronto, and has since become one of the most widely-used and successful architectures for a variety of natural language processing tasks.
One of the key advantages of LLMs is that they are able to effectively model long-term dependencies in sequential data. This is in contrast to traditional RNNs, which struggle with this task due to the vanishing gradient problem. The vanishing gradient problem occurs when the gradients of the loss function with respect to the model parameters become very small, making it difficult for the model to learn long-term dependencies.
LLMs address this problem by using a technique called "gated" units, which allow the model to selectively remember or forget information from previous time steps. This is achieved through the use of two gates: the "input gate" and the "output gate", which control the flow of information into and out of the cell state of the LSTM.
In addition to the gated units, LLMs also use a mechanism called "cell state" to maintain information over longer periods of time. The cell state is a hidden state that is passed from one time step to the next, and it is updated based on the input at the current time step and the previous cell state.
LLMs have been highly successful in a wide range of natural language processing tasks, including language modeling, machine translation, sentiment analysis, and question answering. They have also been used in a variety of other applications, such as speech recognition and computer vision.
Overall, LLMs are a powerful and versatile architecture for modeling sequential data, and they have had a significant impact on the field of natural language processing.

Greedy:

You  > write a very technical blog post about LLM
A.I. > Long Short-Term Memory (LSTM) is a type of recurrent neural network (RNN) architecture that is widely used for natural language processing tasks such as language modeling, machine translation, and sentiment analysis. It is particularly effective at handling long-term dependencies in sequential data, which is a common problem in natural language processing.
One of the key features of LLMs is their ability to maintain long-term dependencies by using a mechanism called "gates" that control the flow of information through the network. This allows the network to selectively remember or forget information from previous time steps, which is particularly useful for tasks that require the network to maintain information over long periods of time.
Another important feature of LLMs is their ability to handle variable-length sequences, which is a common problem in natural language processing tasks. This is achieved through the use of a dynamic memory cell that can store information of different lengths, allowing the network to handle variable-length sequences without the need for explicit padding or truncation.
In addition to these key features, LLMs also have a number of other advantages over other RNN architectures. For example, they are more computationally efficient, as they do not require the network to recompute the same values multiple times. They are also more robust to vanishing gradients, which can be a problem in traditional RNNs when training on long sequences.
Overall, LLMs are a powerful and versatile architecture that have proven to be highly effective for a wide range of natural language processing tasks. They are a key component of many state-of-the-art models in this field, and their ability to handle long-term dependencies and variable-length sequences makes them an essential tool for researchers and practitioners alike.
