mtmd : add support for Qwen2-Audio and SeaLLM-Audio #13760

Merged: 7 commits, May 25, 2025

Conversation

@ngxson (Collaborator) commented on May 24, 2025

This PR adds support for:

- Qwen2-Audio
- SeaLLM-Audio

Important

This model produces very poor results even on text-only tasks.

It is not practically usable and often hallucinates after a short amount of generated text (even with F16 precision). However, I'm adding support for it so I can try out Qwen2.5-Omni next (audio+image input only, no audio output).

Due to the low quality, I think I'll skip uploading quants this time. Maybe I'll need to put a notice in multimodal.md (edit: added).

Also ref the newly created discussion: #13759


Some findings that led me to conclude that the text model has a problem:

It consistently switches between two languages (tested with both Qwen and SeaLLM, F16 models, same problem)

[screenshot]

[screenshot]

The official demo (non-llama.cpp) hallucinates the content of the audio. I uploaded Martin Luther King Jr.'s "I have a dream" speech, but it replied with nonsense music theory.

Official demo: https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo

Why do I say "nonsense music theory"? Try playing it on the piano; it doesn't sound right 😂

[screenshot]

I checked the encoder's logits from transformers against my implementation and they all matched up, so it's very likely that the text model has a problem.

[screenshot]
The good news is that SeaLLM performs well on Asian languages, but Ultravox does an even better job.

Input audio: a recording about Hue

SeaLLM (responds correctly, but misses a lot of information):

[screenshot]

Ultravox (very detailed and accurate response):

[screenshot]

@ngxson ngxson requested a review from ggerganov May 24, 2025 21:09
@github-actions github-actions bot added examples python python script changes labels May 24, 2025
@github-actions github-actions bot added the documentation Improvements or additions to documentation label May 24, 2025
@@ -3450,6 +3485,10 @@ int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * im
const int proj_stack_factor = ctx->vision_model.hparams.proj_stack_factor;
const int n_len = CLIP_ALIGN(img->nx, proj_stack_factor);
n_patches = n_len / proj_stack_factor / 2;
} else if (ctx->proj_type == PROJECTOR_TYPE_QWEN2A) {
Review comment (Member):
nit: this should become a switch (no need to change in this PR)

if (ctx->proj_type == PROJECTOR_TYPE_QWEN2A) {
ggml_tensor * cur = inpL;
cur = ggml_transpose(ctx0, cur);
cur = ggml_cast(ctx0, cur, GGML_TYPE_F32);
Review comment (Member):
Any reason to prefer ggml_cast here over ggml_cont?

@ngxson (Collaborator, Author) replied on May 25, 2025:
I had some problems with ggml_compute_forward_pool_1d, and when looking into the source code I mistakenly thought that it only supports F32. Changed to ggml_cont in e53a0dc.

Note: I got this error without the cast or cont; maybe we should assert that the input is contiguous:

==83831==ERROR: AddressSanitizer: BUS on unknown address (pc 0x000104b899ec bp 0x00016bbc1010 sp 0x00016bbc0fd0 T0)
==83831==The signal is caused by a WRITE memory access.
==83831==Hint: this fault was caused by a dereference of a high value address (see register values below).  Disassemble the provided pc to learn which register was used.
    #0 0x104b899ec in ggml_compute_forward_pool_1d ops.cpp:6395
    #1 0x104ae9920 in ggml_graph_compute_thread ggml-cpu.c:2847
    #2 0x104ae7e78 in ggml_graph_compute ggml-cpu.c:3138
    #3 0x104aef350 in ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:172
    #4 0x10542513c in ggml_backend_sched_graph_compute_async ggml-backend.cpp:1594
    #5 0x105424500 in ggml_backend_sched_graph_compute ggml-backend.cpp:1578

@zhouwg (Contributor) commented on May 25, 2025

ngxson, sorry to bother you:

  • Do you have any plan to add support for ByteDance's latest multimodal model BAGEL (https://huggingface.co/ByteDance-Seed/BAGEL-7B-MoT) in the MTMD subsystem? As some tech media reported, DeepSeek, Alibaba Qwen, ByteDance Seed, and Tencent Hunyuan are the four most powerful/creative top AI labs/teams in China.
  • Do you have any plan to add support for Google's state-of-the-art Gemma 3n in the MTMD subsystem?

@ngxson (Collaborator, Author) commented on May 25, 2025

Re: BAGEL, we don't have any plans for image generation support. Re: Gemma 3n, sorry, I can't tell you anything about that.

@zhouwg (Contributor) commented on May 25, 2025

Thanks for your answer, and thanks for your valuable time.

@ngxson ngxson merged commit 40aaa8a into ggml-org:master May 25, 2025
49 checks passed