mtmd : add support for Qwen2-Audio and SeaLLM-Audio #13760
@@ -3450,6 +3485,10 @@ int clip_n_output_tokens(const struct clip_ctx * ctx, struct clip_image_f32 * im
const int proj_stack_factor = ctx->vision_model.hparams.proj_stack_factor;
const int n_len = CLIP_ALIGN(img->nx, proj_stack_factor);
n_patches = n_len / proj_stack_factor / 2;
} else if (ctx->proj_type == PROJECTOR_TYPE_QWEN2A) {
nit: this should become a switch
(no need to change in this PR)
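For illustration, the switch suggested above could look roughly like the sketch below. This is not the code in clip.cpp: the enum, the function name `n_output_tokens_sketch`, the helper `align_up` (assumed to do the same rounding as CLIP_ALIGN) and the name PROJECTOR_TYPE_STACKED for the branch shown in the hunk are all hypothetical, and the QWEN2A case uses a placeholder because its formula is not visible in the hunk.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the projector enum; names are illustrative only.
enum projector_type_sketch {
    PROJECTOR_TYPE_STACKED, // hypothetical name for the branch shown in the hunk
    PROJECTOR_TYPE_QWEN2A,
    PROJECTOR_TYPE_OTHER,
};

// Round x up to the next multiple of n (assumed equivalent of CLIP_ALIGN).
static int align_up(int x, int n) {
    assert(n > 0);
    return ((x + n - 1) / n) * n;
}

// Placeholder only: the real QWEN2A formula is not shown in the hunk above.
static int qwen2a_n_tokens_placeholder(int nx) {
    return nx;
}

// One case per projector type instead of a growing if / else-if chain.
static int n_output_tokens_sketch(projector_type_sketch type, int nx, int proj_stack_factor) {
    switch (type) {
        case PROJECTOR_TYPE_STACKED: {
            // Same arithmetic as the lines quoted in the hunk.
            const int n_len = align_up(nx, proj_stack_factor);
            return n_len / proj_stack_factor / 2;
        }
        case PROJECTOR_TYPE_QWEN2A:
            return qwen2a_n_tokens_placeholder(nx);
        default:
            throw std::invalid_argument("unhandled projector type");
    }
}
```

A switch without a default branch would additionally let the compiler warn about enum values that have no case, which is one common argument for this style.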
tools/mtmd/clip.cpp
Outdated
if (ctx->proj_type == PROJECTOR_TYPE_QWEN2A) {
ggml_tensor * cur = inpL;
cur = ggml_transpose(ctx0, cur);
cur = ggml_cast(ctx0, cur, GGML_TYPE_F32);
Any reason to prefer ggml_cast here over ggml_cont?
I had a problem with ggml_compute_forward_pool_1d, and when looking into the source code I mistakenly thought that it only supports F32. Changed to ggml_cont in e53a0dc.
Note: I got this error without the cast or cont; maybe we should assert that the input is contiguous:
==83831==ERROR: AddressSanitizer: BUS on unknown address (pc 0x000104b899ec bp 0x00016bbc1010 sp 0x00016bbc0fd0 T0)
==83831==The signal is caused by a WRITE memory access.
==83831==Hint: this fault was caused by a dereference of a high value address (see register values below). Disassemble the provided pc to learn which register was used.
#0 0x104b899ec in ggml_compute_forward_pool_1d ops.cpp:6395
#1 0x104ae9920 in ggml_graph_compute_thread ggml-cpu.c:2847
#2 0x104ae7e78 in ggml_graph_compute ggml-cpu.c:3138
#3 0x104aef350 in ggml_backend_cpu_graph_compute(ggml_backend*, ggml_cgraph*) ggml-cpu.cpp:172
#4 0x10542513c in ggml_backend_sched_graph_compute_async ggml-backend.cpp:1594
#5 0x105424500 in ggml_backend_sched_graph_compute ggml-backend.cpp:1578
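To illustrate why a contiguity assert might help here, below is a minimal, self-contained sketch. It is a toy model, not ggml code: `view2d`, `make_view`, `transpose`, `is_contiguous`, `make_contiguous` and `row_sum` are hypothetical names. It assumes, as one plausible reading of the crash above, that a transpose only swaps sizes and strides, so a kernel that walks memory densely needs either a copy (the role ggml_cont plays) or an assert that rejects strided input.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy 2-D view: ne = element counts per dim, nb = strides in bytes.
struct view2d {
    std::array<std::size_t, 2> ne;
    std::array<std::size_t, 2> nb;
    const float * data;
};

// Dense row-major view over existing storage.
static view2d make_view(const float * data, std::size_t ne0, std::size_t ne1) {
    return { {ne0, ne1}, {sizeof(float), ne0 * sizeof(float)}, data };
}

// Transpose swaps sizes and strides; no data is moved.
static view2d transpose(view2d v) {
    std::swap(v.ne[0], v.ne[1]);
    std::swap(v.nb[0], v.nb[1]);
    return v;
}

static bool is_contiguous(const view2d & v) {
    return v.nb[0] == sizeof(float) && v.nb[1] == v.ne[0] * sizeof(float);
}

// Copy a (possibly strided) view into dense row-major storage.
static std::vector<float> make_contiguous(const view2d & v) {
    std::vector<float> out(v.ne[0] * v.ne[1]);
    const char * base = reinterpret_cast<const char *>(v.data);
    for (std::size_t i1 = 0; i1 < v.ne[1]; ++i1) {
        for (std::size_t i0 = 0; i0 < v.ne[0]; ++i0) {
            const char * p = base + i0 * v.nb[0] + i1 * v.nb[1];
            out[i1 * v.ne[0] + i0] = *reinterpret_cast<const float *>(p);
        }
    }
    return out;
}

// A kernel that reads each row densely, guarded by a contiguity assert.
static float row_sum(const view2d & v, std::size_t row) {
    assert(is_contiguous(v) && "dense kernel requires contiguous input");
    const float * p = v.data + row * v.ne[0];
    float sum = 0.0f;
    for (std::size_t i = 0; i < v.ne[0]; ++i) {
        sum += p[i];
    }
    return sum;
}
```

With the assert in place, passing a transposed view fails immediately with a readable message instead of reading or writing through the wrong addresses.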
ngxson, sorry to bother you,
Re. BAGEL: we don't have any plans for image generation support. Re. gemma 3n: sorry, I can't tell you anything about this.
Thanks for your answer and for your valuable time.
This PR adds support for:
Important
This model produces very poor results, even on text-only tasks.
It is not practically usable and often hallucinates after a short amount of generated text (even with f16 precision). However, I'm working to get it supported so I can try out Qwen2.5-Omni next (audio+image input only, no audio output).
Due to the low quality, I think I'll skip uploading quants this time. Maybe I'll need to put up a notice in multimodal.md (edit: added).
Also see the newly created discussion: #13759
Some findings that made me conclude that the text model has a problem:
It consistently mixes two languages in its output (tested both Qwen and SeaLLM, F16 model, same problem)
The official demo (non-llama.cpp) hallucinates the content of the audio. I uploaded Martin Luther King Jr.'s "I Have a Dream" speech, but it replied with nonsense music theory
Official demo: https://huggingface.co/spaces/Qwen/Qwen2-Audio-Instruct-Demo
Why do I say "nonsense music theory"? Try playing it on the piano; it doesn't sound right 😂
I compared the encoder's logits from transformers against my implementation and they matched, so it's very likely that the text model has a problem
The good thing is that SeaLLM performs well on Asian languages. But Ultravox does an even better job.
Input audio: a recording about Hue
SeaLLM (responds correctly, but misses a lot of information):
Ultravox (very detailed and accurate response):