Question about llama.cpp and llava-cli when used with llava 1.6 for vision: #5852


Closed
RandomGitUser321 opened this issue Mar 3, 2024 · 5 comments
Labels: enhancement (New feature or request), stale

Comments

@RandomGitUser321

I've been using the llava-v1.6-mistral-7b model for captioning lately. I know it's relatively new and does some things differently under the hood compared to other/older vision models.

From the little bit of testing I've done, it seems like the server falls back to llava 1.5-style vision rather than using the 1.6 mode. When I check the total token counts, they always seem really low. This seems to affect any app built on llama.cpp, like LM Studio and Jan. If I use llava-cli with the same settings, the image alone encodes to 2880 tokens, which indicates that it's encoding the tiles correctly. Is there any way to make the server behave like llava-cli, or to make llava-cli behave like a server? Am I doing something wrong?

I wrote a Python program to batch-caption folders of images, but I'm having to do it in a really hacky way: it runs a command prompt behind the scenes, captures the window's output as a log, parses the log to trim out the non-response text, formats it, saves it, and so on. The really annoying part is that the model has to be fully reloaded for each image.

For reference, this is how I'm running llava-cli:

llava-cli -m "C:\pathtomodel\llava-v1.6-mistral-7b.Q4_K_M.gguf" --mmproj "C:\pathtovision\mmproj-model-f16.gguf" --image "c:\pathtoimage\image.png" --temp 0.2 --n-gpu-layers 100 -n 2048 -c 4096 --mlock -p "<image>\nUSER:\nProvide a full description. Be as accurate and detailed as possible. \nASSISTANT:\n" >> log.txt

(The >> log.txt part is what you'd use if you were running it manually from a cmd prompt rather than from a Python script that captures the output for you.)
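
If it's useful, here's a minimal sketch of that batch loop that skips the log-file scraping by capturing stdout directly with subprocess. All paths are placeholders, llava-cli is assumed to be on PATH, and the per-image model reload is still there since each llava-cli run loads the model from scratch; you may still need to trim llava-cli's non-response output from what it prints.

# batch_caption.py - minimal sketch: run llava-cli once per image and save the caption.
# All paths below are placeholders; the same flags as the command above are assumed.
import subprocess
from pathlib import Path

MODEL = r"C:\pathtomodel\llava-v1.6-mistral-7b.Q4_K_M.gguf"
MMPROJ = r"C:\pathtovision\mmproj-model-f16.gguf"
PROMPT = "<image>\nUSER:\nProvide a full description. Be as accurate and detailed as possible.\nASSISTANT:\n"

def caption(image_path: Path) -> str:
    # Capture stdout directly instead of redirecting to log.txt and parsing it afterwards.
    result = subprocess.run(
        ["llava-cli", "-m", MODEL, "--mmproj", MMPROJ, "--image", str(image_path),
         "--temp", "0.2", "--n-gpu-layers", "100", "-n", "2048", "-c", "4096",
         "--mlock", "-p", PROMPT],
        capture_output=True, text=True, check=True)
    # llava-cli may still print some non-response text, so a trim step could go here.
    return result.stdout.strip()

if __name__ == "__main__":
    for img in sorted(Path(r"C:\pathtoimages").glob("*.png")):
        img.with_suffix(".txt").write_text(caption(img), encoding="utf-8")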

RandomGitUser321 added the enhancement (New feature or request) label on Mar 3, 2024

chigkim commented Mar 4, 2024

You're probably running into the issue where it processes the image correctly but reports an incorrect prompt token count.
#5863

RandomGitUser321 (Author) commented Mar 4, 2024

Yeah, you might be right. Within LM Studio, I tried starting a server using the llava 1.6 mistral model with a 2048 context. I used the basic Python vision template, ran it, fed it an image, and the server log says: [ERROR] Failed to find a slot in the cache. Context too small?. Error Data: n/a, Additional Data: n/a. Given that the same image shows up as 2880 tokens in llava-cli, llama.cpp is probably working correctly under the hood and just misreporting the token counts.

This also applies when you're chatting within LM Studio: you'll see the same behaviour, where an image barely uses any tokens. That might explain why it starts bugging out so easily as you ask it more questions. You look down at the token meter, see something like 1000/4096, and think you've got plenty of context room left, but in reality it's more like 2880 + 1000 = 3880/4096, right at the cusp of running out of room.

EDIT, after even more testing:
I bumped my context size up to 8192 and can now have a conversation in LM Studio about two separate images, within the same conversation, without it bugging out and getting them mixed up. That's 2880 + 2880 = 5760 tokens for the images, leaving 2432 tokens of room, while the chat still says 221/8192 tokens. I loaded an image and said "describe with exactly 10 words"; of course it didn't listen, but the replies were only about 75 tokens long each.

So yeah, this definitely looks like the server is just misreporting, and it is actually using the correct 1.6 encoding of images under the hood rather than falling back to 1.5.
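
For anyone doing this bookkeeping by hand, here's a rough sketch of the budget math. The 2880-tokens-per-image figure is just what llava-cli reported for my images, not a fixed constant, so treat it as an assumption:

# Rough context-budget check. IMAGE_TOKENS = 2880 is an assumption taken from
# what llava-cli reported above; the real cost depends on the image/tile layout.
IMAGE_TOKENS = 2880

def context_used(n_images: int, text_tokens: int) -> int:
    return n_images * IMAGE_TOKENS + text_tokens

print(context_used(1, 1000), "/ 4096")  # 3880 / 4096 -> almost full, despite the meter saying 1000
print(context_used(2, 221), "/ 8192")   # 5981 / 8192 -> two images fit with room to spare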

cjpais (Contributor) commented Mar 6, 2024

Handling two images is definitely an issue in the code for 1.6. I've poked around at it a bit but don't have clarity on why yet.

Regarding the misreported number of processed tokens, it should be fixed in PR #5896.

cjpais (Contributor) commented Mar 6, 2024

Please note that with #5882, multimodal support will be removed from the server.

If you need the fix, use the code from #5896; just note that it will not be merged at this time.

This issue was closed because it has been inactive for 14 days since being marked as stale.
