Question about llama.cpp and llava-cli when used with llava 1.6 for vision: #5852


Closed
RandomGitUser321 opened this issue Mar 3, 2024 · 5 comments
Labels: enhancement (New feature or request), stale

Comments

@RandomGitUser321

I've been using the llava-v1.6-mistral-7b model for captioning lately. I know it's relatively new and does some things differently under the hood compared to other/older vision models.

From the little bit of testing I've done, it seems like the server falls back to llava 1.5-style vision rather than using the 1.6 mode. When I check the total token counts, they always seem really low. This seems to affect any app built on llama.cpp, like LM Studio and Jan. If I use llava-cli with the same settings, the image alone encodes to 2880 tokens, which indicates that it's encoding the tiles correctly. Is there any way to make the server behave like llava-cli, or to make llava-cli behave like a server? Am I doing something wrong?

I wrote a Python program to batch-caption folders of images, but I'm having to do it in a really hacky way: it runs a command prompt behind the scenes, captures the window's output as a log, parses the log to trim out the non-response text, formats it, saves it, and so on. The really annoying part is that the model has to be fully reloaded for each image.

For reference, this is how I'm running llava-cli:

llava-cli -m "C:\pathtomodel\llava-v1.6-mistral-7b.Q4_K_M.gguf" --mmproj "C:\pathtovision\mmproj-model-f16.gguf" --image "c:\pathtoimage\image.png" --temp 0.2 --n-gpu-layers 100 -n 2048 -c 4096 --mlock -p "<image>\nUSER:\nProvide a full description. Be as accurate and detailed as possible. \nASSISTANT:\n" >> log.txt

(The >> log.txt part is what you'd use if you were running it manually from a cmd prompt rather than from a Python script that captures the output for you.)
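
If it's useful, here's a minimal sketch of that batch loop that skips the log-file scraping by capturing stdout directly with subprocess. All paths are placeholders, llava-cli is assumed to be on PATH, and the per-image model reload is still there since each llava-cli run loads the model from scratch; you may still need to trim llava-cli's non-response output from what it prints.

# batch_caption.py - minimal sketch: run llava-cli once per image and save the caption.
# All paths below are placeholders; the same flags as the command above are assumed.
import subprocess
from pathlib import Path

MODEL = r"C:\pathtomodel\llava-v1.6-mistral-7b.Q4_K_M.gguf"
MMPROJ = r"C:\pathtovision\mmproj-model-f16.gguf"
PROMPT = "<image>\nUSER:\nProvide a full description. Be as accurate and detailed as possible.\nASSISTANT:\n"

def caption(image_path: Path) -> str:
    # Capture stdout directly instead of redirecting to log.txt and parsing it afterwards.
    result = subprocess.run(
        ["llava-cli", "-m", MODEL, "--mmproj", MMPROJ, "--image", str(image_path),
         "--temp", "0.2", "--n-gpu-layers", "100", "-n", "2048", "-c", "4096",
         "--mlock", "-p", PROMPT],
        capture_output=True, text=True, check=True)
    # llava-cli may still print some non-response text, so a trim step could go here.
    return result.stdout.strip()

if __name__ == "__main__":
    for img in sorted(Path(r"C:\pathtoimages").glob("*.png")):
        img.with_suffix(".txt").write_text(caption(img), encoding="utf-8")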

RandomGitUser321 added the enhancement (New feature or request) label on Mar 3, 2024

chigkim commented Mar 4, 2024

You're probably running into the issue where it processes the image correctly but reports an incorrect prompt token count.
#5863

RandomGitUser321 (Author) commented Mar 4, 2024

Yeah, you might be right. Within LM Studio, I tried starting a server using the llava 1.6 mistral model with a 2048 context. I used the basic Python vision template, ran it, fed it an image, and the server log says: [ERROR] Failed to find a slot in the cache. Context too small?. Error Data: n/a, Additional Data: n/a. Given that the same image shows up as 2880 tokens in llava-cli, llama.cpp is probably working correctly under the hood and just misreporting the token counts.

This also applies when you're chatting within LM Studio: you'll see the same behaviour, where an image barely uses any tokens. That might explain why it starts bugging out so easily as you ask it more questions. You look down at the token meter, see something like 1000/4096, and think you've got plenty of context room left, but in reality it's more like 2880 + 1000 = 3880/4096, right at the cusp of running out of room.

EDIT, after even more testing:
I bumped my context size up to 8192 and can now have a conversation in LM Studio about two separate images, within the same conversation, without it bugging out and getting them mixed up. That's 2880 + 2880 = 5760 tokens for the images, leaving 2432 tokens of room, while the chat still says 221/8192 tokens. I loaded an image and said "describe with exactly 10 words"; of course it didn't listen, but the replies were only about 75 tokens long each.

So yeah, this definitely looks like the server is just misreporting, and it is actually using the correct 1.6 encoding of images under the hood rather than falling back to 1.5.
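
For anyone doing this bookkeeping by hand, here's a rough sketch of the budget math. The 2880-tokens-per-image figure is just what llava-cli reported for my images, not a fixed constant, so treat it as an assumption:

# Rough context-budget check. IMAGE_TOKENS = 2880 is an assumption taken from
# what llava-cli reported above; the real cost depends on the image/tile layout.
IMAGE_TOKENS = 2880

def context_used(n_images: int, text_tokens: int) -> int:
    return n_images * IMAGE_TOKENS + text_tokens

print(context_used(1, 1000), "/ 4096")  # 3880 / 4096 -> almost full, despite the meter saying 1000
print(context_used(2, 221), "/ 8192")   # 5981 / 8192 -> two images fit with room to spare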

cjpais (Contributor) commented Mar 6, 2024

Handling two images is definitely an issue in the code for 1.6. I've poked around at it a bit but don't have clarity on why yet.

Regarding the misreported number of processed tokens, it should be fixed in PR #5896.

cjpais (Contributor) commented Mar 6, 2024

Please note that with #5882, multimodal support will be removed from the server.

If you need the fix, use the code from #5896; just note that it will not be merged at this time.

This issue was closed because it has been inactive for 14 days since being marked as stale.
