Question about llama.cpp and llava-cli when used with llava 1.6 for vision: #5852
Comments
You're probably running into the issue where it processes the image correctly but reports an incorrect prompt token count.
Yeah, you might be right. Within LM Studio, I tried starting a server using the llava 1.6 mistral model with a 2048 context. I used the basic Python vision template, ran it, fed it an image, and the server log reported the same suspiciously low prompt token count.

This also applies when you're chatting within LM Studio: you'll see the same behaviour, where an image barely uses any tokens. That might explain why it starts bugging out so easily as you ask it more questions. You look down at the token meter, see something like 1000/4096, and think you've got plenty of context room left, but in reality it's more like 2880 + 1000 = 3880/4096, right at the cusp of running out.

EDIT, after even more testing: this definitely seems to be the server misreporting the count. It is actually using the correct 1.6 encoding of images under the hood, rather than falling back to 1.5.
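To make the budgeting concrete, here's a rough sketch of how you could estimate the real context usage yourself. The 2880-tokens-per-image figure is just the value llava-cli reports for 1.6, not something the server exposes, so treat it as an assumption:

```python
# Rough sketch: estimate real context usage when the server/UI only
# counts text tokens. TOKENS_PER_IMAGE is the observed llava 1.6 cost
# reported by llava-cli, not a value the server provides.
TOKENS_PER_IMAGE = 2880

def real_usage(reported_text_tokens: int, num_images: int, n_ctx: int = 4096):
    used = reported_text_tokens + num_images * TOKENS_PER_IMAGE
    return used, n_ctx - used

used, remaining = real_usage(reported_text_tokens=1000, num_images=1)
print(f"meter: 1000/4096, actual: {used}/4096, headroom: {remaining}")
# meter: 1000/4096, actual: 3880/4096, headroom: 216
```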
Handling two images is definitely an issue in the code for 1.6. I've poked around at it a bit but don't have clarity on why. As for the misreported number of processed tokens, that should be fixed in PR #5896.
This issue was closed because it has been inactive for 14 days since being marked as stale.
I've been using the llava-v1.6-mistral-7b model for doing captions lately. I know it's relatively new and does some different things under the hood vs other/older vision models.
From the little bit of testing I've done, it seems like the server falls back to llava 1.5 vision handling rather than using the 1.6 mode; whenever I check the total token counts, they always come out really low. This seems to affect any app that uses llama.cpp's server, like LM Studio and Jan. If I use llava-cli with the same settings, the image alone encodes to 2880 tokens, which indicates that it's encoding the tiles correctly. Is there any way to make the server use llava-cli? Any way to make llava-cli behave like a server? Am I doing something wrong?
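(As a sanity check, 2880 lines up with what I'd expect from the 1.6 multi-crop scheme, assuming the usual base image plus four 336×336 crops through the CLIP ViT-L/14 projector, each contributing 24×24 = 576 patch tokens:)

```python
# Back-of-the-envelope check on the 2880 figure (assumptions: a base
# image plus 4 crops, each run through CLIP ViT-L/14 at 336px).
patch = 14
crop = 336
tokens_per_crop = (crop // patch) ** 2   # 24 * 24 = 576
num_views = 1 + 4                        # base image + four tiles
print(num_views * tokens_per_crop)       # 2880
```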
I wrote a Python program to batch-caption folders of images, but I'm having to do it in a really hacky way: it basically runs a command prompt behind the scenes, captures the output of the window as a log, parses the log to trim out the non-response text, formats it, saves it, and so on (see the sketch after the command below). The really annoying part is that it has to fully reload the model for each image.
For reference, this is how I'm running llava-cli:
llava-cli -m "C:\pathtomodel\llava-v1.6-mistral-7b.Q4_K_M.gguf" --mmproj "C:\pathtovision\mmproj-model-f16.gguf" --image "c:\pathtoimage\image.png" --temp 0.2 --n-gpu-layers 100 -n 2048 -c 4096 --mlock -p "<image>\nUSER:\nProvide a full description. Be as accurate and detailed as possible. \nASSISTANT:\n" >> log.txt
(The >> log.txt part is what you'd use if you were running it manually straight from a cmd prompt rather than from a Python script that captures the output for you.)
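For anyone curious, the hacky batch script boils down to something like the outline below. The paths are placeholders and the "parse the log" step is reduced to a crude string split, so treat it as a sketch rather than the real script, but it shows why the per-image model reload hurts so much (every call to caption() pays the full load cost):

```python
import subprocess
from pathlib import Path

# Rough outline of the batch-captioning hack described above.
# All paths are placeholders; llava-cli reloads the model from scratch
# for every image, which is what makes this approach so slow.
LLAVA_CLI = r"C:\path\to\llava-cli.exe"
MODEL = r"C:\pathtomodel\llava-v1.6-mistral-7b.Q4_K_M.gguf"
MMPROJ = r"C:\pathtovision\mmproj-model-f16.gguf"
PROMPT = (r"<image>\nUSER:\nProvide a full description. "
          r"Be as accurate and detailed as possible. \nASSISTANT:\n")

def caption(image_path: Path) -> str:
    proc = subprocess.run(
        [LLAVA_CLI, "-m", MODEL, "--mmproj", MMPROJ,
         "--image", str(image_path), "--temp", "0.2",
         "--n-gpu-layers", "100", "-n", "2048", "-c", "4096",
         "--mlock", "-p", PROMPT],
        capture_output=True, text=True)
    # Crude "parse the log" step: keep whatever follows the last
    # ASSISTANT marker, or the whole output if it isn't echoed.
    return proc.stdout.split("ASSISTANT:")[-1].strip()

if __name__ == "__main__":
    folder = Path(r"C:\pathtoimages")
    for img in sorted(folder.glob("*.png")):
        img.with_suffix(".txt").write_text(caption(img), encoding="utf-8")
```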