Replies: 1 comment 1 reply
-
You can check #386. I did not mention it there, but the flash attention op in ggml only works with F16 K and V, so that is where it is happening.
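A minimal sketch of what this implies on the graph-building side, assuming the current `ggml_flash_attn_ext` signature (older ggml revisions take fewer trailing arguments) and using `ggml_cast` to convert K/V; the helper name `build_attn` and the tensor shapes are illustrative, not the actual stable-diffusion.cpp code:

```c
// Sketch: cast K and V to F16 before the flash-attention op, since
// ggml's flash attention only accepts F16 K/V.
#include "ggml.h"

struct ggml_tensor * build_attn(struct ggml_context * ctx,
                                struct ggml_tensor * q,   // e.g. [d_head, n_tokens, n_head]
                                struct ggml_tensor * k,
                                struct ggml_tensor * v,
                                float scale) {
    // K/V may arrive as F32 from elsewhere in the graph; ggml_cast
    // inserts a conversion node so the op sees the type it requires.
    if (k->type != GGML_TYPE_F16) k = ggml_cast(ctx, k, GGML_TYPE_F16);
    if (v->type != GGML_TYPE_F16) v = ggml_cast(ctx, v, GGML_TYPE_F16);

    // Signature as in recent ggml; older revisions omit logit_softcap.
    return ggml_flash_attn_ext(ctx, q, k, v, /*mask*/ NULL,
                               scale, /*max_bias*/ 0.0f, /*logit_softcap*/ 0.0f);
}
```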
-
Has anyone successfully experimented with quantizing the latent image tensor to F16 or Q8? I would guess this could help a lot with generating high-resolution images in limited memory, provided F32 precision isn't actually needed, of course.
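For a rough sense of the potential savings, here is a back-of-the-envelope sketch. It assumes the usual SD-style VAE latent (4 channels at 1/8 spatial resolution) and Q8_0's layout of 34 bytes per 32-element block; both are assumptions about the setup, not anything stated in the thread:

```c
// Back-of-the-envelope latent memory at different precisions.
#include <stdio.h>

int main(void) {
    const int w = 2048, h = 2048;                 // example target image size
    const long n = (long)(w / 8) * (h / 8) * 4;   // latent element count (assumed layout)

    printf("latent elements: %ld\n", n);
    printf("f32:  %.2f MiB\n", n * 4.0         / (1024.0 * 1024.0));
    printf("f16:  %.2f MiB\n", n * 2.0         / (1024.0 * 1024.0));
    printf("q8_0: %.2f MiB\n", n * (34.0/32.0) / (1024.0 * 1024.0)); // 34 bytes per 32-elem block
    return 0;
}
```

For a 2048×2048 target this puts the F32 latent at about 1 MiB, so shrinking the latent itself saves little in absolute terms; how much it helps overall depends on how many F32 copies of it the pipeline keeps around at once.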