Feature Description
Support inference with the KV cache quantized to 2 bits (KIVI). The paper reports 2.6× lower peak memory on the Llama/Mistral/Falcon models it evaluated, while enabling up to 4× larger batch sizes and a 2.35×–3.47× throughput improvement.
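For a rough sense of why this matters, here is a back-of-the-envelope sizing of the KV cache at fp16 vs. 2 bits. The shapes are assumed Llama-2-7B-like numbers, and the 2-bit figure ignores the scale/zero-point metadata a real scheme must also store:

```python
# Illustrative KV cache sizing (assumed Llama-2-7B-like shapes;
# ignores quantization metadata such as scales and zero points).
layers, heads, head_dim, seq_len = 32, 32, 128, 4096

elements = 2 * layers * heads * head_dim * seq_len   # K and V tensors
fp16_bytes = elements * 2                            # 2 bytes per element
int2_bytes = elements * 2 // 8                       # 2 bits per element

print(f"fp16  KV cache: {fp16_bytes / 2**30:.2f} GiB")  # ~2.00 GiB
print(f"2-bit KV cache: {int2_bytes / 2**30:.2f} GiB")  # ~0.25 GiB
```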
Motivation
Reduce the memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI
It was also discussed on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/
Possible Implementation
https://github.com/jy-yuan/KIVI
I find it quite interesting; it might help VRAM-constrained users a lot even without large batches or long contexts.
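Not the actual KIVI code, but a minimal sketch of the scheme the paper describes (asymmetric 2-bit quantization, per-channel for keys and per-token for values). The shapes, grouping, and function names below are illustrative assumptions, not the repo's API:

```python
# Sketch of KIVI-style asymmetric 2-bit KV quantization (assumptions,
# not the reference implementation).
import torch

def quantize_2bit(x: torch.Tensor, dim: int):
    """Asymmetric 2-bit quantization along `dim` (4 levels: 0..3).
    Quantized values are kept in uint8 here for simplicity; a real
    implementation would pack four 2-bit values per byte."""
    xmin = x.amin(dim=dim, keepdim=True)
    xmax = x.amax(dim=dim, keepdim=True)
    scale = (xmax - xmin).clamp(min=1e-8) / 3.0
    q = ((x - xmin) / scale).round().clamp(0, 3).to(torch.uint8)
    return q, scale, xmin

def dequantize_2bit(q, scale, xmin):
    return q.to(scale.dtype) * scale + xmin

# K, V: (batch, heads, seq_len, head_dim) -- illustrative shapes
K = torch.randn(1, 8, 1024, 128)
V = torch.randn(1, 8, 1024, 128)

# Per-channel quantization for keys (statistics taken over the token
# axis) and per-token quantization for values (statistics over the
# channel axis), following the split described in the KIVI paper.
qK, sK, zK = quantize_2bit(K, dim=-2)
qV, sV, zV = quantize_2bit(V, dim=-1)

K_hat = dequantize_2bit(qK, sK, zK)
V_hat = dequantize_2bit(qV, sV, zV)
print((K - K_hat).abs().mean().item(), (V - V_hat).abs().mean().item())
```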