KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache #5492

Closed
@sorasoras

Description

Feature Description

KIVI quantizes the KV cache to 2 bits. Per the paper, this brings 2.6× lower peak memory usage on the Llama/Mistral/Falcon models evaluated while enabling 4× larger batch sizes, resulting in a 2.35×–3.47× throughput improvement.

Motivation

Reduce the memory used by the KV cache during long-context batch inference.
https://arxiv.org/abs/2402.02750
https://github.com/jy-yuan/KIVI
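To make the motivation concrete, here is a back-of-the-envelope estimate of KV-cache size at fp16 versus 2 bits. The model shape below (32 layers, hidden size 4096, as in a Llama-2-7B-class model) and the context/batch numbers are assumptions for illustration; the 2-bit figure ignores the small overhead of per-group scales and zero-points.

```python
n_layers, hidden, ctx, batch = 32, 4096, 4096, 8

# K and V caches, 2 bytes per fp16 element
fp16_bytes = 2 * n_layers * hidden * ctx * batch * 2
q2_bytes = fp16_bytes * 2 / 16  # 2 bits vs. 16 bits per element

print(f"fp16 KV cache:  {fp16_bytes / 2**30:.1f} GiB")  # ~16.0 GiB
print(f"2-bit KV cache: {q2_bytes / 2**30:.1f} GiB")    # ~2.0 GiB
```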

It was also posted on Reddit:
https://www.reddit.com/r/LocalLLaMA/comments/1ap3bkt/kv_cache_is_huge_and_bottlenecks_llm_inference_we/

Possible Implementation

https://github.com/jy-yuan/KIVI
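Below is a minimal NumPy sketch of the asymmetric quantization idea from the paper: the key cache is quantized per-channel (one scale per channel, computed across tokens) and the value cache per-token (one scale per token, computed across channels). Shapes, group handling, and function names here are illustrative assumptions, not the reference implementation.

```python
import numpy as np

def quantize_2bit(x: np.ndarray, axis: int):
    """Asymmetric 2-bit quantization along `axis` (4 levels, 0..3)."""
    xmin = x.min(axis=axis, keepdims=True)
    xmax = x.max(axis=axis, keepdims=True)
    scale = (xmax - xmin) / 3.0               # 2 bits -> 4 quantization levels
    scale = np.where(scale == 0, 1.0, scale)  # avoid division by zero
    q = np.clip(np.round((x - xmin) / scale), 0, 3).astype(np.uint8)
    return q, scale, xmin

def dequantize(q, scale, zero):
    return q.astype(np.float32) * scale + zero

# Toy KV cache for one head: (n_tokens, head_dim)
rng = np.random.default_rng(0)
K = rng.standard_normal((8, 16)).astype(np.float32)
V = rng.standard_normal((8, 16)).astype(np.float32)

# Keys: per-channel (reduce over the token axis, axis=0).
# Values: per-token (reduce over the channel axis, axis=1).
qK, sK, zK = quantize_2bit(K, axis=0)
qV, sV, zV = quantize_2bit(V, axis=1)

print("max key error:  ", np.abs(dequantize(qK, sK, zK) - K).max())
print("max value error:", np.abs(dequantize(qV, sV, zV) - V).max())
```

Note that the actual KIVI method additionally keeps a small window of the most recent tokens in full precision and quantizes in groups; see the repo above for the reference implementation.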

I find it quite interesting; it could help VRAM-poor users substantially, even without large batches or long contexts.
