Implement 4-bit quantized KV Cache for faster performance and to enable longer context

# Prerequisites

Please answer the following questions for yourself before submitting an issue.

- [x] I am running the latest code. Development is very rapid so there are no tagged versions as of now.
- [x] I carefully followed the [README.md](https://github.com/ggerganov/llama.cpp/blob/master/README.md).
- [x] I [searched using keywords relevant to my issue](https://docs.github.com/en/issues/tracking-your-work-with-issues/filtering-and-searching-issues-and-pull-requests) to make sure that I am creating a new issue that is not already open (or closed).
- [x] I reviewed the [Discussions](https://github.com/ggerganov/llama.cpp/discussions), and have a new bug or useful enhancement to share.

# Feature Description
A recent paper from UC Berkeley investigated 4-bit quantization of the KV cache for better performance and longer context. Given llama.cpp's emphasis on efficient inference through quantization, particularly on CPU platforms, this seems right up llama.cpp's alley.
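
To put rough numbers on the longer-context point, here is a back-of-the-envelope sketch. The helper name `kv_bytes_per_token` is hypothetical, and it assumes plain multi-head attention (no grouped-query attention), so every layer caches one K and one V vector of `n_embd` values per token. The 4.5 bits per value is also an assumption: 32 four-bit codes plus one 16-bit scale per block.

```cpp
// Hypothetical helper, not part of llama.cpp: approximate KV cache bytes per token.
// Assumes each layer stores one K and one V vector of n_embd values per token.
constexpr double kv_bytes_per_token(int n_layer, int n_embd, double bits_per_value) {
    // 2 tensors (K and V) * layers * values per vector * bits, converted to bytes
    return 2.0 * n_layer * n_embd * bits_per_value / 8.0;
}
```

For a LLaMA-7B shaped model (32 layers, embedding size 4096), `kv_bytes_per_token(32, 4096, 16.0)` gives 512 KiB per token at fp16, while `kv_bytes_per_token(32, 4096, 4.5)` gives 144 KiB, so roughly 3.5x more context would fit in the same memory under these assumptions.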

# Motivation

Better performance (custom CUDA kernels can reportedly give around 40% faster inference) and longer context are always beneficial to LLM users!

# Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

https://arxiv.org/abs/2401.18079
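
As a starting point for discussion, here is a minimal sketch of what storing cache values in 4 bits could look like. It is a plain symmetric round-to-nearest scheme with one scale per block of 32 values, swapped in purely for illustration. It is not the method from the linked paper and not llama.cpp's actual cache code. The names `BlockQ4`, `quantize_q4`, `dequantize_q4`, and the block size are all assumptions.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlockSize = 32;  // values per block (assumed for this sketch)

// Hypothetical block layout: one scale plus 32 packed 4-bit codes.
struct BlockQ4 {
    float   scale;                     // a real format would likely use fp16 here
    uint8_t nibbles[kBlockSize / 2];   // two 4-bit codes per byte
};

// Quantize n floats (n must be a multiple of kBlockSize) into 4-bit blocks.
std::vector<BlockQ4> quantize_q4(const float* x, std::size_t n) {
    std::vector<BlockQ4> out(n / kBlockSize);
    for (std::size_t b = 0; b < out.size(); ++b) {
        const float* src = x + b * kBlockSize;
        float amax = 0.0f;
        for (int i = 0; i < kBlockSize; ++i) amax = std::max(amax, std::fabs(src[i]));
        const float scale = amax / 7.0f;  // map [-amax, amax] onto codes [-7, 7]
        const float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;
        out[b].scale = scale;
        for (int i = 0; i < kBlockSize; i += 2) {
            // Round to nearest, clamp, then shift by 8 so the code fits in an unsigned nibble.
            int q0 = std::min(7, std::max(-7, (int)std::lround(src[i]     * inv))) + 8;
            int q1 = std::min(7, std::max(-7, (int)std::lround(src[i + 1] * inv))) + 8;
            out[b].nibbles[i / 2] = (uint8_t)(q0 | (q1 << 4));
        }
    }
    return out;
}

// Reconstruct approximate floats from the 4-bit blocks.
std::vector<float> dequantize_q4(const std::vector<BlockQ4>& blocks) {
    std::vector<float> y(blocks.size() * kBlockSize);
    for (std::size_t b = 0; b < blocks.size(); ++b) {
        for (int i = 0; i < kBlockSize / 2; ++i) {
            const uint8_t byte = blocks[b].nibbles[i];
            y[b * kBlockSize + 2 * i]     = ((int)(byte & 0x0F) - 8) * blocks[b].scale;
            y[b * kBlockSize + 2 * i + 1] = ((int)(byte >> 4)   - 8) * blocks[b].scale;
        }
    }
    return y;
}
```

With this scheme the reconstruction error per value is bounded by about half the block's scale, so a single outlier in a block costs precision for its 31 neighbours. That tradeoff is one reason a real implementation would probably want something more careful than this sketch.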


Implement 4-bit quantized KV Cache for faster performance and to enable longer context #6863
