Description
Prerequisites
- I am running the latest code. Mention the version if possible as well.
- I carefully followed the README.md.
- I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
- I reviewed the Discussions, and have a new and useful enhancement to share.
Feature Description
Mistral's API allows prefixing the assistant's answer with a specified string. Excerpt from the documentation:
messages=[
    {"role": "system", "content": system},
    {"role": "user", "content": question},
    {"role": "assistant", "content": prefix, "prefix": True},  # <------- this line here is new
],
This makes it so that the next answer by the assistant starts with the given prefix.
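For reference, the same request can be made against Mistral's HTTP endpoint directly with requests. The following is only a rough sketch to show the request shape; the model alias and the system/user strings are placeholders of mine, and the exact response handling is an assumption:

# Sketch of a prefixed request against Mistral's chat completions endpoint.
# Model alias and prompts are made-up placeholders; not run here.
import os
import requests

url = "https://api.mistral.ai/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.environ['MISTRAL_API_KEY']}"}
prefix = "```java\nint add(int x, int y){"
data = {
    "model": "mistral-small-latest",  # assumed model alias
    "messages": [
        {"role": "system", "content": "Only provide code. Do not write explanations."},
        {"role": "user", "content": "Implement an addition function in Java."},
        {"role": "assistant", "content": prefix, "prefix": True},
    ],
}
with requests.post(url, json=data, headers=headers) as response:
    # Per the documentation excerpt above, the answer should start with the prefix.
    print(response.json()["choices"][0]["message"]["content"])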
Motivation
The option to prefix the assistant's answer gives a great deal of control over the model's generation while being much simpler to use than the alternatives.
For example, to force the model to answer directly with Java code using a specific function signature, the prefix could be "```java\nint add(int x, int y){". This technique is used to generate code for benchmarks such as HumanEval to prevent the models from going off the rails.
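For illustration, a benchmark harness could then recover plain code by gluing the forced prefix to the model's continuation and stripping the markdown fence. The helper below is hypothetical and not taken from any existing harness:

# Hypothetical helper: combine the forced prefix with the model's continuation
# and strip the markdown fence to obtain plain Java code for evaluation.
def extract_code(prefix: str, completion: str) -> str:
    full = completion if completion.startswith(prefix) else prefix + completion
    # Drop the leading "```java\n" fence line and anything after the closing "```".
    body = full.split("\n", 1)[1] if full.startswith("```") else full
    return body.split("```", 1)[0].rstrip()

print(extract_code("```java\nint add(int x, int y){", " return x + y; }\n```"))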
Possible Implementation
A full usage example could look something like this:
# Example to generate a function named "quacksort".
# Currently, llama-server ignores the prefix and generates "quicksort" instead.
import requests

def does_not_work_yet():
    url = "http://localhost:8080/v1/chat/completions"
    prefix = "```go\nfunc quacksort"
    data = {
        "messages": [
            {"role": "system", "content": "Only provide code. Do not write explanations."},
            {"role": "user", "content": "Implement quicksort."},
            {"role": "assistant", "content": prefix, "prefix": True},  # <----- this line here is new
        ],
        "seed": 0,
    }
    with requests.post(url, json=data) as response:
        content = response.json()["choices"][0]["message"]["content"]
    print(content)

if __name__ == "__main__":
    does_not_work_yet()
(I used the qwen2.5-coder-7b-instruct-q3_k_m model: llama-server --model qwen2.5-coder-7b-instruct-q3_k_m.gguf --host 127.0.0.1 --port 8080)
The expected result can be obtained with the raw completion API, but this is not portable from model to model since it requires knowledge of the prompt format. It is also more complicated and generally error-prone, since a single misplaced whitespace or line break can have a significant impact on generation quality.
import requests

def works_but_ugly():
    url = "http://localhost:8080/completion"
    prefix = "```go\nfunc quacksort"
    prompt = f"""<|im_start|>system
Only provide code. Do not write explanations.<|im_end|>
<|im_start|>user
Implement quicksort.<|im_end|>
<|im_start|>assistant
{prefix}"""
    data = {
        "prompt": prompt,
        "seed": 0,
    }
    with requests.post(url, json=data) as response:
        content = prefix + response.json()["content"]
    print(content)

if __name__ == "__main__":
    works_but_ugly()
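On the server side, the feature essentially asks llama-server to do automatically what works_but_ugly does by hand: apply the chat template up to the opening of the assistant turn and then append the prefix instead of leaving the turn empty. The sketch below is only a rough Python illustration of that idea, not llama.cpp code, and hard-codes ChatML purely because that is the format the model above uses; a real implementation would go through the model's chat template.

# Rough sketch (illustration only): if the last message is an assistant message
# with "prefix": True, format the conversation as usual, open the assistant
# turn, and append the prefix so generation continues from it.
def build_prompt(messages: list[dict]) -> str:
    prefix = ""
    if messages and messages[-1]["role"] == "assistant" and messages[-1].get("prefix"):
        prefix = messages[-1]["content"]
        messages = messages[:-1]
    prompt = ""
    for m in messages:
        prompt += f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n"
    return prompt + "<|im_start|>assistant\n" + prefix  # generation continues from here

Applied to the messages from the first example, this produces exactly the raw prompt that works_but_ugly builds manually.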