
Feature Request: llama-server support continue_final_message #11755

Closed
@DIYer22

Description


Prerequisites

  • I am running the latest code. Mention the version if possible as well.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new and useful enhancement to share.

Feature Description

Both transformers and vLLM support a `continue_final_message` parameter, which lets the model continue writing the final message of the conversation instead of starting a new one.

Description from vLLM:

     "If this is set, the chat will be formatted so that the final "
     "message in the chat is open-ended, without any EOS tokens. The "
     "model will continue this message rather than starting a new one. "
     "This allows you to \"prefill\" part of the model's response for it. "
     "Cannot be used at the same time as `add_generation_prompt`."

I hope llama-server will support this feature too.
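
For reference, this is roughly how the equivalent parameter is used with transformers' `apply_chat_template` (a minimal sketch; the model name is only illustrative, and it assumes a transformers version recent enough to accept `continue_final_message`):

    # Sketch of the equivalent behavior in transformers (cited above as prior art).
    from transformers import AutoTokenizer

    # Any chat model with a chat template works; this name is just an example.
    tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

    messages = [
        {"role": "user", "content": "List three prime numbers."},
        # Partially written assistant turn that the model should continue.
        {"role": "assistant", "content": "Sure! Three primes are 2,"},
    ]

    # With continue_final_message=True the template is rendered without closing
    # the last assistant turn (no EOS / end-of-turn token), so generation resumes
    # mid-message. It cannot be combined with add_generation_prompt=True.
    prompt = tokenizer.apply_chat_template(
        messages,
        continue_final_message=True,
        add_generation_prompt=False,
        tokenize=False,
    )
    print(prompt)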

Motivation

  1. This is very helpful for user-controllable generation, e.g. prefilling the start of the model's response.
  2. When a response is truncated by max_tokens, the user can continue generating from where it left off to obtain a longer response.

Possible Implementation

No response
