[Model] Activated LoRA #19710
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @tdoublep, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers the foundational components for Activated LoRA (aLoRA) within the vLLM framework. It enables dynamic application of LoRA weights based on prompt content, introduces necessary metadata structures, optimizes KV cache usage for these requests, and provides comprehensive examples for testing and deployment.
Highlights
- Activated LoRA (aLoRA) Implementation: This PR introduces the initial working implementation of Activated LoRA (aLoRA), a technique that allows for selective application of LoRA weights based on specific 'invocation tokens' within the prompt. This enables dynamic switching between base model and LoRA weights during inference.
- Core aLoRA Logic: The central mechanism for aLoRA is implemented in `vllm/lora/layers.py`, where a mask is dynamically generated based on `k_offsets` and `query_start_locs` (see the sketch after this list). This mask determines which parts of the output should use the base model's computations and which should incorporate the LoRA adjustments, effectively blending the two outputs.
- KV Cache Optimization for aLoRA: To optimize performance, the KV cache utility (`vllm/v1/core/kv_cache_utils.py`) has been updated to allow KV cache sharing for aLoRA requests. Specifically, the portion of the prompt before the aLoRA invocation tokens is treated as a base model request for caching purposes, reducing redundant computations.
- Metadata and Request Handling: New fields (`invocation_tokens`, `k_offset`) have been added to `LoRARequest` to define the aLoRA activation. A new `ALoRAMetadata` dataclass is introduced in `vllm/forward_context.py` to pass these activation-specific details through the model's forward pass. The engine processor and GPU model runner are updated to extract and utilize this metadata, including tokenizing invocation strings from adapter configurations.
- Testing and Examples: New example scripts (`examples/alora/alora_server_testing.py`, `alora_server_testing.sh`, `new_alora_testing.py`) are provided to demonstrate how to set up and interact with a vLLM server running aLoRA, both via the OpenAI-compatible API and directly through the vLLM Python API.
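To make the core masking logic described above more concrete, here is a minimal, self-contained sketch of the blending idea. This is not the PR's actual code: the names `k_offsets` and `query_start_locs` come from the summary above, but their exact shapes and semantics here are assumptions.

```python
# Sketch only: blend base-model output and LoRA-adapted output per token.
# Assumption: query_start_locs[i]..query_start_locs[i+1] delimits request i in the
# flattened batch, and k_offsets[i] is how many of its trailing tokens come after
# the invocation point (i.e. should use the LoRA-adapted output).
import torch


def blend_base_and_lora(
    base_out: torch.Tensor,          # [num_tokens, hidden]
    lora_out: torch.Tensor,          # [num_tokens, hidden]
    query_start_locs: torch.Tensor,  # [num_requests + 1]
    k_offsets: torch.Tensor,         # [num_requests]
) -> torch.Tensor:
    mask = torch.zeros(base_out.shape[0], 1,
                       dtype=base_out.dtype, device=base_out.device)
    for i in range(k_offsets.numel()):
        start = int(query_start_locs[i])
        end = int(query_start_locs[i + 1])
        # Only the last k_offsets[i] tokens of this request are "activated".
        mask[max(start, end - int(k_offsets[i])):end] = 1.0
    # Tokens before the invocation point keep the base output; the rest use LoRA.
    return base_out * (1.0 - mask) + lora_out * mask
```

In the real layer this would be vectorized and driven by the metadata described in the PR; the explicit loop here is only for readability.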
Code Review
This pull request introduces an initial implementation of Activated LoRA (aLoRA). The changes include adding new example scripts, modifying core components like the forward context, LoRA request, KV cache utilities, scheduler, processor, and GPU model runner to support aLoRA metadata extraction and application. The core logic for identifying the aLoRA invocation sequence and applying the mask seems correctly implemented. Feedback includes addressing a type mismatch in a metadata class, removing a debug print statement, and clarifying the purpose of layer registration in the compilation config.
Co-authored-by: Greenewald <[email protected]> Co-authored-by: Allison Li <[email protected]> Signed-off-by: Thomas Parnell <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
…th torch.compile Signed-off-by: Thomas Parnell <[email protected]>
`import regex as re`
(this is needed due to some linting issue on main currently, unrelated to this PR)
… checking activated lora flag Signed-off-by: Thomas Parnell <[email protected]>
Purpose
This PR adds support for Activated LoRA (aLoRA): a new family of LoRA adapters that are invoked by including an invocation string in the prompt, with the weights adapted only for the tokens that appear after the invocation string. This means that an aLoRA can be applied deep in a multi-turn interaction with the model without needing to recompute the entire KV cache: the adapter can reuse the base model's KV cache right up until the point where it is invoked, significantly reducing TTFT.
paper: https://arxiv.org/abs/2504.12397
blog: https://research.ibm.com/blog/inference-friendly-aloras-lora
Results from the paper: [figure omitted]
Implementation
We have tried to make the changes as unintrusive as possible (but happy to hear any suggestions for how the PR can be improved).
If one sets the `--enable-activated-lora` flag, then the following happens:
- The invocation tokens from the adapter configuration are tokenized, and the position where they start in the prompt is stored as `invocation_start` in the `lora_request` object.
- The `invocation_start` information is used to determine whether base-model KV cache blocks can be re-used.
- An `ALoRAMetadata` class is used to pass one mask tensor down to the LoRA layer.

We have tested that the integration works with: …
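For illustration, here is a minimal sketch of how such metadata and the KV-cache reuse boundary could look. The field names and the block arithmetic are assumptions based on the description above, not the PR's actual `ALoRAMetadata` or `kv_cache_utils` code.

```python
# Sketch only: assumed field names and simplified prefix-cache arithmetic.
from dataclasses import dataclass

import torch


@dataclass
class ALoRAMetadataSketch:
    """Per-batch activation info threaded down to the LoRA layers."""
    k_offsets: torch.Tensor          # [num_requests] tokens after each invocation point
    query_start_locs: torch.Tensor   # [num_requests + 1] flattened-batch boundaries


def reusable_prefix_blocks(invocation_start: int, block_size: int) -> int:
    """Full KV-cache blocks before the invocation point that hash identically
    to a base-model request and can therefore be shared."""
    return max(invocation_start, 0) // block_size


# Example: with 16-token blocks and the invocation string starting at token 70,
# the first 4 blocks (64 tokens) are reusable from the base-model cache.
assert reusable_prefix_blocks(70, 16) == 4
```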
Test Plan
We have included an offline example using an uncertainty detection aLoRA.
If the community would like to have this feature in vLLM, we are happy to add more extensive unit and integration tests.
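A rough sketch of what such an offline flow could look like with the vLLM Python API is shown below. The base model, adapter path, invocation string, and the `enable_activated_lora` engine argument are all assumptions inferred from the `--enable-activated-lora` flag above; see the scripts under `examples/alora/` in this PR for the actual usage.

```python
# Hedged sketch, not the PR's example script.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="ibm-granite/granite-3.1-8b-instruct",  # placeholder base model
    enable_lora=True,
    enable_activated_lora=True,  # assumed spelling of the engine-level flag
)

# Multi-turn prompt; the trailing invocation string (placeholder here) activates
# the adapter only for the tokens that follow it.
prompt = (
    "<|user|>\nWhat is the capital of France?\n"
    "<|assistant|>\nParis.\n"
    "<|uncertainty|>"  # placeholder invocation string
)

outputs = llm.generate(
    [prompt],
    SamplingParams(temperature=0.0, max_tokens=8),
    lora_request=LoRARequest("uncertainty-alora", 1, "/path/to/alora-adapter"),
)
print(outputs[0].outputs[0].text)
```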
Test Result
I've included some debug print statements in the scheduler to illustrate explicitly the KV cache re-use when applying the aLoRA:
(Optional) Documentation Update
tbd