[Model] Activated LoRA #19710
Conversation
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; instead, only a small subset of checks runs automatically. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either: Add 🚀 …
Summary of Changes
Hello @tdoublep, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request delivers the foundational components for Activated LoRA (aLoRA) within the vLLM framework. It enables dynamic application of LoRA weights based on prompt content, introduces necessary metadata structures, optimizes KV cache usage for these requests, and provides comprehensive examples for testing and deployment.
Highlights
- Activated LoRA (aLoRA) Implementation: This PR introduces the initial working implementation of Activated LoRA (aLoRA), a technique that allows for selective application of LoRA weights based on specific 'invocation tokens' within the prompt. This enables dynamic switching between base model and LoRA weights during inference.
- Core aLoRA Logic: The central mechanism for aLoRA is implemented in `vllm/lora/layers.py`, where a mask is dynamically generated based on `k_offsets` and `query_start_locs` (see the sketch after this list). This mask determines which parts of the output should use the base model's computations and which should incorporate the LoRA adjustments, effectively blending the two outputs.
- KV Cache Optimization for aLoRA: To optimize performance, the KV cache utility (`vllm/v1/core/kv_cache_utils.py`) has been updated to allow KV cache sharing for aLoRA requests. Specifically, the portion of the prompt before the aLoRA invocation tokens is treated as a base model request for caching purposes, reducing redundant computations.
- Metadata and Request Handling: New fields (`invocation_tokens`, `k_offset`) have been added to `LoRARequest` to define the aLoRA activation. A new `ALoRAMetadata` dataclass is introduced in `vllm/forward_context.py` to pass these activation-specific details through the model's forward pass. The engine processor and GPU model runner are updated to extract and utilize this metadata, including tokenizing invocation strings from adapter configurations.
- Testing and Examples: New example scripts (`examples/alora/alora_server_testing.py`, `alora_server_testing.sh`, `new_alora_testing.py`) are provided to demonstrate how to set up and interact with a vLLM server running aLoRA, both via the OpenAI-compatible API and directly through the vLLM Python API.
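To make the core masking logic described above more concrete, here is a minimal, self-contained sketch of the blending idea. This is not the PR's actual code: the names `k_offsets` and `query_start_locs` come from the summary above, but their exact shapes and semantics here are assumptions.

```python
# Sketch only: blend base-model output and LoRA-adapted output per token.
# Assumption: query_start_locs[i]..query_start_locs[i+1] delimits request i in the
# flattened batch, and k_offsets[i] is how many of its trailing tokens come after
# the invocation point (i.e. should use the LoRA-adapted output).
import torch


def blend_base_and_lora(
    base_out: torch.Tensor,          # [num_tokens, hidden]
    lora_out: torch.Tensor,          # [num_tokens, hidden]
    query_start_locs: torch.Tensor,  # [num_requests + 1]
    k_offsets: torch.Tensor,         # [num_requests]
) -> torch.Tensor:
    mask = torch.zeros(base_out.shape[0], 1,
                       dtype=base_out.dtype, device=base_out.device)
    for i in range(k_offsets.numel()):
        start = int(query_start_locs[i])
        end = int(query_start_locs[i + 1])
        # Only the last k_offsets[i] tokens of this request are "activated".
        mask[max(start, end - int(k_offsets[i])):end] = 1.0
    # Tokens before the invocation point keep the base output; the rest use LoRA.
    return base_out * (1.0 - mask) + lora_out * mask
```

In the real layer this would be vectorized and driven by the metadata described in the PR; the explicit loop here is only for readability.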
Code Review
This pull request introduces an initial implementation of Activated LoRA (aLoRA). The changes include adding new example scripts, modifying core components like the forward context, LoRA request, KV cache utilities, scheduler, processor, and GPU model runner to support aLoRA metadata extraction and application. The core logic for identifying the aLoRA invocation sequence and applying the mask seems correctly implemented. Feedback includes addressing a type mismatch in a metadata class, removing a debug print statement, and clarifying the purpose of layer registration in the compilation config.
Co-authored-by: Greenewald <[email protected]> Co-authored-by: Allison Li <[email protected]> Signed-off-by: Thomas Parnell <[email protected]>
This pull request has merge conflicts that must be resolved before it can be merged.
…th torch.compile Signed-off-by: Thomas Parnell <[email protected]>
`import regex as re`
(this is needed due to some linting issue on main currently, unrelated to this PR)
… checking activated lora flag Signed-off-by: Thomas Parnell <[email protected]>
Purpose
This PR adds support for Activated LoRA (aLoRA): a new family of LoRA adapters that are invoked by including an invocation string in the prompt, with the weights adapted only for the tokens that appear after the invocation string. This means that an aLoRA can be applied deep in a multi-turn interaction with the model without needing to recompute the entire KV cache: the adapter can reuse the base model's KV cache right up until the point where it is invoked, significantly reducing TTFT.
paper: https://arxiv.org/abs/2504.12397
blog: https://research.ibm.com/blog/inference-friendly-aloras-lora
Results from the paper: [figure omitted]
Implementation
We have tried to make the changes as unintrusive as possible (but happy to hear any suggestions for how the PR can be improved).
If one sets the `--enable-activated-lora` flag, then the following happens:
- The invocation tokens from the adapter configuration are tokenized, and the position where they start in the prompt is stored as `invocation_start` in the `lora_request` object.
- The `invocation_start` information is used to determine whether base-model KV cache blocks can be re-used.
- An `ALoRAMetadata` class is used to pass one mask tensor down to the LoRA layer.

We have tested that the integration works with: …
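For illustration, here is a minimal sketch of how such metadata and the KV-cache reuse boundary could look. The field names and the block arithmetic are assumptions based on the description above, not the PR's actual `ALoRAMetadata` or `kv_cache_utils` code.

```python
# Sketch only: assumed field names and simplified prefix-cache arithmetic.
from dataclasses import dataclass

import torch


@dataclass
class ALoRAMetadataSketch:
    """Per-batch activation info threaded down to the LoRA layers."""
    k_offsets: torch.Tensor          # [num_requests] tokens after each invocation point
    query_start_locs: torch.Tensor   # [num_requests + 1] flattened-batch boundaries


def reusable_prefix_blocks(invocation_start: int, block_size: int) -> int:
    """Full KV-cache blocks before the invocation point that hash identically
    to a base-model request and can therefore be shared."""
    return max(invocation_start, 0) // block_size


# Example: with 16-token blocks and the invocation string starting at token 70,
# the first 4 blocks (64 tokens) are reusable from the base-model cache.
assert reusable_prefix_blocks(70, 16) == 4
```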
Test Plan
We have included an offline example using an uncertainty detection aLoRA.
If the community would like to have this feature in vLLM, we are happy to add more extensive unit and integration tests.
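A rough sketch of what such an offline flow could look like with the vLLM Python API is shown below. The base model, adapter path, invocation string, and the `enable_activated_lora` engine argument are all assumptions inferred from the `--enable-activated-lora` flag above; see the scripts under `examples/alora/` in this PR for the actual usage.

```python
# Hedged sketch, not the PR's example script.
from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

llm = LLM(
    model="ibm-granite/granite-3.1-8b-instruct",  # placeholder base model
    enable_lora=True,
    enable_activated_lora=True,  # assumed spelling of the engine-level flag
)

# Multi-turn prompt; the trailing invocation string (placeholder here) activates
# the adapter only for the tokens that follow it.
prompt = (
    "<|user|>\nWhat is the capital of France?\n"
    "<|assistant|>\nParis.\n"
    "<|uncertainty|>"  # placeholder invocation string
)

outputs = llm.generate(
    [prompt],
    SamplingParams(temperature=0.0, max_tokens=8),
    lora_request=LoRARequest("uncertainty-alora", 1, "/path/to/alora-adapter"),
)
print(outputs[0].outputs[0].text)
```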
Test Result
I've included some debug print statements in the scheduler to illustrate explicitly the KV cache re-use when applying the aLoRA:
(Optional) Documentation Update
tbd