
Support Mooncake migration backend for PD disaggregation #3620


Merged
merged 14 commits into from
Jun 20, 2025

Conversation

Risc-lt
Contributor

@Risc-lt Risc-lt commented Jun 7, 2025

PD-Disaggregated KVCache Transfer Pipeline with Mooncake

This PR adds a new implementation of the Prefill-Decode (PD) disaggregated KVCache transfer pipeline to LMDeploy. It uses the native Mooncake transfer engine as an alternative migration backend to dlslime. The goal is to enable disaggregated prefill/decode workloads across nodes for large-scale LLM inference, inspired by lmdeploy-distserve. #3304 (comment)


Architecture Overview

Interfaces

The Mooncake migration backend implementation exposes the following interfaces:

  • p2p_initialize: Notify the Prefill and Decode engines to initialize a migration backend instance of the Mooncake transfer engine.
  • register_memory_region: Register a memory region for the connection.
  • endpoint_info: Return the local memory pool and endpoint configuration info.
  • p2p_connect: Receive endpoint information from the node on the other side of the connection.
  • p2p_migrate: Set up the connection between the prefill and decode nodes and transfer the KVCache synchronously in read mode.
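
As a rough illustration of how these five entry points fit together, here is a minimal sketch. The class names, argument shapes, and the in-memory stand-in are all hypothetical and are not taken from the PR; only the method names come from the list above.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List, Optional


class MigrationBackendSketch(ABC):
    """Hypothetical interface; only the method names mirror the PR description."""

    @abstractmethod
    def p2p_initialize(self, init_request: Dict[str, Any]) -> None: ...

    @abstractmethod
    def register_memory_region(self, region: Dict[str, Any]) -> None: ...

    @abstractmethod
    def endpoint_info(self) -> Dict[str, Any]: ...

    @abstractmethod
    def p2p_connect(self, remote_info: Dict[str, Any]) -> None: ...

    @abstractmethod
    def p2p_migrate(self, assignment: Dict[str, Any]) -> None: ...


class InMemoryBackend(MigrationBackendSketch):
    """Toy stand-in that records calls instead of touching RDMA hardware."""

    def __init__(self) -> None:
        self.calls: List[str] = []
        self.regions: Dict[str, Dict[str, Any]] = {}
        self.peer: Optional[Dict[str, Any]] = None

    def p2p_initialize(self, init_request: Dict[str, Any]) -> None:
        self.calls.append("p2p_initialize")

    def register_memory_region(self, region: Dict[str, Any]) -> None:
        self.regions[region["name"]] = region
        self.calls.append("register_memory_region")

    def endpoint_info(self) -> Dict[str, Any]:
        self.calls.append("endpoint_info")
        return {"regions": sorted(self.regions)}

    def p2p_connect(self, remote_info: Dict[str, Any]) -> None:
        self.peer = remote_info
        self.calls.append("p2p_connect")

    def p2p_migrate(self, assignment: Dict[str, Any]) -> None:
        # A transfer only makes sense once the peer's endpoint info is known.
        if self.peer is None:
            raise RuntimeError("p2p_connect must be called before p2p_migrate")
        self.calls.append("p2p_migrate")
```

Calling `p2p_migrate` on a fresh `InMemoryBackend` raises, which mirrors the requirement that a connection exists before any KVCache is moved.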

Control Plane

[Image: lmdeploy drawio]

The proxy server first sends a FastAPI POST request carrying the endpoint info, which notifies the prefill and decode servers to send their local endpoint info to each other over a TCP socket. Once the P2P connection is established, the Mooncake migration backend starts transferring the KVCache over the RDMA link.
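
To make the endpoint exchange concrete, here is a small sketch of two peers swapping endpoint info over a socket. The length-prefixed JSON framing, the field names, and the use of `socket.socketpair()` in place of a real TCP connection are all assumptions for illustration; the PR does not describe its wire format.

```python
import json
import socket
import struct
from typing import Any, Dict, Tuple


def send_endpoint_info(sock: socket.socket, info: Dict[str, Any]) -> None:
    """Send one JSON message prefixed with its 4-byte big-endian length."""
    payload = json.dumps(info).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_exact(sock: socket.socket, size: int) -> bytes:
    """Read exactly `size` bytes, since recv may return fewer."""
    buf = bytearray()
    while len(buf) < size:
        chunk = sock.recv(size - len(buf))
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        buf.extend(chunk)
    return bytes(buf)


def recv_endpoint_info(sock: socket.socket) -> Dict[str, Any]:
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))


def demo_exchange() -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Both sides send their (made-up) endpoint info, then read the peer's."""
    prefill_sock, decode_sock = socket.socketpair()
    try:
        send_endpoint_info(prefill_sock, {"role": "Prefill", "endpoint": "hypothetical-a"})
        send_endpoint_info(decode_sock, {"role": "Decode", "endpoint": "hypothetical-b"})
        return recv_endpoint_info(prefill_sock), recv_endpoint_info(decode_sock)
    finally:
        prefill_sock.close()
        decode_sock.close()
```

In this sketch each side learns the other's info before any transfer would start, which matches the ordering described above: connect first, migrate afterwards.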

Workflow

[Image: lmdeploy2 drawio]


Current Status

  • Functionally validated on A10 with eRDMA providing RoCEv2 support.
  • All basic PD workflows (initialize $\Rightarrow$ connect $\Rightarrow$ prefill $\Rightarrow$ migrate $\Rightarrow$ decode) work as they did with the previous dlslime version.
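
The stage order named in the workflow above can be sketched as a tiny tracker that rejects out-of-order steps. This is only an illustration for a single request on a fresh link; `Stage` and `StageTracker` are hypothetical names, not part of LMDeploy.

```python
from enum import Enum
from typing import List


class Stage(Enum):
    INITIALIZE = "initialize"
    CONNECT = "connect"
    PREFILL = "prefill"
    MIGRATE = "migrate"
    DECODE = "decode"


# Enum iteration follows definition order, which is the workflow order here.
ORDER: List[Stage] = list(Stage)


class StageTracker:
    """Accepts stages only in the order initialize, connect, prefill, migrate, decode."""

    def __init__(self) -> None:
        self.completed: List[Stage] = []

    @property
    def done(self) -> bool:
        return len(self.completed) == len(ORDER)

    def advance(self, stage: Stage) -> None:
        if self.done:
            raise RuntimeError("workflow already finished")
        expected = ORDER[len(self.completed)]
        if stage is not expected:
            raise RuntimeError(f"expected {expected.value!r}, got {stage.value!r}")
        self.completed.append(stage)
```

A check like this can help when validating a new backend, because a migrate step issued before connect fails loudly instead of silently reading from an unconnected peer.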

Next Steps

  • Check migration addresses to validate the quality of output tokens.
  • Improve KVCache transfer efficiency to surpass the dlslime version.
  • Remove unnecessary testing logs.

How to Build

pip install mooncake-transfer-engine
pip install -v -e .

How to Run

Start Proxy

lmdeploy serve proxy \
  --server-name <proxy-ip-address> \
  --server-port 8000 \
  --routing-strategy "min_expected_latency" \
  --serving-strategy DistServe \
  --log-level INFO

Start Prefill Engine

CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct \
  --server-name <server-ip-address> \
  --server-port 23333 \
  --role Prefill \
  --proxy-url <proxy-ip-address:port> \
  --backend pytorch \
  --migration-backend Mooncake

Start Decode Engine

CUDA_VISIBLE_DEVICES=1 lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct \
  --server-name <server-ip-address> \
  --server-port 23334 \
  --role Decode \
  --proxy-url <proxy-ip-address:port> \
  --backend pytorch \
  --migration-backend Mooncake

Client Side

curl -X POST "<proxy-ip-address:port>/v1/completions" -H "Content-Type: application/json" -d '{"model":"Qwen/Qwen2.5-7B-Instruct","temperature":0,"prompt":"Shanghai is a city that ","max_tokens":16,"stream":false}'
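
For scripted testing, the same request can be issued from Python using only the standard library. This is a sketch: the `http://` scheme and the helper names are assumptions, while the path and JSON body are copied from the curl command above.

```python
import json
import urllib.request
from typing import Any, Dict


def build_completion_request(proxy_address: str, prompt: str,
                             max_tokens: int = 16) -> urllib.request.Request:
    """Build the POST request; proxy_address is "<proxy-ip-address:port>"."""
    body = {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "temperature": 0,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "stream": False,
    }
    return urllib.request.Request(
        url=f"http://{proxy_address}/v1/completions",  # scheme is an assumption
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 60.0) -> Dict[str, Any]:
    """Send the request to a running proxy and decode the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the proxy and both engines running, `send(build_completion_request("<proxy-ip-address:port>", "Shanghai is a city that "))` would return the parsed response.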

@stmatengss

@Risc-lt You can use tools like ruff or yapf to automatically fix code formatting and linting issues.

@lvhan028
Collaborator

lvhan028 commented Jun 8, 2025

The linting issue can be resolved by the following:

pip install pre-commit
cd lmdeploy # the root dir of lmdeploy repo
pre-commit run --all-files

Make sure that the Python version is 3.10.

@lvhan028 lvhan028 added the enhancement New feature or request label Jun 9, 2025
"""Initialize p2p connection for this specific link."""
# TODO: Support more types of metadata_server
# e.g. "etcd://192.168.0.137:2379"
metadata_server = 'P2PHANDSHAKE'
Collaborator

What is metadata_server used for?

There are two modes: (1) 'P2PHANDSHAKE' (a magic string): no metadata server for maintaining connection information, which is intended for small-scale PD disaggregation, and (2) support for etcd/redis/http_server as the centralized server for larger-scale PD disaggregation.
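
The two modes can be told apart with a small check like the one below. This is a sketch only: the helper name, the returned labels, and the exact set of accepted URL schemes are assumptions; the source confirms only the 'P2PHANDSHAKE' string and the "etcd://..." example.

```python
from urllib.parse import urlparse

P2P_HANDSHAKE = "P2PHANDSHAKE"
# Assumed schemes for the centralized mode; only "etcd://" appears in the PR.
CENTRALIZED_SCHEMES = {"etcd", "redis", "http"}


def metadata_mode(metadata_server: str) -> str:
    """Classify a metadata_server value as handshake-only or centralized."""
    if metadata_server == P2P_HANDSHAKE:
        # No metadata server: peers exchange connection info directly.
        return "p2p_handshake"
    scheme = urlparse(metadata_server).scheme
    if scheme in CENTRALIZED_SCHEMES:
        return "centralized"
    raise ValueError(f"unrecognized metadata_server value: {metadata_server!r}")
```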

try:
    from mooncake.engine import TransferEngine
except ImportError as e:
    raise ImportError('Please install mooncake by following the instructions at '
Collaborator

When passing --migration-backend Mooncake, it's better to raise an import error immediately if Mooncake is not installed during API server launch.
Can we put it in the constructor of MooncakeBackend?
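
One way to get this fail-fast behavior is to check for the module when the backend object is constructed. The sketch below is not the PR's code: `require_module` and `MooncakeBackendSketch` are hypothetical names, and the module to check is a parameter so the example can run without Mooncake installed.

```python
import importlib.util


def require_module(module_name: str, install_hint: str) -> None:
    """Raise ImportError right away if module_name cannot be found."""
    try:
        spec = importlib.util.find_spec(module_name)
    except ModuleNotFoundError:
        # Raised for dotted names when the parent package is missing.
        spec = None
    if spec is None:
        raise ImportError(
            f"{module_name!r} is required for this migration backend. "
            f"Install it with: {install_hint}"
        )


class MooncakeBackendSketch:
    """Hypothetical backend whose constructor checks its dependency up front."""

    def __init__(self, required_module: str = "mooncake.engine") -> None:
        # Runs at construction time, so a missing package surfaces at launch
        # rather than on the first migration request.
        require_module(required_module, "pip install mooncake-transfer-engine")
```

Checking in the constructor means the error appears as soon as the server builds its backend, which is the behavior the reviewer asks for.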

@Risc-lt
Contributor Author

Risc-lt commented Jun 14, 2025

I have resolved the issues above. NVLink support will be covered in the next PR. cc @stmatengss @lvhan028 @JimyMa

@lvhan028 lvhan028 merged commit 6cc314a into InternLM:main Jun 20, 2025
5 checks passed
Labels
enhancement New feature or request
4 participants