Support Mooncake migration backend for PD disaggregation #3620

Risc-lt · 2025-06-07T15:34:47Z

PD-Disaggregated KVCache Transfer Pipeline with Mooncake

This PR adds a new implementation of the Prefill-Decode (PD) disaggregated KVCache transfer pipeline to LMDeploy. It uses the native Mooncake transfer engine as an alternative migration backend to dlslime. The goal is to support disaggregated prefill/decode workloads across nodes for large-scale LLM inference, inspired by lmdeploy-distserve. #3304 (comment)

Architecture Overview

Interfaces

The Mooncake migration backend implementation exposes the following interfaces:

p2p_initialize: Notifies the Prefill and Decode engines to initialize a migration backend instance of the Mooncake transfer engine.
register_memory_region: Registers a memory region for the connection.
endpoint_info: Returns the local memory pool and endpoint configuration info.
p2p_connect: Receives endpoint information from the node on the other side of the connection.
p2p_migrate: Sets up the connection between the prefill and decode nodes and transfers the KVCache synchronously in read mode.
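The call order implied by the interface list above can be sketched in plain Python. This is only an illustration: the five method names come from this PR, but every signature, the EndpointInfo container, and the InMemoryBackend class are hypothetical stand-ins, and an in-process dictionary replaces the real Mooncake transfer engine and RDMA link.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field


@dataclass
class EndpointInfo:
    """Hypothetical container for what endpoint_info returns."""
    address: str
    region_sizes: dict = field(default_factory=dict)  # region id -> length in bytes


class MigrationBackend(ABC):
    """Method names follow the PR description; the signatures are assumptions."""

    @abstractmethod
    def p2p_initialize(self, local_address: str) -> None: ...

    @abstractmethod
    def register_memory_region(self, region_id: int, buffer: bytearray) -> None: ...

    @abstractmethod
    def endpoint_info(self) -> EndpointInfo: ...

    @abstractmethod
    def p2p_connect(self, remote: EndpointInfo) -> None: ...

    @abstractmethod
    def p2p_migrate(self, region_id: int, offset: int, length: int) -> int: ...


class InMemoryBackend(MigrationBackend):
    """Toy backend: a shared dict stands in for the network between nodes."""

    _registry = {}  # address -> backend instance

    def __init__(self):
        self.address = None
        self.buffers = {}
        self.remote = None

    def p2p_initialize(self, local_address):
        self.address = local_address
        InMemoryBackend._registry[local_address] = self

    def register_memory_region(self, region_id, buffer):
        self.buffers[region_id] = buffer

    def endpoint_info(self):
        sizes = {rid: len(buf) for rid, buf in self.buffers.items()}
        return EndpointInfo(self.address, sizes)

    def p2p_connect(self, remote):
        self.remote = remote

    def p2p_migrate(self, region_id, offset, length):
        # "Read mode": the caller pulls bytes from the peer's registered region.
        peer = InMemoryBackend._registry[self.remote.address]
        src = peer.buffers[region_id]
        self.buffers[region_id][offset:offset + length] = src[offset:offset + length]
        return length


def run_demo() -> bytes:
    prefill, decode = InMemoryBackend(), InMemoryBackend()
    prefill.p2p_initialize("prefill-node")
    decode.p2p_initialize("decode-node")
    prefill.register_memory_region(0, bytearray(b"kvcache!"))
    decode.register_memory_region(0, bytearray(8))
    # Each side learns about the other before any transfer happens.
    decode.p2p_connect(prefill.endpoint_info())
    prefill.p2p_connect(decode.endpoint_info())
    decode.p2p_migrate(region_id=0, offset=0, length=8)
    return bytes(decode.buffers[0])
```

Here the decode side issues the read, which is one plausible reading of "read mode"; the actual direction and arguments are defined by the PR's code, not by this sketch.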

Control Plane

The proxy server first issues a FastAPI POST request carrying the endpoint info to notify the prefill and decode servers, which then send their local endpoint info to each other over a TCP socket. Once the p2p connection is established, the Mooncake migration backend starts transferring the KVCache over the RDMA link.
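The TCP exchange of endpoint info described above could look roughly like the following. This is a minimal sketch under assumptions: the length-prefixed JSON framing, the function names, and the dictionary fields are all hypothetical, and socket.socketpair() stands in for a real connection between the prefill and decode servers. It does not show the FastAPI call or the RDMA transfer.

```python
import json
import socket
import struct


def send_endpoint_info(sock: socket.socket, info: dict) -> None:
    """Send one JSON message prefixed with its 4-byte big-endian length."""
    payload = json.dumps(info).encode("utf-8")
    sock.sendall(struct.pack("!I", len(payload)) + payload)


def _recv_exact(sock: socket.socket, n: int) -> bytes:
    """Read exactly n bytes, since recv may return fewer than requested."""
    chunks = []
    while n > 0:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed before the full message arrived")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def recv_endpoint_info(sock: socket.socket) -> dict:
    """Receive one length-prefixed JSON message."""
    (length,) = struct.unpack("!I", _recv_exact(sock, 4))
    return json.loads(_recv_exact(sock, length).decode("utf-8"))


def exchange(prefill_info: dict, decode_info: dict):
    """Both sides send their own info, then read the peer's."""
    prefill_sock, decode_sock = socket.socketpair()
    with prefill_sock, decode_sock:
        send_endpoint_info(prefill_sock, prefill_info)
        send_endpoint_info(decode_sock, decode_info)
        seen_by_prefill = recv_endpoint_info(prefill_sock)
        seen_by_decode = recv_endpoint_info(decode_sock)
    return seen_by_prefill, seen_by_decode
```

The explicit length prefix is a design choice for the sketch: TCP is a byte stream, so the receiver needs some way to know where one endpoint description ends.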

Workflow

Current Status

Functional validation done on A10, with eRDMA providing RoCEv2 support.
All basic PD workflows (initialize $\Rightarrow$ connect $\Rightarrow$ prefill $\Rightarrow$ migrate $\Rightarrow$ decode) work as they did with the previous dlslime version.

Next Steps

Check migration addresses to validate the quality of output tokens.
Improve KVCache transfer efficiency to surpass the dlslime version.
Remove unnecessary testing logs.

How to Build

pip install mooncake-transfer-engine
pip install -v -e .

How to Run

Start Proxy

lmdeploy serve proxy   --server-name <proxy-ip-address>   --server-port 8000   --routing-strategy "min_expected_latency"   --serving-strategy DistServe   --log-level INFO

Start Prefill Engine

CUDA_VISIBLE_DEVICES=0 lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct   --server-name <server-ip-address>  --server-port 23333  --role Prefill   --proxy-url <proxy-ip-address:port>  --backend pytorch  --migration-backend Mooncake

Start Decode Engine

CUDA_VISIBLE_DEVICES=1 lmdeploy serve api_server Qwen/Qwen2.5-7B-Instruct   --server-name <server-ip-address>  --server-port 23334   --role Decode   --proxy-url <proxy-ip-address:port>   --backend pytorch  --migration-backend Mooncake

Client Side

curl -X POST "<proxy-ip-address:port>/v1/completions" -H "Content-Type: application/json" -d '{"model":"Qwen/Qwen2.5-7B-Instruct","temperature":0,"prompt":"Shanghai is a city that ","max_tokens":16,"stream":false}'
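The same request can be built from Python with only the standard library. This is a sketch, not part of the PR: the helper name, the http:// scheme, and the 127.0.0.1:8000 address are assumptions standing in for <proxy-ip-address:port>, while the JSON body mirrors the curl command above.

```python
import json
import urllib.request


def build_completion_request(proxy_url: str) -> urllib.request.Request:
    """Build the POST request for the proxy's /v1/completions route."""
    body = {
        "model": "Qwen/Qwen2.5-7B-Instruct",
        "temperature": 0,
        "prompt": "Shanghai is a city that ",
        "max_tokens": 16,
        "stream": False,
    }
    return urllib.request.Request(
        url=proxy_url.rstrip("/") + "/v1/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    # Placeholder address; replace it with the real proxy address and port.
    request = build_completion_request("http://127.0.0.1:8000")
    with urllib.request.urlopen(request) as response:
        print(json.loads(response.read().decode("utf-8")))
```

Sending the request requires the proxy, prefill engine, and decode engine from the steps above to be running.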

stmatengss · 2025-06-08T08:45:59Z

@Risc-lt You can use tools like ruff or yapf to automatically fix code formatting and linting issues.

lvhan028 · 2025-06-08T12:04:37Z

The linting issue can be resolved by the following:

pip install pre-commit
cd lmdeploy # the root dir of lmdeploy repo
pre-commit run --all-files

Make sure that the Python version is 3.10.