This scheduler makes optimized routing decisions for inference requests to the llm-d inference framework.
It provides an "Endpoint Picker" (EPP) component that schedules incoming inference requests to the platform via a Kubernetes Gateway, according to configurable scheduler plugins (for more details, see the Architecture Documentation).
The EPP extends the Gateway API Inference Extension (GIE) project, which provides the API resources and machinery for scheduling. We add some custom features that are specific to llm-d here, such as P/D Disaggregation.
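As a rough illustration of the plugin model, a scorer-style plugin ranks candidate endpoints and the scheduler picks the best one. The types, names, and weights below are hypothetical (this is not the actual GIE plugin API); it is only a sketch of the idea:

```go
package main

import "fmt"

// Endpoint is a simplified stand-in for a model-serving pod the
// scheduler can route to (hypothetical type, for illustration only).
type Endpoint struct {
	Name       string
	QueueDepth int  // number of queued requests on this pod
	PrefixHit  bool // whether this pod has the prompt prefix cached
}

// Scorer sketches a scheduler plugin: it assigns each candidate
// endpoint a score, where higher is better.
type Scorer interface {
	Score(e Endpoint) float64
}

// loadAwareScorer prefers shorter queues and rewards prefix-cache
// hits, loosely mirroring the kinds of signals scheduler plugins
// consider (the weights here are made up).
type loadAwareScorer struct{}

func (loadAwareScorer) Score(e Endpoint) float64 {
	score := -float64(e.QueueDepth)
	if e.PrefixHit {
		score += 10
	}
	return score
}

// pickEndpoint returns the highest-scoring endpoint.
func pickEndpoint(s Scorer, endpoints []Endpoint) Endpoint {
	best := endpoints[0]
	bestScore := s.Score(best)
	for _, e := range endpoints[1:] {
		if sc := s.Score(e); sc > bestScore {
			best, bestScore = e, sc
		}
	}
	return best
}

func main() {
	eps := []Endpoint{
		{Name: "pod-a", QueueDepth: 8},
		{Name: "pod-b", QueueDepth: 3, PrefixHit: true},
		{Name: "pod-c", QueueDepth: 1},
	}
	// pod-b wins: its cache hit outweighs its slightly longer queue.
	fmt.Println(pickEndpoint(loadAwareScorer{}, eps).Name) // prints "pod-b"
}
```

In the real system, multiple plugins of this kind are combined; llm-d-specific plugins (e.g. for P/D disaggregation) plug into the same scheduling cycle as the general-purpose GIE ones.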
A compatible Gateway API implementation serves as the Gateway. That implementation must be Envoy-based and support ext-proc (external processing), as this is currently the callback mechanism the EPP relies on to make routing decisions to model-serving workloads.
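Conceptually, the ext-proc exchange works like this: for each request, the gateway hands the request headers to the EPP, and the EPP answers with a header mutation naming the destination endpoint. The sketch below models only that shape; the header name and function signature are illustrative assumptions, not the real Envoy ext-proc gRPC API or the llm-d implementation:

```go
package main

import "fmt"

// processRequest models the EPP side of one ext-proc callback: given
// the incoming request headers and the candidate endpoints, it returns
// a header mutation telling the gateway where to route the request.
// (Hypothetical header name and selection logic, for illustration.)
func processRequest(headers map[string]string, endpoints []string) map[string]string {
	// A real EPP would run its scheduler plugins here; we just pick
	// the first endpoint as a placeholder.
	dest := endpoints[0]
	return map[string]string{"x-destination-endpoint": dest}
}

func main() {
	mutation := processRequest(
		map[string]string{":path": "/v1/completions"},
		[]string{"10.0.0.12:8000", "10.0.0.13:8000"},
	)
	// The gateway then forwards the request to the endpoint named here.
	fmt.Println(mutation["x-destination-endpoint"]) // prints "10.0.0.12:8000"
}
```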
Contributions are welcome!
For large changes please create an issue first describing the change so the maintainers can do an assessment, and work on the details with you. See DEVELOPMENT.md for details on how to work with the codebase.
Note that in general, features should go to the upstream Gateway API Inference Extension (GIE) project first if applicable. The GIE is a major dependency of ours, and it is where most general-purpose inference features live. If you have a feature that you feel is general purpose, it probably belongs in the GIE; if it is llm-d specific, it belongs here. If you're not sure which project your feature belongs in, feel free to create a discussion or ask on Slack.