Releases: vllm-project/vllm
v0.9.2
Highlights
This release contains 452 commits from 167 contributors (31 new!)
NOTE: This is the last version where the V0 engine code and features stay intact. We highly recommend migrating to the V1 engine.
Engine Core
- Priority Scheduling is now implemented in the V1 engine (#19057), along with embedding models in V1 (#16188) and Mamba2 in V1 (#19327); see the sketch after this list.
- Full CUDA-Graph execution is now available for all FlashAttention v3 (FA3) and FlashMLA paths, including prefix caching. CUDA graph capture now shows a live progress bar, which makes debugging easier (#20301, #18581, #19617, #19501).
- FlexAttention updates: support for any head size and an FP32 fallback (#20467, #19754).
- Shared `CachedRequestData` objects and cached sampler-ID stores deliver performance enhancements (#20232, #20291).
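
As a rough illustration of how priority scheduling can be exercised from the offline API: a minimal sketch, assuming the `scheduling_policy="priority"` engine argument and the per-request `priority` parameter of `LLM.generate` behave as in earlier releases (lower values scheduled earlier); the model name is a placeholder.

```python
from vllm import LLM, SamplingParams

# Placeholder model; scheduling_policy="priority" enables the priority scheduler.
llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct", scheduling_policy="priority")
params = SamplingParams(max_tokens=32)

prompts = ["Summarize priority scheduling in one line.", "Write a haiku about GPUs."]
# One priority value per prompt; here the second request is asked to go first
# (assumption: lower value = scheduled earlier).
outputs = llm.generate(prompts, params, priority=[10, 0])
for out in outputs:
    print(out.outputs[0].text)
```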
Model Support
- New families: Ernie 4.5 (+MoE) (#20220), MiniMax‑M1 (#19677, #20297), Slim‑MoE “Phi‑tiny‑MoE‑instruct” (#20286), Tencent HunYuan‑MoE‑V1 (#20114), Keye‑VL‑8B‑Preview (#20126), GLM‑4.1 V (#19331), Gemma‑3 (text‑only, #20134), Tarsier 2 (#19887), Qwen 3 Embedding & Reranker (#19260), dots1 (#18254), GPT‑2 for Sequence Classification (#19663).
- Granite hybrid MoE configurations with shared experts are fully supported (#19652).
Large‑Scale Serving & Engine Improvements
- Expert‑Parallel Load Balancer (EPLB) has been added! (#18343, #19790, #19885).
- Disaggregated serving enhancements: avoid stranding blocks in the prefill instance when a request is aborted in the decode instance's waiting queue (#19223), and let the toy proxy handle `/chat/completions` (#19730).
- Native xPyD P2P NCCL transport as a base case for native prefill/decode disaggregation without external dependencies (#18242, #20246).
Hardware & Performance
- NVIDIA Blackwell
- Intel GPU (V1) backend with Flash‑Attention support (#19560).
- AMD ROCm: full‑graph capture for TritonAttention, quick All‑Reduce, and chunked pre‑fill (#19158, #19744, #18596).
- TPU: dynamic‑grid KV‑cache updates, head‑dim less than 128, tuned paged‑attention kernels, and KV‑padding fixes (#19928, #20235, #19620, #19813, #20048, #20339).
- Added a model and feature support matrix (#20230).
Quantization
- Calibration-free RTN INT4/INT8 pipeline for effortless, accurate compression (#18768); see the sketch after this list.
- Compressed-tensors NVFP4 (including MoE) plus emulation; FP4 emulation removed on devices below SM100 (#19879, #19990, #19563).
- Dynamic MoE‑layer quant (Marlin/GPTQ) and INT8 vectorization primitives (#19395, #20331, #19233).
- bitsandbytes 0.45+ with improved double-quant logic and AWQ quality (#20424, #20033, #19431, #20076).
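
A hedged illustration of the calibration-free RTN path: the sketch below assumes the method is selected with `quantization="rtn"`; the model name is a placeholder and the exact selector may differ in your build, so check the quantization docs before relying on it.

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    quantization="rtn",  # assumed selector for the calibration-free RTN path
    dtype="bfloat16",
)
out = llm.generate(["The capital of France is"], SamplingParams(max_tokens=8))
print(out[0].outputs[0].text)
```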
API · CLI · Frontend
- API Server: Eliminate api_key and x_request_id headers middleware overhead (#19946)
- New OpenAI-compatible endpoints: `/v1/audio/translations` and a revamped `/v1/audio/transcriptions` (#19615, #20179, #19597); see the sketch after this list.
- Token-level progress bar for `LLM.beam_search` and cached template-resolution speed-ups (#19301, #20065).
- Image-object support in `llm.chat`, tool-choice expansion, and custom-arg passthroughs enrich multi-modal agents (#19635, #17177, #16862).
- CLI QoL: better parsing for `-O`/`--compilation-config`, batch-size-sweep benchmarking, richer `--help`, faster startup (#20156, #20516, #20430, #19941).
- Metrics: deprecate metrics with the `gpu_` prefix for non-GPU-specific metrics (#18354); export NaNs in logits to `scheduler_stats` if output is corrupted (#18777).
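
The new audio endpoints can be reached with the standard `openai` client pointed at a local vLLM server. This is a hedged sketch: the model name and audio file are placeholders; only the endpoint paths come from the notes above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    # Revamped transcription endpoint (/v1/audio/transcriptions)
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # placeholder: any ASR model served by vLLM
        file=audio,
    )
print(transcript.text)

with open("sample.wav", "rb") as audio:
    # New translation endpoint (/v1/audio/translations)
    translation = client.audio.translations.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(translation.text)
```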
Platform & Deployment
- No‑privileged CPU / Docker / K8s mode (#19241) and custom default max‑tokens for hosted platforms (#18557).
- Security hardening – runtime (cloud)pickle imports forbidden (#18018).
- Hermetic builds and wheel slimming (FA2 8.0 + PTX only) shrink supply‑chain surface (#18064, #19336).
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
- [doc] Make top navigatio...
v0.9.2rc2
What's Changed
- [Kernel] Enable fp8 support for pplx and BatchedTritonExperts. by @bnellnm in #18864
- [Misc] Fix `Unable to detect current VLLM config. Defaulting to NHD kv cache layout` warning by @NickLucche in #20400
- [Bugfix] Register reducer even if transformers_modules not available by @eicherseiji in #19510
- Change warn_for_unimplemented_methods to debug by @mgoin in #20455
- [Platform] Add custom default max tokens by @gmarinho2 in #18557
- Add ignore consolidated file in mistral example code by @princepride in #20420
- [Misc] small update by @reidliu41 in #20462
- [Structured Outputs][V1] Skipping with models doesn't contain tokenizers by @aarnphm in #20365
- [Perf] Optimize Vectorization Utils for Int 8 Quantization Kernels by @yewentao256 in #20331
- [Misc] Add SPDX-FileCopyrightText by @jeejeelee in #20428
- Support Llama 4 for fused_marlin_moe by @mgoin in #20457
- [Bug][Frontend] Fix structure of transcription's decoder_prompt by @sangbumlikeagod in #18809
- [Model][3/N] Automatic conversion of CrossEncoding model by @noooop in #20168
- [Doc] Fix classification table in list of supported models by @DarkLight1337 in #20489
- [CI] add kvcache-connector dependency definition and add into CI build by @panpan0000 in #18193
- [Misc] Small: Remove global media connector. Each test should have its own test connector object. by @huachenheli in #20395
- Enable V1 for Hybrid SSM/Attention Models by @tdoublep in #20016
- [feat]: CUTLASS block scaled group gemm for SM100 by @djmmoss in #19757
- [CI Bugfix] Fix pre-commit failures on main by @mgoin in #20502
- [Doc] fix mutltimodal_inputs.md gh examples link by @GuyStone in #20497
- [Misc] Add security warning for development mode endpoints by @reidliu41 in #20508
- [doc] small fix by @reidliu41 in #20506
- [Misc] Remove the unused LoRA test code by @jeejeelee in #20494
- Fix unknown attribute of topk_indices_dtype in CompressedTensorsW8A8Fp8MoECutlassMethod by @luccafong in #20507
- [v1] Re-add fp32 support to v1 engine through FlexAttention by @Isotr0py in #19754
- [Misc] Add logger.exception for TPU information collection failures by @reidliu41 in #20510
- [Misc] remove unused import by @reidliu41 in #20517
- test_attention compat with coming xformers change by @bottler in #20487
- [BUG] Fix #20484. Support empty sequence in cuda penalty kernel by @vadiklyutiy in #20491
- [Bugfix] Fix missing per_act_token parameter in compressed_tensors_moe by @luccafong in #20509
- [BugFix] Fix: ImportError when building on hopper systems by @LucasWilkinson in #20513
- [TPU][Bugfix] fix the MoE OOM issue by @yaochengji in #20339
- [Frontend] Support image object in llm.chat by @sfeng33 in #19635
- [Benchmark] Add support for multiple batch size benchmark through CLI in `benchmark_moe.py` + Add Triton Fused MoE kernel config for FP8 E=16 on B200 by @b8zhong in #20516
- [Misc] call the pre-defined func by @reidliu41 in #20518
- [V0 deprecation] Remove V0 CPU/XPU/TPU backends by @WoosukKwon in #20412
- [V1] Support any head size for FlexAttention backend by @DarkLight1337 in #20467
- [BugFix][Spec Decode] Fix spec token ids in model runner by @WoosukKwon in #20530
- [Bugfix] Add `use_cross_encoder` flag to use correct activation in `ClassifierPooler` by @DarkLight1337 in #20527
New Contributors
- @sangbumlikeagod made their first contribution in #18809
- @djmmoss made their first contribution in #19757
- @GuyStone made their first contribution in #20497
- @bottler made their first contribution in #20487
Full Changelog: v0.9.2rc1...v0.9.2rc2
v0.9.2rc1
What's Changed
- [Docs] Note that alternative structured output backends are supported by @russellb in #19426
- [ROCm][V1] Adding ROCm to the list of plaforms using V1 by default by @gshtras in #19440
- [Model] use AutoWeightsLoader for commandr by @py-andy-c in #19399
- Add H20-3e fused MoE kernel tuning configs for Qwen3-235B-A22B-FP8 by @Xu-Wenqing in #19401
- [BugFix] Allow use_cudagraph to work with dynamic VLLM_USE_V1 by @zou3519 in #19390
- [New Model]: Support Qwen3 Embedding & Reranker by @noooop in #19260
- [BugFix] Fix docker build cpu-dev image error by @2niuhe in #19394
- Fix test_max_model_len in tests/entrypoints/llm/test_generate.py by @houseroad in #19451
- [CI] Disable failing GGUF model test by @mgoin in #19454
- [Misc] Remove unused `MultiModalHasher.hash_prompt_mm_data` by @lgeiger in #19422
- Add fused MOE config for Qwen3 30B A3B on B200 by @0xjunhao in #19455
- Fix Typo in Documentation and Function Name by @leopardracer in #19442
- [ROCm] Add rules to automatically label ROCm related PRs by @houseroad in #19405
- [Kernel] Support deep_gemm for linear methods by @artetaout in #19085
- [Doc] Update V1 User Guide for Hardware and Models by @DarkLight1337 in #19474
- [Doc] Fix quantization link titles by @DarkLight1337 in #19478
- [Doc] Support "important" and "announcement" admonitions by @DarkLight1337 in #19479
- [Misc] Reduce warning message introduced in env_override by @houseroad in #19476
- Support non-string values in JSON keys from CLI by @DarkLight1337 in #19471
- Add cache to cuda get_device_capability by @mgoin in #19436
- Fix some typo by @Ximingwang-09 in #19475
- Support no privileged mode on CPU for docker and kubernetes deployments by @louie-tsai in #19241
- [Bugfix] Update the example code, make it work with the latest lmcache by @runzhen in #19453
- [CI] Update FlashInfer to 0.2.6.post1 by @mgoin in #19297
- [doc] fix "Other AI accelerators" getting started page by @davidxia in #19457
- [Misc] Fix misleading ROCm warning by @jeejeelee in #19486
- [Docs] Remove WIP features in V1 guide by @WoosukKwon in #19498
- [Kernels] Add activation chunking logic to FusedMoEModularKernel by @bnellnm in #19168
- [AMD] [Quantization] Add override flag for attention dtype instead of using kv_cache_dtype trigger by @rasmith in #17331
- [UX] Add Feedback During CUDAGraph Capture by @robertgshaw2-redhat in #19501
- [CI/Build] Fix torch nightly CI dependencies by @zou3519 in #19505
- [CI] change spell checker from codespell to typos by @andyxning in #18711
- [BugFix] Force registration of w8a8_block_fp8_matmul_deepgemm via lazy import by @varun-sundar-rabindranath in #19514
- Add Triton Fused MoE kernel config for E=16 on B200 by @b8zhong in #19518
- [Frontend] Improve error message in tool_choice validation by @22quinn in #19239
- [BugFix] Work-around incremental detokenization edge case error by @njhill in #19449
- [BugFix] Handle missing sep_token for Qwen3-Reranker in Score API by @strutive07 in #19522
- [AMD][Kernel][BugFix] fix test_rocm_compressed_tensors_w8a8 for rocm by @rasmith in #19509
- Fix typo by @2niuhe in #19525
- [Security] Prevent new imports of (cloud)pickle by @russellb in #18018
- [Bugfix][V1] Allow manual FlashAttention for Blackwell by @mgoin in #19492
- [Bugfix] Respect num-gpu-blocks-override in v1 by @jmswen in #19503
- [Quantization] Improve AWQ logic by @jeejeelee in #19431
- [Doc] Add V1 column to supported models list by @DarkLight1337 in #19523
- [NixlConnector] Drop `num_blocks` check by @NickLucche in #19532
- [Perf] Vectorize static / dynamic INT8 quant kernels by @yewentao256 in #19233
- Fix TorchAOConfig skip layers by @mobicham in #19265
- [torch.compile][ROCm] Fuse quantization onto attention using a torch.compile pass by @ProExpertProg in #16756
- [doc] Make top navigation sticky by @reidliu41 in #19540
- [Spec Decode][Benchmark] Generalize spec decode offline benchmark to more methods and datasets by @ekagra-ranjan in #18847
- [Misc] Turn MOE_DP_CHUNK_SIZE into an env var by @varun-sundar-rabindranath in #19506
- [Bugfix] Enforce contiguous input for dynamic_per_token FP8/INT8 quant by @mgoin in #19452
- [Doc] Unify structured outputs examples by @aarnphm in #18196
- [V1] Resolve failed concurrent structred output requests by @russellb in #19565
- Revert "[Build/CI] Add tracing deps to vllm container image (#15224)" by @kouroshHakha in #19378
- [BugFix] : Fix Batched DeepGemm Experts by @varun-sundar-rabindranath in #19515
- [Bugfix] Fix EAGLE vocab embedding for multimodal target model by @zixi-qi in #19570
- [Doc] uses absolute links for structured outputs by @aarnphm in #19582
- [doc] fix incorrect link by @reidliu41 in #19586
- [Misc] Correct broken docs link by @Zerohertz in #19553
- [CPU] Refine default config for the CPU backend by @bigPYJ1151 in #19539
- [Fix] bump mistral common to support magistral by @princepride in #19533
- [Fix] The zip function in Python 3.9 does not have the strict argument by @princepride in #19549
- use base version for version comparison by @BoyuanFeng in #19587
- [torch.compile] reorganize the cache directory to support compiling multiple models by @youkaichao in #19064
- [BugFix] Honor `enable_caching` in connector-delayed kvcache load case by @njhill in #19435
- [Model] Fix minimax model cache & lm_head precision by @qscqesze in #19592
- [Refactor] Remove unused variables in `moe_permute_unpermute_kernel.inl` by @yewentao256 in #19573
- [doc][mkdocs] fix the duplicate Supported features sections in GPU docs by @reidliu41 in #19606
- [CUDA] Enable full cudagraph for FlashMLA by @ProExpertProg in #18581
- [Doc] Add troubleshooting section to k8s deployment by @annapendleton in #19377
- [torch.compile] Use custom ops when use_inductor=False by @WoosukKwon in #19618
- Adding "AMD: Multi-step Tests" to amdproduction. by @Concurrensee in #19508
- [BugFix] Fix DP Coordinator incorrect debug log message by @njhill in #19624
- [V1][Metrics] Deprecate metrics with gpu_ prefix for non GPU specific metrics. by @sahelib25 in #18354
- [Bugfix][1/n] Fix the speculative decoding test by setting the target dtype by @houseroad in #19633
- [Misc] Modularize CLI Argument Parsing in Benchmark Scripts by @reidliu41 in #19593
- [Bugfix] Fix auto dtype casting for BatchFeature by @Isotr0py in #19316
- [Hardware][NVIDIA][kernel] Fp4 MOE quant kernel optimization by @jiahanc in #19500
- Only build CUTLASS MoE kernels on Hopper by @huydhn in #19648
- [Bugfix] Don't attempt to use triton if no driver is active by @kzawora-intel in #19561
- [Fix] Convert kv_transfer_config from dict to KVTransferConfig by @maobaolong in #19262
- [Perf...
v0.9.1
Highlights
This release features 274 commits from 123 contributors (27 new contributors!)
- Progress in large scale serving
- DP Attention + Expert Parallelism: CUDA graph support (#18724), DeepEP dispatch-combine kernel (#18434), batched/masked DeepGEMM kernel (#19111), CUTLASS MoE kernel with PPLX (#18762)
- Heterogeneous TP (#18833), NixlConnector Enable FlashInfer backend (#19090)
- DP: API-server scaleout with many-to-many server-engine comms (#17546), Support DP with Ray (#18779), allow AsyncLLMEngine.generate to target a specific DP rank (#19102), data parallel rank to KVEventBatch (#18925)
- Tooling: Simplify EP kernels installation (#19412)
- RLHF workflow: Support inplace model weights loading (#18745)
- Initial full support for Hybrid Memory Allocator (#17996), support cross-layer KV sharing (#18212)
- Add FlexAttention to vLLM V1 (#16078)
- Various production-hardening fixes related to full CUDA graph mode (#19171, #19106, #19321)
Model Support
- Support Magistral (#19193), LoRA support for InternVL (#18842), minicpm eagle support (#18943), NemotronH support (#18863, #19249)
- Enable data parallel for Llama4 vision encoder (#18368)
- Add DeepSeek-R1-0528 function call chat template (#18874)
Hardware Support & Performance Optimizations
- Add H20-3e fused MoE kernel tuning configs for DeepSeek-R1/V3 (#19205), Qwen3-235B-A22B (#19315)
- Blackwell: Add Cutlass MLA backend (#17625), Tunings for SM100 FP8 CUTLASS kernel (#18778), Use FlashInfer by default on Blackwell GPUs (#19118), Tune `scaled_fp8_quant` by increasing vectorization (#18844)
- FP4: Add compressed-tensors NVFP4 support (#18312), FP4 MoE kernel optimization (#19110)
- CPU: V1 support for the CPU backend (#16441)
- ROCm: Add AITER grouped topk for DeepSeekV2 (#18825)
- POWER: Add IBM POWER11 Support to CPU Extension Detection (#19082)
- TPU: Initial support of model parallelism with single worker using SPMD (#18011), Multi-LoRA Optimizations for the V1 TPU backend (#15655)
- Neuron: Add multi-LoRA support for Neuron. (#18284), Add Multi-Modal model support for Neuron (#18921), Support quantization on neuron (#18283)
- Platform: Make torch distributed process group extendable (#18763)
Engine features
- Add Lora Support to Beam Search (#18346)
- Add rerank support to run_batch endpoint (#16278)
- CLI: add run batch (#18804)
- Server: custom logging (#18403), allowed_token_ids in ChatCompletionRequest (#19143)
- `LLM` API: make use_tqdm accept a callable for custom progress bars (#19357); see the sketch after this list
- Perf: [Kernel] Sampler: CUDA kernel for applying repetition penalty (#18437)
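
A minimal sketch of the `use_tqdm` hook mentioned above: instead of a boolean, pass a tqdm-compatible factory to customize the offline progress bar. The model name is a placeholder, and the exact callable signature may vary by build.

```python
from functools import partial

from tqdm import tqdm
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
custom_bar = partial(tqdm, desc="generating", unit="req", colour="green")

outputs = llm.generate(
    ["Hello!", "How are you?"],
    SamplingParams(max_tokens=16),
    use_tqdm=custom_bar,  # previously only True/False was accepted
)
```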
API Deprecations
- Disallow pos-args other than `model` when initializing `LLM` (#18802)
- Remove `inputs` arg fallback in Engine classes (#18799)
- Remove fallbacks for Embeddings API (#18795)
- Remove mean pooling default for `Qwen2EmbeddingModel` (#18913)
- Require overriding `get_dummy_text` and `get_dummy_mm_data` (#18796)
- Remove metrics that were deprecated in 0.8 (#18837)
Documentation
- Add CLI doc (#18871)
- Update SECURITY.md with link to our security guide (#18961), Add security warning to bug report template (#19365)
What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton...
v0.9.1rc1
What's Changed
- [CI/Build] [TPU] Fix TPU CI exit code by @CAROLZXYZXY in #18282
- [Neuron] Support quantization on neuron by @aws-satyajith in #18283
- Support datasets in `vllm bench serve` and sync with benchmark_[serving,datasets].py by @mgoin in #18566
- [Bugfix] Disable prefix caching by default for benchmark by @cascade812 in #18771
- [Build] Fixes for CMake install by @ProExpertProg in #18570
- [Core] Improve Tensor serialisation by @lgeiger in #18774
- [rocm] Fix wrong attention log by @fxmarty-amd in #18764
- [Bugfix] Fix nomic max_model_len by @noooop in #18755
- [Bugfix]: correctly propagate errors message caught at the chat_templating step to the client by @gcalmettes in #18769
- [V1] fix torch profiling for V1 offline scenarios by @divakar-amd in #18445
- [V1] [Bugfix] eagle bugfix and enable correct lm_head for multimodal (2) by @RonaldBXu in #18781
- [Bugfix][FailingTest]Fix test_model_load_with_params.py by @rabi in #18758
- [Deprecation] Require overriding `get_dummy_text` and `get_dummy_mm_data` by @DarkLight1337 in #18796
- [Deprecation] Remove unused sync methods in `async_timeout` by @DarkLight1337 in #18792
- [Deprecation] Remove fallbacks for Embeddings API by @DarkLight1337 in #18795
- [CI] improve embed testing by @noooop in #18747
- Fix PiecewiseCompileInterpreter by @zou3519 in #17338
- [BugFix] FA2 MLA Accuracy Issue by @LucasWilkinson in #18807
- [Platform][Dist] Make torch distributed process group extendable by @MengqingCao in #18763
- Enable Pydantic mypy checks and convert configs to Pydantic dataclasses by @hmellor in #17599
- [Frontend] add run batch to CLI by @reidliu41 in #18804
- decrement server_load on listen for disconnect by @daniel-salib in #18784
- [Core] Add Lora Support to Beam Search by @alex-jw-brooks in #18346
- [Chore] update ty configuration by @aarnphm in #18839
- [Misc] fix olmoe model layer for TP > 1 by @lengrongfu in #18828
- [V1][Metrics] Remove metrics that were deprecated in 0.8 by @markmc in #18837
- [Chore][Spec Decode] Update check NoneType instead of assigning variables by @aarnphm in #18836
- [Hardware][TPU][V1] Multi-LoRA Optimisations for the V1 TPU backend by @Akshat-Tripathi in #15655
- Remove checks for `None` for fields which should never be `None` by @hmellor in #17985
- [Core] Enable CUDA graphs for DP + All2All kernels by @varun-sundar-rabindranath in #18724
- [Bugfix][ROCm] fix the power of 2 exception from triton_unified_attention.py when running llama4 models and unit test fix by @hongxiayang in #18100
- Prevent the cross-encoder logic from being applied to classification tasks by @maxdebayser in #18838
- Add ability to use CUDAGraphs with use_inductor=False by @zou3519 in #17345
- [Bugfix][TPU] fix moe custom kernel import by @yaochengji in #18853
- [Doc][Neuron] Update documentation for Neuron by @elaineyz in #18868
- Skip device and quant Pydantic validation to make plugin device work by @Yikun in #18843
- Fixes a dead link in nightly benchmark readme by @nerdalert in #18856
- [Neuron] Add multi-LoRA support for Neuron. by @aws-satyajith in #18284
- [LoRA] Add LoRA support for InternVL by @jeejeelee in #18842
- [Doc] Remove redundant spaces from compatibility_matrix.md by @windsonsea in #18891
- [doc] add CLI doc by @reidliu41 in #18871
- [Bugfix] Fix misleading information in the documentation by @jeejeelee in #18845
- [Misc] Replace TODO in serving transcription by @NickLucche in #18895
- [Bugfix] Ensure tensors are contiguous during serialisation by @lgeiger in #18860
- [BugFix] Update pydantic to fix error on python 3.10 by @ProExpertProg in #18852
- Fix an error in dummy weight loading for quantization models by @Chenyaaang in #18855
- [Misc][Tools][Benchmark] Add benchmark_serving supports for llama.cpp. by @Duyi-Wang in #18692
- [Doc] Fix codeblocks formatting in LoRA adapters documentation by @Zerohertz in #18907
- [Bugfix] Fix the failing gte embedding test by @Isotr0py in #18720
- [Attention][V1] Toggle for v1 attention backend by @gshtras in #18275
- [ROCm][V0][Attention] Revert to the previous FA triton kernel by @gshtras in #18226
- [Deprecation] Disallow pos-args other than `model` when initializing `LLM` by @DarkLight1337 in #18802
- [Misc] Remove duplicate init for self.vllm_config by @googs1025 in #18896
- [V1] Allocate kv_cache with stride order for V1 by @NickLucche in #18775
- [BugFix] Make DP work with connector-delayed new requests by @njhill in #18559
- [P/D] NixlConnector DP fixes by @wseaton in #18903
- Use standalone_compile by default in torch >= 2.8.0 by @zou3519 in #18846
- [TPU] remove transpose ops in moe kernel by @yaochengji in #18923
- [Bugfix] Fix PP default fallback behavior for V1 by @mgoin in #18915
- [Misc] Update type annotation for rotary embedding `base` by @DarkLight1337 in #18914
- [TPU][CI/CD] Clean up docker for TPU tests. by @CAROLZXYZXY in #18926
- improve the robustness of parsing vlms config in AutoRound by @wenhuach21 in #18894
- [Bugfix] Consistent ascii handling in tool parsers by @chaunceyjiang in #18883
- [Model] Use AutoWeightsLoader for mamba2 by @jinyouzhi in #18918
- [docs] fix: fix markdown syntax by @eric-haibin-lin in #18927
- [ROCm] Remove unnecessary assertion of max_model_len in ROCM_AITER_MLA attention backend. by @vllmellm in #18938
- [Bugfix] Remove NVFP4 scales assertions to fix load_format=dummy by @mgoin in #18861
- [Deprecation] Remove mean pooling default for `Qwen2EmbeddingModel` by @DarkLight1337 in #18913
- [Misc] Fix benchmarks/README.md for speculative decoding by @rabi in #18897
- [doc] add mkdocs doc by @reidliu41 in #18930
- [Model] Use in-place adds in SigLIP by @lgeiger in #18922
- [Bugfix][Failing Test] Fix test_vllm_port.py by @rabi in #18618
- [Misc]Fix typo by @Always-Naive in #18947
- [Bugfix][TPU] Fix tpu model runner testcase failure by @CAROLZXYZXY in #18810
- [CI/Build] remove regex from build dependencies by @dtrifiro in #18945
- [Feature] minicpm eagle support by @huangyuxiang03 in #18943
- [doc] show the count for fork and watch by @reidliu41 in #18950
- [Docs] Update SECURITY.md with link to our security guide by @russellb in #18961
- Improve "failed to get the hash of the compiled graph" error by @zou3519 in #18956
- [Perf] API-server scaleout with many-to-many server-engine comms by @njhill in #17546
- Benchmark script for fp8 vs bf16 gemm by @mgoin in #17126
- [VLM] Add PP support and fix GPTQ inference for Ovis models by @Isotr0py in #18958
- [Misc] add group_size is -1 in awq quantization by @lengrongfu in #18910
- Tool parser regex timeout handling by @wseaton in https://github.com/vl...
v0.9.0.1
This patch release contains an important bugfix for the DeepSeek family of models on NVIDIA Ampere and below (#18807)
Full Changelog: v0.9.0...v0.9.0.1
v0.9.0
Highlights
This release features 649 commits, from 215 contributors (82 new contributors!)
- vLLM has upgraded to PyTorch 2.7! (#16859) This is a breaking change for environment dependencies.
- The default wheel has been upgraded from CUDA 12.4 to CUDA 12.8. We will distribute a CUDA 12.6 wheel as a GitHub release artifact.
- As a general rule of thumb, our CUDA version policy follows PyTorch's CUDA version policy.
- Enhanced NVIDIA Blackwell support. vLLM now ships with an initial set of optimized kernels on NVIDIA Blackwell for both attention and MLP.
- You can use our docker image or install the FlashInfer nightly wheel via `pip install https://download.pytorch.org/whl/cu128/flashinfer/flashinfer_python-0.2.5%2Bcu128torch2.7-cp38-abi3-linux_x86_64.whl`, then set `VLLM_ATTENTION_BACKEND=FLASHINFER` for better performance.
- Upgraded support for the new FlashInfer main branch (#15777). Please check out #18153 for the full roadmap.
- Initial DP, EP, PD support for large scale inference
- EP:
- DP:
- Decouple engine process management and comms (#15977)
- PD:
- Migrate docs from Sphinx to MkDocs (#18145, #18610, #18614, #18616, #18622, #18626, #18627, #18635, #18637, #18657, #18663, #18666, #18713)
Notable Changes
- Removal of CUDA 12.4 support due to PyTorch upgrade to 2.7.
- Change `top_k` to be disabled with `0` (still accept `-1` for now) (#17773)
- The seed is now set to `0` by default for the V1 Engine, meaning that different vLLM runs now yield the same outputs even if `temperature > 0`. This does not modify the random state in user code since workers are run in separate processes unless `VLLM_USE_V1_MULTIPROCESSING=0`. (#17929, #18741) See the sketch below.
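
A small sketch of both behavior changes using the standard `SamplingParams` fields: `top_k=0` now means "disabled", and passing an explicit `seed` overrides the new default of 0 while keeping the run reproducible. The model name is a placeholder.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen2.5-0.5B-Instruct")  # placeholder model
params = SamplingParams(
    temperature=0.8,
    top_k=0,      # 0 now means "top-k disabled" (-1 is still accepted for now)
    seed=1234,    # override the new default seed of 0 with another reproducible stream
    max_tokens=32,
)
print(llm.generate(["Tell me a joke."], params)[0].outputs[0].text)
```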
Model Enhancements
- Support MiMo-7B (#17433), MiniMax-VL-01 (#16328), Ovis 1.6 (#17861), Ovis 2 (#15826), GraniteMoeHybrid 4.0 (#17497), FalconH1* (#18406), LlamaGuard4 (#17315)
- Please install the development version of `transformers` (from source) to use Falcon-H1.
- Embedding models: nomic-embed-text-v2-moe (#17785), new class of gte models (#17986)
- Progress in Hybrid Memory Allocator (#17394, #17479, #17474, #17483, #17193, #17946, #17945, #17999, #18001, #18593)
- DeepSeek: perf enhancement by moving more calls into the CUDA-graph region (#17484, #17668), Function Call (#17784), MTP in V1 (#18435)
- Qwen2.5-1M: Implements dual-chunk-flash-attn backend for dual chunk attention with sparse attention support (#11844)
- Qwen2.5-VL speed enhancement via rotary_emb optimization (#17973)
- InternVL models with Qwen2.5 backbone now support video inputs (#18499)
Performance, Production and Scaling
- Support full cuda graph in v1 (#16072)
- Pipeline Parallelism: MultiprocExecutor support (#14219), `torchrun` (#17827)
- Support sequence parallelism combined with pipeline parallelism (#18243)
- Async tensor parallelism using compilation pass (#17882)
- Perf: Use small max_num_batched_tokens for A100 (#17885)
- Fast Model Loading: Tensorizer support for V1 and LoRA (#17926)
- Multi-modality: Automatically cast multi-modal input dtype before transferring device (#18756)
Security
- Prevent side-channel attacks via cache salting (#17045)
- Fix image hash collision in certain edge cases (#17378)
- Add `VLLM_ALLOW_INSECURE_SERIALIZATION` env var (#17490)
- Migrate to REGEX Library to prevent catastrophic backtracking (#18454, #18750)
Features
- CLI: `deprecated=True` (#17426)
- Frontend: progress bar for adding requests (#17525), `chat_template_kwargs` in `LLM.chat` (#17356), `/classify` endpoint (#17032), truncation control for embedding models (#14776), `cached_tokens` in response usage (#18149); see the sketch after this list
- LoRA: default local directory LoRA resolver plugin (#16855)
- Metrics: kv event publishing (#16750), API for accessing in-memory Prometheus metrics (#17010)
- Quantization: `nvidia/DeepSeek-R1-FP4` (#16362), Quark MXFP4 format (#16943), AutoRound (#17850), torchao models with `AOPerModuleConfig` (#17826), CUDA Graph support for V1 GGUF support (#18646)
- Reasoning: deprecate `--enable-reasoning` (#17452)
- Spec Decode: EAGLE share input embedding (#17326), torch.compile & cudagraph to EAGLE (#17211), EAGLE3 (#17504), log accumulated metrics (#17913), Medusa (#17956)
- Structured Outputs: Thinking compatibility (#16577), Spec Decoding (#14702), Qwen3 reasoning parser (#17466), `tool_choice: required` for Xgrammar (#17845), Structural Tag with Guidance backend (#17333)
- Transformers backend: named parameters (#16868), interleaved sliding window attention (#18494)
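
A hedged example of the new `chat_template_kwargs` plumbing in `LLM.chat` (see the Frontend item above): the sketch forwards an `enable_thinking` flag to the chat template. The key is only illustrative (some Qwen3 chat templates use it), and the model name is a placeholder; whatever keys you pass are forwarded to the tokenizer's chat template.

```python
from vllm import LLM, SamplingParams

llm = LLM(model="Qwen/Qwen3-0.6B")  # placeholder model
messages = [{"role": "user", "content": "Give me one sentence about vLLM."}]

outputs = llm.chat(
    messages,
    SamplingParams(max_tokens=64),
    chat_template_kwargs={"enable_thinking": False},  # forwarded to the chat template
)
print(outputs[0].outputs[0].text)
```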
Hardware
- NVIDIA: cutlass support for blackwell fp8 blockwise gemm (#14383)
- TPU: Multi-LoRA implementation(#14238), default max-num-batched-tokens (#17508), V1 backend by default (#17673), top-logprobs (#17072)
- Neuron: NeuronxDistributedInference support (#15970), Speculative Decoding, Dynamic on-device sampling (#16357), Mistral Model (#18222), Multi-LoRA (#18284)
- AMD: Enable FP8 KV cache on V1 (#17870), Tuned fused moe config for Qwen3 MoE on MI300X (#17535, #17530), AITER biased group topk (#17955), Block-Scaled GEMM (#14968), MLA (#17523), Radeon GPU use Custom Paged Attention (#17004), reduce the number of environment variables in command line (#17229)
- Extensibility: Make PiecewiseBackend pluggable and extendable (#18076)
Documentation
- Update quickstart and install for cu128 using `--torch-backend=auto` (#18505)
- NVIDIA TensorRT Model Optimizer (#17561)
- Usage of Qwen3 thinking (#18291)
Developer Facing
- Benchmark: Add single turn MTBench to Serving Bench (#17202)
- Usability: Decrease import time of `vllm.multimodal` (#18031)
- Code Format: Code formatting using `ruff format` (#17656, #18068, #18400)
- Process:
- Propose a deprecation policy for the project (#17063)
- Testing: expanding torch nightly tests (#18004)
What's Changed
- Support loading transformers models with named parameters by @wuisawesome in #16868
- Add tuned triton fused_moe configs for Qwen3Moe by @mgoin in #17328
- [Benchmark] Add single turn MTBench to Serving Bench by @ekagra-ranjan in #17202
- [Optim] Compute multimodal hash only once per item by @DarkLight1337 in #17314
- implement Structural Tag with Guidance backend by @mmoskal in #17333
- [V1][Spec Decode] Make Eagle model arch config driven by @ekagra-ranjan in #17323
- [model] make llama4 compatible with pure dense layers by @luccafong in #17315
- [Bugfix] Fix `numel()` downcast in fused_layernorm_dynamic_per_token_quant.cu by @r-barnes in #17316
- Ignore `'<string>'` filepath by @zou3519 in #17330
- [Bugfix] Add contiguous call inside rope kernel wrapper by @timzsu in #17091
- [Misc] Add a Jinja template to support Mistral3 function calling by @chaunceyjiang in #17195
- [Model] support MiniMax-VL-01 model by @qscqesze in #16328
- [Misc] Move config fields to MultiModalConfig by @DarkLight1337 in #17343
- [Misc]Use a platform independent interface to obtain the device attributes by @ponix-j in #17100
- [Fix] Documentation spacing in compilation config help text by @Zerohertz in #17342
- [Build][Bugfix] Restrict setuptools version to <80 by @gshtras in #17320
- [Model] Ignore rotary embed load for Cohere model by @ekagra-ranjan in #17319
- Update docs requirements by @hmellor in #17379
- [Doc] Fix QWen3MOE info by @jeejeelee in #17381
- [Bugfix] Clean up MiniMax-VL and fix processing by @DarkLight1337 in #17354
- `pre-commit autoupdate` by @hmellor in #17380
- [Frontend] Support `chat_template_kwargs` in `LLM.chat` by @DarkLight1337 in #17356
- Transformers backend tweaks by @hmellor in #17365
- Fix: Spelling of inference by @a2q1p in #17387
- Improve literal dataclass field conversion to argparse argument by @hmellor in #17391
- [V1] Remove num_input_tokens from attn_metadata by @heheda12345 in #17193
- [Bugfix] add qwen3 reasoning-parser fix content is None when disable … by @mofanke in #17369
- fix gemma3 results all zero by @mayuyuace in #17364
- [Misc][ROCm] Exclude `cutlass_mla_decode` for ROCm build by @tywuAMD in #17289
- Enabling multi-group kernel tests. by @Alexei-V-Ivanov-AMD in https://github.com/vllm-p...
v0.8.5.post1
This post-release contains two bug fixes, for a memory leak and for model accuracy:
- Fix memory leak in `_cached_reqs_data` (#17567)
- Fix sliding window attention in V1 giving incorrect results (#17574)
Full Changelog: v0.8.5...v0.8.5.post1
v0.8.5
This release contains 310 commits from 143 contributors (55 new contributors!).
Highlights
This release features important multi-modal bug fixes, day 0 support for Qwen3, and xgrammar's structure tag feature for tool calling.
Model Support
- Day 0 support for Qwen3 and Qwen3MoE. This release fixes fp8 weight loading (#17318) and adds tuned MoE configs (#17328).
- Add ModernBERT (#16648)
- Add Granite Speech Support (#16246)
- Add PLaMo2 (#14323)
- Add Kimi-VL model support (#16387)
- Add Qwen2.5-Omni model support (thinker only) (#15130)
- Snowflake Arctic Embed (Family) (#16649)
- Accuracy fixes for Llama4 Int4 (#16801), chat template for Llama 4 models (#16428), enhanced AMD support (#16674, #16847)
V1 Engine
- Add `structural_tag` support using xgrammar (#17085)
- Disaggregated serving:
- Clean up: Remove Sampler from Model Code (#17084)
- MLA: Simplification to batch P/D reordering (#16673)
- Move usage stats to worker and start logging TPU hardware (#16211)
- Support FlashInfer Attention (#16684)
- Faster incremental detokenization (#15137)
- EAGLE-3 Support (#16937)
Features
- Validate urls object for multimodal content parts (#16990)
- Prototype support sequence parallelism using compilation pass (#16155)
- Add sampling params to the `v1/audio/transcriptions` endpoint (#16591); see the sketch after this list
- Enable vLLM to Dynamically Load LoRA from a Remote Server (#10546)
- Add `vllm bench [latency, throughput]` CLI commands (#16508)
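
A hedged sketch of passing sampling parameters to the transcription endpoint through the standard `openai` client: `temperature` and `language` are regular client arguments; the model name and audio file are placeholders, and how the server maps them onto vLLM sampling params is assumed from the note above.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    transcript = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",  # placeholder ASR model served by vLLM
        file=audio,
        language="en",
        temperature=0.0,  # assumed to be honored as a sampling param by the server
    )
print(transcript.text)
```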
Performance
- Attention:
- MoE:
- Support Microsoft Runtime Kernel Lib for our Low Precision Computation - BitBLAS (#6036)
- Optimize rotary_emb implementation to use Triton operator for improved performance (#16457)
Hardware
- TPU:
- AMD:
- AITER Fused MOE V1 Support (#16752)
- Integrate Paged Attention Kernel from AITER (#15001)
- Support AITER MLA (#15893)
- Upstream prefix prefill speed up for vLLM V1 (#13305)
- Adding fp8 and variable length sequence support to Triton FAv2 kernel (#12591)
- Add skinny gemms for unquantized linear on ROCm (#15830)
- Follow-ups for Skinny Gemms on ROCm. (#17011)
Documentation
- Add open-webui example (#16747)
- Document Matryoshka Representation Learning support (#16770)
- Add a security guide (#17230)
- Add example to run DeepSeek with Ray Serve LLM (#17134)
- Benchmarks for audio models (#16505)
Security and Dependency Updates
- Don't bind tcp zmq socket to all interfaces (#17197)
- Use safe serialization and fix zmq setup for mooncake pipe (#17192)
- Bump Transformers to 4.51.3 (#17116)
Build and testing
- Add property-based testing for vLLM endpoints using an API defined by an OpenAPI 3.1 schema (#16721)
Breaking changes 🚨
- `--enable-chunked-prefill`, `--multi-step-stream-outputs`, and `--disable-chunked-mm-input` can no longer explicitly be set to `False`. Instead, add `no-` to the start of the argument (i.e. `--enable-chunked-prefill` becomes `--no-enable-chunked-prefill`) (#16533)
What's Changed
- Improve configs - `SchedulerConfig` by @hmellor in #16533
- [Misc] remove warning if triton>=3.2.0 by @DefTruth in #16553
- [Misc] refactor examples by @reidliu41 in #16563
- [Misc] Update usage with mooncake lib for kv transfer by @ShangmingCai in #16523
- [fix]: Dockerfile.ppc64le fixes for opencv-python and hf-xet by @Shafi-Hussain in #16048
- [Bugfix] Multi-modal caches not acting like LRU caches by @DarkLight1337 in #16593
- [TPU][V1] Fix exponential padding when `max-num-batched-tokens` is not a power of 2 by @NickLucche in #16596
- Fix triton install condition on CPU by @hmellor in #16600
- s390x: Fix PyArrow build and add CPU test script for Buildkite CI by @Nash-123 in #16036
- [Model][VLM] Add Kimi-VL model support by @courage17340 in #16387
- [Hardware][TPU] Add torchvision to tpu dependency file by @lsy323 in #16616
- [DOC][TPU] Add core idea about avoiding recompilation after warmup by @yaochengji in #16614
- config check sleep mode support oot platforms by @celestialli in #16562
- [Core][Bugfix] Fix Offline MM Beam Search by @alex-jw-brooks in #16390
- [Kernel] moe wna16 marlin kernel by @jinzhen-lin in #14447
- [BugFix]: Update minimum `pyzmq` version by @taneem-ibrahim in #16549
- [Bugfix] Fix tests/kernels/test_mamba_ssm_ssd.py by @tlrmchlsmth in #16623
- [Bugfix] Fix broken GritLM model and tests (missing pooling_metadata) by @pooyadavoodi in #16631
- Add `vllm bench [latency, throughput]` CLI commands by @mgoin in #16508
- Fix vLLM x torch.compile config caching by @zou3519 in #16491
- [Misc] refactor argument parsing in examples by @reidliu41 in #16635
- [CI/Build] Fix LoRA OOM by @jeejeelee in #16624
- Add "/server_info" endpoint in api_server to retrieve the vllm_config. by @Cangxihui in #16572
- [Kernel] Remove redundant Exp calculations by @DefTruth in #16123
- [Misc] Update `compressed-tensors` WNA16 to support zero-points by @dsikka in #14211
- [Misc] Enable vLLM to Dynamically Load LoRA from a Remote Server by @angkywilliam in #10546
- [Model] Add PLaMo2 by @Alnusjaponica in #14323
- [Bugfix] fix gpu docker image mis benchmarks dir by @lengrongfu in #16628
- [Misc] Modify LRUCache touch by @jeejeelee in #16689
- Disable remote caching when calling compile_fx by @zou3519 in #16611
- [Feature] add model aware kv ops helper by @billishyahao in #16020
- [ROCM] Bind triton version to 3.2 in requirements-built.txt by @SageMoore in #16664
- [V1][Structured Output] Move xgrammar related utils to `backend_xgrammar.py` by @shen-shanshan in #16578
- [CI] Cleanup `additional_dependencies: [toml]` for pre-commit yapf hook by @yankay in #16405
- [Misc] refactor examples series by @reidliu41 in #16708
- [Doc] Improve OOM troubleshooting by @DarkLight1337 in #16704
- [Bugfix][Kernel] fix potential cuda graph broken for merge_attn_states kernel by @DefTruth in #16693
- [Model] support modernbert by @xsank in #16648
- [Hardware] Add processor inputs to platform validation by @joerunde in #16680
- Improve error for structured output backend selection by @hmellor in #16717
- [Misc] Remove redundant comment by @jianzs in #16703
- Help user create custom model for Transformers backend remote code models by @hmellor in #16719
- [V1][Performance] Implement custom serializaton for MultiModalKwargs [Rebased] by @p88h in #16432
- [V1][Spec Dec Bug Fix] Respect Spec Dec Method Specification by @luyuzhe111 in #16636
- Adding vllm buildkite job for IBM Power by @AaruniAggarwal in #16679
- [V1][Frontend] Improve Shutdown And Logs by @robertgshaw2-redhat in #11737
- [rocm][V0] fix selection logic for custom PA in V0 by @divakar-amd in #16426
- [Bugfix] Update Florence-2 tokenizer to make grounding tasks work by @Isotr0py in #16734
- [Bugfix] Revert max_prompt_len validation for decoder-only models. by @davidheineman in #16741
- [V1] Remove log noise when idle by @russellb in #16735
- [Ray] Improve documentation on batch inference by @richardliaw in #16609
- [misc] ignore marlin_moe_wna16 local gen codes by @DefTruth in #16760
- [Doc] Add more tips to avoid OOM by @DarkLight1337 in #16765
- [doc] add open-webui example by @reidliu41 in #16747...
v0.8.4
This release contains 180 commits from 84 contributors (25 new contributors!).
Highlights
This release includes important accuracy fixes for Llama4 models; if you are using them, we highly recommend updating.
Model
- Llama4 (#16113,#16509) bug fix and enhancements:
- QK norm should not be shared across heads (#16311)
- Enable attention temperature tuning by default for long context (>32k) (#16439)
- Index Error When Single Request Near Max Context (#16209)
- Add tuned FusedMoE kernel config for Llama4 Scout, TP=8 on H100 (#16488)
- Update to transformers==4.51.1 (#16257)
- Added chat templates for LLaMa4 pythonic tool calling (#16463)
- Optimized topk for topk=1 (#16512)
- Add warning for Attention backends that do not support irope yet (#16212)
- Support Qwen3 and Qwen3MoE (#15289), smolvlm (#16017), jinaai/jina-embeddings-v3 (#16120), InternVL3 (#16495), GLM-4-0414 (#16338)
API
- Estimate max-model-len using available KV cache memory. The error message now hints at how to set `--max-model-len` (#16168)
- Add hf_token to EngineArgs (#16093)
- Enable regex support with xgrammar in V0 engine (#13228)
- Support matryoshka representation / support embedding API dimensions (#16331); see the sketch after this list
- Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` (#15202)
- Support for TorchAO quantization (#14231)
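
A hedged sketch of the embedding `dimensions` support noted above, using the `openai` client against a vLLM server that serves a Matryoshka-capable embedding model; the model name is a placeholder.

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.embeddings.create(
    model="jinaai/jina-embeddings-v3",  # placeholder Matryoshka-capable model
    input=["vLLM makes LLM serving easy."],
    dimensions=256,  # request a truncated Matryoshka representation
)
print(len(resp.data[0].embedding))  # expected: 256
```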
Hardware
- Intel-Gaudi: Multi-step scheduling implementation for HPU (#12779)
- TPU:
Performance
- DeepSeek MLA: a new merge_attn_states CUDA kernel, 3x speedup (#16173)
- MoE: Support W8A8 channel-wise weights and per-token activations in triton fused_moe_kernel (#16366)
- Add support to modelopt quantization of Mixtral model (#15961)
- Enable PTPC FP8 for CompressedTensorsW8A8Fp8MoEMethod (triton fused_moe) (#16537)
V1 Engine Core
- Enable multi-input by default (#15799)
- Scatter and gather placeholders in the model runner (#16076)
- Set structured output backend to `auto` by default (#15724)
- Zero-copy tensor/ndarray serialization/transmission (#13790)
- Eagle Model loading (#16035)
- KV cache slots for eagle heads (#16370)
- Add `supports_structured_output()` method to Platform (#16148)
Developer Facing
- Add sampling parameters to benchmark_serving. (#16022)
- AutoWeightsLoader refactoring (#16383, #16325, #16088, #16203, #16103)
- Unified configuration with engine args: `LoadConfig` (#16422), `ParallelConfig` (#16332)
What's Changed
- [Misc] Auto detect bitsandbytes pre-quantized models by @tristanleclercq in #16027
- [CI] Fix benchmark script level by @khluu in #16089
- fix: support clang17 for macos and fix the real libomp by @yihong0618 in #16086
- [doc] fix 404 by @reidliu41 in #16082
- Revert "doc: add info for macos clang errors (#16049)" by @yihong0618 in #16091
- Fix some capitalisations in generated examples doc titles by @hmellor in #16094
- [Misc] format output for encoder_decoder.py by @reidliu41 in #16095
- [Misc] Remove redundant code by @chaunceyjiang in #16098
- [Bugfix] fix use_atomic_add support of marlin kernel when using v1 engine by @jinzhen-lin in #15946
- [Model] use AutoWeightsLoader for phi, gemma, deepseek by @jonghyunchoe in #16088
- [Model] fix model testing for TeleChat2ForCausalLM and V0 llama4 by @luccafong in #16112
- [Benchmark] Add sampling parameters to benchmark_serving. by @hyeygit in #16022
- [Frontend] Fix typo in tool chat templates for llama3.2 and toolace by @bjj in #14501
- [CI][V1] Fix passing `tokenizer` as kwarg to `validate_guidance_grammar` by @ywang96 in #16117
- [Misc] refactor example eagle by @reidliu41 in #16100
- [Doc][Bugfix] Add missing EOF in k8s deploy doc by @psschwei in #16025
- [Misc] Improve model redirect to accept json dictionary by @Isotr0py in #16119
- [Model] use AutoWeightsLoader for stablelm,starcoder2,zamba2 by @lengrongfu in #16103
- [Bugfix] LoRA : Fix the order in which the kernels process LoRAs by @varun-sundar-rabindranath in #16040
- [Bugfix] add hf_token to EngineArgs by @paolovic in #16093
- [Misc] update requires-python in pyproject.toml by @reidliu41 in #16116
- [TPU] Update PyTorch/XLA by @yaochengji in #16130
- [V1][Minor] Optimize get_cached_block by @WoosukKwon in #16135
- Fix requires-python by @martinhoyer in #16132
- [Metrics] Add bucket for `request_latency`, `time_to_first_token` and `time_per_output_token` by @yankay in #15202
- [V1][Minor] Minor simplification for get_computed_blocks by @WoosukKwon in #16139
- [Misc] Update Mistral-3.1 example by @DarkLight1337 in #16147
- [Bugfix] Make dummy encoder prompt padding alternative and add missing warnings by @Isotr0py in #16129
- [CI] Set max transformers version for Ultravox model test by @ywang96 in #16149
- doc: fix some typos in doc by @yihong0618 in #16154
- [VLM] Florence-2 supports online serving by @Isotr0py in #16164
- [V1][Structured Output] Add `supports_structured_output()` method to Platform by @shen-shanshan in #16148
- [Model] Add Qwen3 and Qwen3MoE by @YamPengLi in #15289
- [Misc] improve example mlpspeculator and llm_engine_example by @reidliu41 in #16175
- [Doc]Update image to latest version by @WangErXiao in #16186
- Upstream Llama4 Support to Main by @houseroad in #16113
- [Bugfix] Re-enable support for `ChatGLMForConditionalGeneration` by @DarkLight1337 in #16187
- [V1] Revert the default `max_num_seqs` to V0 values for most hardware by @DarkLight1337 in #16158
- [Misc] Print encoder seq len to short warning only once by @gshtras in #16193
- [Misc] Human-readable `max-model-len` cli arg by @NickLucche in #16181
- [Misc] Move Llama 4 projector call into encoder execution by @ywang96 in #16201
- [Bugfix] Fix guidance backend for Qwen models by @benchislett in #16210
- [V1][BugFix] Exit properly if engine core fails during startup by @njhill in #16137
- [Misc] add description attribute in CLI by @reidliu41 in #15921
- [Bugfix][V0] XGrammar structured output supports Enum by @leon-seidel in #15878
- Torchao by @drisspg in #14231
- [ROCm][Bugfix][FP8] Make fp8 quant respect fused modules mapping by @mgoin in #16031
- [core] do not send error across process by @youkaichao in #16174
- [Misc] Update compressed-tensors to version 0.9.3 by @mlsw in #16196
- Update BASE_IMAGE to 2.22 release of Neuron by @aws-satyajith in #16218
- [V1] Scatter and gather placeholders in the model runner by @ywang96 in #16076
- [Bugfix] fix use-ep bug to enable ep by dp/tp size > 1 by @zxfan-cpu in #16161
- Add warning for Attention backends that do not support irope yet by @sarckk in #16212
- [Bugfix] Do not skip "empty" parts of chats that are parsable by @mgoin in #16219
- [Bugfix] Fix and reorganize broken GGUF tests and bump gguf version by @Isotr0py in #16194
- [torch.compile][TPU] Make @support_torch_compile work for XLA backend by @lsy323 in #15782
- [V1] Add `disable_chunked_mm_input` arg to disable partial mm input prefill by @mgoin in #15837
- [Misc] Merge the logs of pp layers partitions by @kebe7jun in #16225
- [Docs] Add Slides from Singapore Meetup by @simon-mo in #16213
- [Misc] format and refactor some examples by @reidliu41 in #16252
- [Misc] Add warning for multimodal data in LLM.beam_search by @alex-jw-brooks in #16241
- [Model] use AutoWeightsLoader for phimoe,qwen2_moe,qwen3_moe b...