
[Bug]: v1 AsyncLLM model hangs if provided with 2 successive batches of items to process #17385

Closed
clefourrier opened this issue Apr 29, 2025 · 17 comments
Labels
bug Something isn't working

Comments

@clefourrier

clefourrier commented Apr 29, 2025

Your current environment

Possibly relevant packages from pip list

Package                                  Version       Editable project location
---------------------------------------- ------------- -------------------------
accelerate                               1.6.0
huggingface-hub                          0.30.2
mistral_common                           1.5.4
nest_asyncio                             1.6.0
numpy                                    1.26.4
nvidia-cublas-cu12                       12.4.5.8
nvidia-cuda-cupti-cu12                   12.4.127
nvidia-cuda-nvrtc-cu12                   12.4.127
nvidia-cuda-runtime-cu12                 12.4.127
nvidia-cudnn-cu12                        9.1.0.70
nvidia-cufft-cu12                        11.2.1.3
nvidia-cufile-cu12                       1.11.1.6
nvidia-curand-cu12                       10.3.5.147
nvidia-cusolver-cu12                     11.6.1.9
nvidia-cusparse-cu12                     12.3.1.170
nvidia-cusparselt-cu12                   0.6.2
nvidia-nccl-cu12                         2.21.5
nvidia-nvjitlink-cu12                    12.4.127
nvidia-nvtx-cu12                         12.4.127
openai                                   1.55.2
pillow                                   11.2.1
pip                                      25.0.1
protobuf                                 3.20.3
pyarrow                                  19.0.1
pycountry                                24.6.1
pycparser                                2.22
pydantic                                 2.11.3
pydantic_core                            2.33.1
ray                                      2.43.0
regex                                    2024.11.6
ruff                                     0.2.2
safetensors                              0.5.3
stanza                                   1.10.1
starlette                                0.46.2
tokenizers                               0.21.1
torch                                    2.6.0
torchaudio                               2.6.0
torchvision                              0.21.0
tqdm                                     4.67.1
transformers                             4.51.3
triton                                   3.2.0
urllib3                                  2.4.0
uvicorn                                  0.34.2
vllm                                     0.8.4

Possibly relevant env vars

"VLLM_LOGGING_LEVEL":"DEBUG",
"VLLM_TRACE_FUNCTION": "1", 
"CUBLAS_WORKSPACE_CONFIG": ":4096:8",
"VLLM_USE_V1":"1",
"VLLM_HOST_IP":"localhost",
"VLLM_WORKER_MULTIPROC_METHOD": "spawn",

🐛 Describe the bug

Expected behavior: launch an AsyncLLM model, then use it to generate on several successive batches without issue (here, a batch is a group of inputs provided together in a single asyncio.run call).
Actual behavior: the AsyncLLM model processes the first batch without issue, then hangs on the second batch.

Notes:

  • MWE was run on a node with 2 H100 GPUs.
  • As you can see, we're using DP=2, TP=1, PP=1.
import asyncio
from vllm.v1.engine.async_llm import AsyncLLM
from vllm import AsyncEngineArgs, RequestOutput, SamplingParams

async def _async_one_item(
    model,
    params,
    index: str,
):
    output = await anext(
        model.generate(request_id=index, prompt=f"This is a story about {index}", sampling_params=params)
    )
    return output

async def _async_batch(model, params, name) -> list:
    processed_requests = [
        _async_one_item(model, params, index=f"{name}_{index}")
        for index in range(50)
    ]
    results = await asyncio.gather(*processed_requests)
    return results

# This version is also failing
async def _async_batch_v2(model, params, name) -> list:
    if model.engine_core.is_sleeping_async():
        await model.engine_core.wake_up_async()
    processed_requests = [
        _async_one_item(model, params, index=f"{name}_{index}")
        for index in range(50)
    ]
    results = await asyncio.gather(*processed_requests)
    await model.engine_core.sleep_async()
    return results

def main():
    "model_name=,is_async=True,data_parallel_size=2,tensor_parallel_size=1,dtype=bfloat16,max_model_length=32768,max_num_batched_tokens=32768,gpu_memory_utilization=0.8,generation_parameters={max_new_tokens:32768,temperature:0.6,top_p:0.95}"
    model_args = {
        "model": "Qwen/Qwen2.5-7B",
        "gpu_memory_utilization": 0.8,
        "revision": "main",
        "dtype": "bfloat16",
        "trust_remote_code": True,
        "tensor_parallel_size": 1,
        "data_parallel_size": 2,
        "pipeline_parallel_size": 1,
        "swap_space": 4,
        "seed": int(0),
        "enforce_eager": True,
    }

    model = AsyncLLM.from_engine_args(AsyncEngineArgs(**model_args))

    sampling_params = SamplingParams()
    sampling_params.n = 1
    sampling_params.max_tokens = 10
    sampling_params.logprobs = 1


    # This works just fine
    responses_1: list[RequestOutput] = asyncio.run(_async_batch(model=model, params=sampling_params, name="exp1"))
    for response in responses_1:
        print(response)

    # This hangs after adding all the requests
    # Last log lines of debug mode:
    # INFO 04-29 12:28:56 [async_llm.py:228] Added request exp2_49.
    # (EngineCore_1 pid=1452003) (EngineCore_0 pid=1452002) 
    # DEBUG 04-29 12:28:59 [core.py:411] EngineCore waiting for work.
    # DEBUG 04-29 12:28:59 [core.py:411] EngineCore waiting for work.
    responses_2: list[RequestOutput] = asyncio.run(_async_batch(model=model, params=sampling_params, name="exp2"))
    # We never reach this
    for response in responses_2:
        print(response)

if __name__ == "__main__":
    main()

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and asked the chatbot living at the bottom right corner of the documentation page, which can answer lots of frequently asked questions.
clefourrier added the bug label on Apr 29, 2025
@DarkLight1337
Member

You called sleep_async after the first set of batches, so when you submit the second set of batches, the engine is still in sleep mode.

@clefourrier
Author

clefourrier commented Apr 29, 2025

Ah wait, I removed this in my test; let me make sure and edit my example (I had tried manually waking up / putting the core engine to sleep).

@clefourrier
Author

clefourrier commented Apr 29, 2025

I confirm it's also not working when the sleep_async line is removed; I edited the issue to reflect the two cases I tried.

@DarkLight1337
Member

Shouldn't you also await the sleeping async check?
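For reference, a minimal sketch of what the awaited check could look like, assuming the engine_core API exactly as used in _async_batch_v2 above (is_sleeping_async() presumably returns a coroutine, so the bare if test would always be truthy unless awaited):

# Hedged sketch: await the sleep-state check before deciding whether to wake the engine.
# Names are taken from the MWE above, not from a verified vLLM API reference.
if await model.engine_core.is_sleeping_async():
    await model.engine_core.wake_up_async()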

@clefourrier
Author

Hm, possibly? Though using _async_batch_v2 1) did not fail, and 2) did stop/start the worker.

However, the main issue is not coming from this, as the MWE also fails with _async_batch, which never tries to manually put the core engine to sleep or restart it.

@clefourrier
Author

Did you try running the MWE?

@DarkLight1337
Member

I am outside right now so haven't tried.

@clefourrier
Author

No problem! When you try it, tell me if you can't reproduce, or if you need further logs :)

@DarkLight1337
Member

I don't have remaining bandwidth today, sorry. @robertgshaw2-redhat @njhill can you help repro/debug?

@njhill
Member

njhill commented Apr 29, 2025

@clefourrier you shouldn't reuse the AsyncLLM instance across multiple calls to asyncio.run().

See: https://docs.python.org/3.9/library/asyncio-task.html#asyncio.run

You should structure your code like:

async def main():

    model = AsyncLLM.from_engine_args(AsyncEngineArgs(**model_args))

    # ...

    await async_batch(model=model, ...)

    # more batch calls ...

if __name__ == "__main__":
    asyncio.run(main())
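
Adapted to the MWE above, that restructuring would look roughly like the sketch below. It is only a sketch: model_args and _async_batch are assumed to be the ones defined in the original report, and the single asyncio.run(main()) call keeps the AsyncLLM instance and every batch on one event loop.

import asyncio
from vllm.v1.engine.async_llm import AsyncLLM
from vllm import AsyncEngineArgs, SamplingParams

async def main():
    # One event loop for the whole lifetime of the AsyncLLM instance.
    model = AsyncLLM.from_engine_args(AsyncEngineArgs(**model_args))

    sampling_params = SamplingParams(n=1, max_tokens=10, logprobs=1)

    # Consecutive batches on the same loop: no second asyncio.run() is involved.
    responses_1 = await _async_batch(model=model, params=sampling_params, name="exp1")
    responses_2 = await _async_batch(model=model, params=sampling_params, name="exp2")

    for response in responses_1 + responses_2:
        print(response)

if __name__ == "__main__":
    asyncio.run(main())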

@clefourrier
Author

So you don't expect the final interface of AsyncLLM to be similar to the simple LLM or to other models that can be called async, for example all the API models used with litellm/openai/tgi?

@njhill
Member

njhill commented Apr 29, 2025

So you don't expect the final interface of AsyncLLM to be similar to the simple LLM or to other models that can be called async, for example all the API models used with litellm/openai/tgi?

@clefourrier I don't really understand your question. How is it that you're trying to use it? You can make as many async / concurrent batch calls to it as you like, it just needs to be in the context of a single event loop. This is more of a "how Python asyncio works" kind of thing than anything to do with the AsyncLLM interface.
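
For instance, within a single event loop the two batches from the MWE could even be submitted concurrently; a small sketch reusing _async_batch and the sampling params from the report (names assumed from the MWE, not a vLLM-specific API):

# Sketch: both batches run concurrently on the same event loop,
# using the _async_batch helper defined in the original report.
async def run_both_batches(model, sampling_params):
    return await asyncio.gather(
        _async_batch(model, sampling_params, name="exp1"),
        _async_batch(model, sampling_params, name="exp2"),
    )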

@clefourrier
Author

It's highly possible I'm missing something.

In evaluation, I usually want to run requests of a similar type together (for example, generations with similar sampling parameters, or requests looking at logprobs vs. requests looking at generated text). Each type runs in its own loop, and these loops do not happen at the same time and are not concurrent with each other.
Ex: loop of request type A, then loop of request type B, etc.

I assumed that since asyncio.run always creates and closes a new event loop, I could call it consecutively, but that may be the flaw in my reasoning.
(If that is the case, sorry for any time wasted on your side)

@clefourrier
Author

Ok let me come back to you on this

@njhill
Member

njhill commented Apr 29, 2025

Can't you run them consecutively using the same event loop?
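
If restructuring everything under a single asyncio.run() is awkward, one hedged alternative (plain asyncio, nothing vLLM-specific; whether it interacts cleanly with AsyncLLM's background tasks would still need verifying) is to keep one long-lived loop and reuse it for the consecutive calls:

# Sketch: one event loop reused for consecutive batch calls instead of a fresh
# loop per asyncio.run(). model, sampling_params and _async_batch are the
# objects/helpers from the MWE above.
loop = asyncio.new_event_loop()
asyncio.set_event_loop(loop)
try:
    responses_1 = loop.run_until_complete(_async_batch(model, sampling_params, name="exp1"))
    responses_2 = loop.run_until_complete(_async_batch(model, sampling_params, name="exp2"))
finally:
    loop.close()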

@clefourrier
Author

Not trivially given where I'm running this ^^

@clefourrier
Author

clefourrier commented Apr 29, 2025

OK, thanks to you both for your help; I ended up changing the pipeline code of our eval suite to support this!
Thanks for taking the time to answer in detail, I learned stuff from this interaction! :)
