
server-cuda closes connection while still processing tasks #6545

Closed
@lucasBerardiniMarvik

Description

Issue to be published in the llama.cpp GitHub:

I am using the Docker image ghcr.io/ggerganov/llama.cpp:server-cuda to deploy the server in a Kubernetes cluster on AWS with four A10G GPUs. This is the configuration:

- name: llama-cpp-server
  image: ghcr.io/ggerganov/llama.cpp:server-cuda
  args:
    - "--model"
    - "/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf"
    - "--port"
    - "8000"
    - "--host"
    - "0.0.0.0"
    - "--ctx-size"
    - "100000"
    - "--n-gpu-layers"
    - "256"
    - "--cont-batching"
    - "--parallel"
    - "10"
    - "--batch-size"
    - "4096"

(not sure if it adds context, but I'm using a persistentVolumeClaim where I download and persist the model)

I have already reviewed the server README and all the command-line options, and I have also tested different server-cuda image tags from the past few days.

Based on this discussion, I understand I have 10 slots for processing parallel requests, so I should be able to process 10 sequences of 10,000 tokens each (the 100,000-token context split evenly across the 10 slots). The GPUs I'm using should be able to handle this load.

With this configuration, I ran a test sending 5 concurrent requests of ~2,300 tokens each. I understand this should be well below the maximum processable load, but the server closes the connection while it is still processing the tasks in the occupied slots. The process is the following (a rough client-side sketch of this test is included after the list):

  1. I send 5 requests to the server concurrently.
  2. The server closes the connection without sending a response for some of the requests.
  3. I check /health again and see that the slots are still running.
  4. I check the server logs and see that all tasks finish successfully; I don't see any error logs.
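
For reference, this is roughly what the test client does (a minimal sketch, assuming the server is reachable at http://localhost:8000 and using the built-in /completion and /health endpoints; the prompt, n_predict value, and timeouts are placeholders, and the exact fields in the /health response depend on the server version):

```python
# Fire N_REQUESTS concurrent /completion requests, then poll /health to see
# whether slots are still busy after some connections were closed client-side.
import concurrent.futures

import requests

BASE_URL = "http://localhost:8000"  # placeholder; in my case this is the k8s service address
N_REQUESTS = 5
PROMPT = "..."  # placeholder for a ~2,300-token prompt


def send_completion(i: int) -> str:
    try:
        resp = requests.post(
            f"{BASE_URL}/completion",
            json={"prompt": PROMPT, "n_predict": 256},  # n_predict is an arbitrary value
            timeout=600,
        )
        resp.raise_for_status()
        return f"request {i}: ok"
    except requests.RequestException as exc:
        # Some of the requests end up here with a closed/aborted connection.
        return f"request {i}: failed ({exc})"


with concurrent.futures.ThreadPoolExecutor(max_workers=N_REQUESTS) as pool:
    for result in pool.map(send_completion, range(N_REQUESTS)):
        print(result)

# Even after the failures above, the server still reports busy slots here.
print(requests.get(f"{BASE_URL}/health", timeout=10).json())
```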

I am trying to understand whether there is some additional configuration I'm missing, or how I can improve concurrency in these cases without handling connection errors on the client side (additionally, when the connection gets closed, I cannot reprocess the requests immediately, since the server is still busy with the previous ones).
