Description
Issue to be published in the llama.cpp GitHub repository:
I am using the Docker image ghcr.io/ggerganov/llama.cpp:server-cuda to deploy the server in a Kubernetes cluster on AWS with four A10G GPUs. This is the container configuration:
```yaml
- name: llama-cpp-server
  image: ghcr.io/ggerganov/llama.cpp:server-cuda
  args:
    - "--model"
    - "/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf"
    - "--port"
    - "8000"
    - "--host"
    - "0.0.0.0"
    - "--ctx-size"
    - "100000"
    - "--n-gpu-layers"
    - "256"
    - "--cont-batching"
    - "--parallel"
    - "10"
    - "--batch-size"
    - "4096"
```
(Not sure if it adds context, but I'm using a persistentVolumeClaim to download and persist the model.)
I have already reviewed the server README and all the command-line options, and I have also tested different server-cuda image tags from the past few days.
Based on this discussion, I understand that I have 10 slots for processing parallel requests, so I should be able to process 10 sequences of 10000 tokens each. The GPUs I'm using should be able to handle this load.
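For reference, this is the slot math as I understand it (my assumption of how the context window is split across slots, please correct me if this is wrong):

```
per-slot context = --ctx-size / --parallel = 100000 / 10 = 10000 tokens per slot
```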
With this configuration, I executed a test sending 5 concurrent requests of ~2300 tokens each. I understand this should be well below the maximum processable load, but the server closes the connection while it is still processing the tasks in the occupied slots (a simplified sketch of the test client is included after the list below). The process is the following:
- I send 5 concurrent requests to the server.
- The connection is closed without a response for some of the requests.
- I check the /health endpoint again and see that the slots are still processing.
- I check the server logs and see that all tasks finish successfully; there are no error messages.
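Roughly, the test looks like this (a simplified sketch, not my exact client; the service address, prompt, and n_predict value are placeholders):

```python
import concurrent.futures

import requests

BASE_URL = "http://<service-address>:8000"  # placeholder for the k8s service endpoint
PROMPT = "..."  # the real test uses ~2300-token prompts

def send_request(i: int) -> str:
    # POST to the server's /completion endpoint (see the server README)
    resp = requests.post(
        f"{BASE_URL}/completion",
        json={"prompt": PROMPT, "n_predict": 512},
        timeout=600,
    )
    resp.raise_for_status()
    return f"request {i}: HTTP {resp.status_code}"

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(send_request, i) for i in range(5)]
    for fut in concurrent.futures.as_completed(futures):
        try:
            print(fut.result())
        except requests.exceptions.RequestException as exc:
            # this is where I observe the closed connections
            print(f"request failed: {exc}")

# afterwards, /health still shows the slots as busy
print(requests.get(f"{BASE_URL}/health", timeout=10).text)
```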
I am trying to understand whether there is some additional configuration I'm missing, or how I can improve concurrency in these cases without having to handle connection errors from outside. (Additionally, when the connection gets closed, I cannot reprocess the requests immediately, since the server is still processing the previous ones.)