Description
Issue to be published in the llama.cpp GitHub repository:
I am using the Docker image ghcr.io/ggerganov/llama.cpp:server-cuda to deploy the server in a Kubernetes cluster on AWS with four A10G GPUs. This is the container configuration:
```yaml
- name: llama-cpp-server
  image: ghcr.io/ggerganov/llama.cpp:server-cuda
  args:
    - "--model"
    - "/models/mixtral-8x7b-instruct-v0.1.Q5_K_M.gguf"
    - "--port"
    - "8000"
    - "--host"
    - "0.0.0.0"
    - "--ctx-size"
    - "100000"
    - "--n-gpu-layers"
    - "256"
    - "--cont-batching"
    - "--parallel"
    - "10"
    - "--batch-size"
    - "4096"
```
(Not sure if it adds context, but I'm using a persistentVolumeClaim to download and persist the model.)
I have already reviewed the server README and all the command-line options, and I have also tested different server-cuda image tags from the past few days.
Based on this discussion, I understand that I have 10 slots for processing parallel requests, so I should be able to process 10 sequences of 10000 tokens each. The GPUs I'm using should be able to handle this load.
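For reference, this is the slot math as I understand it (my assumption of how the context window is split across slots, please correct me if this is wrong):

```
per-slot context = --ctx-size / --parallel = 100000 / 10 = 10000 tokens per slot
```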
With this configuration, I executed a test sending 5 concurrent requests of ~2300 tokens each. I understand this should be well below the maximum processable load, but the server closes the connection while it is still processing the tasks in the occupied slots (a simplified sketch of the test client is included after the list below). The process is the following:
- I send 5 concurrent requests to the server.
- The connection is closed without a response for some of the requests.
- I check the /health endpoint again and see that the slots are still processing.
- I check the server logs and see that all tasks finish successfully; there are no error messages.
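Roughly, the test looks like this (a simplified sketch, not my exact client; the service address, prompt, and n_predict value are placeholders):

```python
import concurrent.futures

import requests

BASE_URL = "http://<service-address>:8000"  # placeholder for the k8s service endpoint
PROMPT = "..."  # the real test uses ~2300-token prompts

def send_request(i: int) -> str:
    # POST to the server's /completion endpoint (see the server README)
    resp = requests.post(
        f"{BASE_URL}/completion",
        json={"prompt": PROMPT, "n_predict": 512},
        timeout=600,
    )
    resp.raise_for_status()
    return f"request {i}: HTTP {resp.status_code}"

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
    futures = [pool.submit(send_request, i) for i in range(5)]
    for fut in concurrent.futures.as_completed(futures):
        try:
            print(fut.result())
        except requests.exceptions.RequestException as exc:
            # this is where I observe the closed connections
            print(f"request failed: {exc}")

# afterwards, /health still shows the slots as busy
print(requests.get(f"{BASE_URL}/health", timeout=10).text)
```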
I am trying to understand whether there is some additional configuration I'm missing, or how I can improve concurrency in these cases without having to handle connection errors from outside. (Additionally, when the connection gets closed, I cannot reprocess the requests immediately, since the server is still processing the previous ones.)