Replies: 1 comment
No. To some degree, batched inference increases the total t/s your GPU can produce, but individual prompt speed drops sharply because of this bug: #10860. For efficiently serving 100 users in production, vLLM is closer to the 'production readiness' you're looking for.
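For reference, here is a minimal sketch of vLLM's offline batched-inference API, which is what handles many prompts at once with continuous batching. The model name, prompt set, and sampling settings are illustrative assumptions, not taken from this thread:

```python
# Minimal vLLM batched-inference sketch (assumes vLLM is installed).
# The model name below is a placeholder; use whichever Llama 3.2 checkpoint you serve.
from vllm import LLM, SamplingParams

prompts = [f"User {i}: summarize continuous batching in one sentence." for i in range(100)]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM schedules all 100 prompts with continuous batching internally,
# keeping aggregate throughput high while requests share the GPU.
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.outputs[0].text[:80])
```

For an actual multi-user deployment you would normally run vLLM's OpenAI-compatible server instead of this offline API, but the scheduling behavior is the same.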
Hi,
I need to build a custom computer to serve a Llama 3.2 model to 100 parallel users in good conditions with llama.cpp, under Windows or Linux.
What GPU do you use for that? A 5090 with 24 GB? What CPU (Intel Core i9 or AMD EPYC)? How much RAM (128 GB, 256 GB)?
Do you think that with all this, token generation speed will be enough for all 100 of my users?
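One way to sanity-check whether a given box keeps up with 100 concurrent users is a rough load test against whatever OpenAI-compatible endpoint you end up running (llama.cpp's llama-server and vLLM both expose one). Below is a minimal sketch; the base URL, port, and model name are placeholders rather than values from this thread:

```python
# Rough concurrency check: fire N simultaneous chat requests at a local
# OpenAI-compatible endpoint and measure wall-clock time and aggregate t/s.
# base_url, api_key, and model are assumptions; adjust to your server.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="llama-3.2",  # placeholder; match whatever your server reports
        messages=[{"role": "user", "content": f"Hello from simulated user {i}"}],
        max_tokens=64,
    )
    return resp.usage.completion_tokens

async def main(n_users: int = 100) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(n_users)))
    elapsed = time.perf_counter() - start
    total = sum(tokens)
    print(f"{total} tokens across {n_users} users in {elapsed:.1f}s "
          f"({total / elapsed:.1f} aggregate t/s)")

asyncio.run(main())
```

Aggregate t/s tells you total capacity; dividing by the number of users gives a feel for per-user speed, which is the figure that suffers under heavy batching.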