Replies: 1 comment
No. To some degree, batched inference increases the total t/s your GPU can produce, but individual prompt speed drops sharply because of this bug: #10860. For efficiently serving 100 users in production, vLLM is closer to the 'production readiness' you're looking for.
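For reference, here is a minimal sketch of vLLM's offline batched-inference API, which is what handles many prompts at once with continuous batching. The model name, prompt set, and sampling settings are illustrative assumptions, not taken from this thread:

```python
# Minimal vLLM batched-inference sketch (assumes vLLM is installed).
# The model name below is a placeholder; use whichever Llama 3.2 checkpoint you serve.
from vllm import LLM, SamplingParams

prompts = [f"User {i}: summarize continuous batching in one sentence." for i in range(100)]
sampling = SamplingParams(temperature=0.7, max_tokens=128)

# vLLM schedules all 100 prompts with continuous batching internally,
# keeping aggregate throughput high while requests share the GPU.
llm = LLM(model="meta-llama/Llama-3.2-3B-Instruct")
outputs = llm.generate(prompts, sampling)

for out in outputs:
    print(out.outputs[0].text[:80])
```

For an actual multi-user deployment you would normally run vLLM's OpenAI-compatible server instead of this offline API, but the scheduling behavior is the same.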
Hi,
I need to build a custom computer to serve a Llama 3.2 model to 100 parallel users in good conditions with llama.cpp, under Windows or Linux.
What GPU do you use for that? A 5090 with 24 GB? What CPU (Intel Core i9 or AMD EPYC)? How much RAM (128 GB, 256 GB)?
Do you think that with all this, token generation speed will be enough for all 100 of my users?
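One way to sanity-check whether a given box keeps up with 100 concurrent users is a rough load test against whatever OpenAI-compatible endpoint you end up running (llama.cpp's llama-server and vLLM both expose one). Below is a minimal sketch; the base URL, port, and model name are placeholders rather than values from this thread:

```python
# Rough concurrency check: fire N simultaneous chat requests at a local
# OpenAI-compatible endpoint and measure wall-clock time and aggregate t/s.
# base_url, api_key, and model are assumptions; adjust to your server.
import asyncio
import time

from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

async def one_request(i: int) -> int:
    resp = await client.chat.completions.create(
        model="llama-3.2",  # placeholder; match whatever your server reports
        messages=[{"role": "user", "content": f"Hello from simulated user {i}"}],
        max_tokens=64,
    )
    return resp.usage.completion_tokens

async def main(n_users: int = 100) -> None:
    start = time.perf_counter()
    tokens = await asyncio.gather(*(one_request(i) for i in range(n_users)))
    elapsed = time.perf_counter() - start
    total = sum(tokens)
    print(f"{total} tokens across {n_users} users in {elapsed:.1f}s "
          f"({total / elapsed:.1f} aggregate t/s)")

asyncio.run(main())
```

Aggregate t/s tells you total capacity; dividing by the number of users gives a feel for per-user speed, which is the figure that suffers under heavy batching.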