I have a question about the inference data posted in this blog:
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
For an MoE model with 36B activated parameters and 132B total parameters, inference performance should act like that of a ~90B dense model at 2,000 prompt tokens and 256 output tokens. How can it always perform better than the Llama2-70B dense model? As the batch size increases, it should outperform Llama2-70B at first, but then fall behind from around batch size 3 or 4, because more and more experts are activated until all 132B parameters have to be loaded.
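
Here is a rough back-of-the-envelope sketch of that reasoning, assuming uniform, independent top-4 routing over DBRX's 16 experts per MoE layer and a purely memory-bandwidth-bound decode step where cost scales with the distinct parameters loaded (both simplifications; real routing is not uniform and prefill is compute-bound):

```python
# Expected distinct parameters read per decode step for a DBRX-style MoE,
# compared against a 70B dense model, as batch size grows.
# Assumptions: uniform independent top-4 routing over 16 experts;
# decode cost ~ distinct parameter bytes loaded from memory.

TOTAL_PARAMS = 132e9   # DBRX total parameters
ACTIVE_PARAMS = 36e9   # DBRX activated parameters per token
NUM_EXPERTS = 16       # experts per MoE layer (per the DBRX blog)
TOP_K = 4              # experts selected per token (per the DBRX blog)
DENSE_PARAMS = 70e9    # Llama2-70B for comparison

# Split DBRX into shared (attention, embeddings) and expert parameters:
#   shared + TOP_K * per_expert        = ACTIVE_PARAMS
#   shared + NUM_EXPERTS * per_expert  = TOTAL_PARAMS
per_expert = (TOTAL_PARAMS - ACTIVE_PARAMS) / (NUM_EXPERTS - TOP_K)
shared = ACTIVE_PARAMS - TOP_K * per_expert

def expected_params_loaded(batch_size: int) -> float:
    """Expected distinct parameters touched in one decode step.

    P(a given expert is untouched by one token) = 1 - TOP_K / NUM_EXPERTS,
    so the expected fraction of experts activated by the whole batch is
    1 - (1 - TOP_K / NUM_EXPERTS) ** batch_size.
    """
    frac_active = 1.0 - (1.0 - TOP_K / NUM_EXPERTS) ** batch_size
    return shared + NUM_EXPERTS * per_expert * frac_active

for b in (1, 2, 3, 4, 8, 16):
    loaded = expected_params_loaded(b)
    print(f"batch={b:2d}: ~{loaded / 1e9:5.1f}B params loaded "
          f"({'<' if loaded < DENSE_PARAMS else '>'} 70B dense)")
```

Under these assumptions the expected load is ~36B at batch size 1, ~60B at batch size 2, and ~78B at batch size 3, i.e. it crosses the dense model's 70B right around batch size 3, which is exactly the crossover described above.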