I have a question about the inference data posted in this blog:
https://www.databricks.com/blog/introducing-dbrx-new-state-art-open-llm
For an MoE model with 36B activated parameters and 132B total parameters, inference performance should act like that of a ~90B dense model at 2,000 prompt tokens and 256 output tokens. How can it always perform better than the Llama2-70B dense model? As the batch size increases, it should outperform Llama2-70B at first, but then fall behind from around batch size 3 or 4, because more and more experts are activated until all 132B parameters have to be loaded.
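
Here is a rough back-of-the-envelope sketch of that reasoning, assuming uniform, independent top-4 routing over DBRX's 16 experts per MoE layer and a purely memory-bandwidth-bound decode step where cost scales with the distinct parameters loaded (both simplifications; real routing is not uniform and prefill is compute-bound):

```python
# Expected distinct parameters read per decode step for a DBRX-style MoE,
# compared against a 70B dense model, as batch size grows.
# Assumptions: uniform independent top-4 routing over 16 experts;
# decode cost ~ distinct parameter bytes loaded from memory.

TOTAL_PARAMS = 132e9   # DBRX total parameters
ACTIVE_PARAMS = 36e9   # DBRX activated parameters per token
NUM_EXPERTS = 16       # experts per MoE layer (per the DBRX blog)
TOP_K = 4              # experts selected per token (per the DBRX blog)
DENSE_PARAMS = 70e9    # Llama2-70B for comparison

# Split DBRX into shared (attention, embeddings) and expert parameters:
#   shared + TOP_K * per_expert        = ACTIVE_PARAMS
#   shared + NUM_EXPERTS * per_expert  = TOTAL_PARAMS
per_expert = (TOTAL_PARAMS - ACTIVE_PARAMS) / (NUM_EXPERTS - TOP_K)
shared = ACTIVE_PARAMS - TOP_K * per_expert

def expected_params_loaded(batch_size: int) -> float:
    """Expected distinct parameters touched in one decode step.

    P(a given expert is untouched by one token) = 1 - TOP_K / NUM_EXPERTS,
    so the expected fraction of experts activated by the whole batch is
    1 - (1 - TOP_K / NUM_EXPERTS) ** batch_size.
    """
    frac_active = 1.0 - (1.0 - TOP_K / NUM_EXPERTS) ** batch_size
    return shared + NUM_EXPERTS * per_expert * frac_active

for b in (1, 2, 3, 4, 8, 16):
    loaded = expected_params_loaded(b)
    print(f"batch={b:2d}: ~{loaded / 1e9:5.1f}B params loaded "
          f"({'<' if loaded < DENSE_PARAMS else '>'} 70B dense)")
```

Under these assumptions the expected load is ~36B at batch size 1, ~60B at batch size 2, and ~78B at batch size 3, i.e. it crosses the dense model's 70B right around batch size 3, which is exactly the crossover described above.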