Max throughput for llama3.2 1B model #9680

Open · JunhaoWang opened this issue Oct 25, 2024 · 0 comments
JunhaoWang commented Oct 25, 2024

Looking for a benchmark that targets the high-throughput, small-LLM use case.

My use case does not care about latency, but it needs extremely high throughput at minimum price (a cheap T4 GPU, or even CPU only), i.e. serving 1 billion requests every week (avg 100k input tokens, 20k output tokens per request). I didn't find any benchmark targeting this type of use case; would love some pointers.
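For scale, 1B requests/week at ~120k total tokens each is on the order of 10^14 tokens per week, i.e. hundreds of millions of tokens per second sustained. The kind of offline throughput probe I have in mind looks roughly like the sketch below, using vLLM's Python API (model id, prompt shape, and batch size are placeholders, not an official benchmark); I understand the repo's benchmarks/benchmark_throughput.py script covers similar ground.

```python
# Rough offline throughput probe with vLLM's Python API.
# Model id, prompts, and batch size are illustrative assumptions only.
import time

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.2-1B-Instruct")  # assumed model id
sampling = SamplingParams(temperature=0.0, max_tokens=256)

# Synthetic batch; a real run should mirror the target ~100k-in/20k-out
# request shape (within the model's context window).
prompts = ["Summarize the following document: ..." for _ in range(512)]

start = time.perf_counter()
outputs = llm.generate(prompts, sampling)
elapsed = time.perf_counter() - start

# Count only generated tokens to get an output-token throughput figure.
generated = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"{generated} output tokens in {elapsed:.1f}s "
      f"-> {generated / elapsed:.0f} tok/s")
```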
