
LLaMbA (Large Language Model Batching Application)

LLaMbA is a minimalistic cross-platform batching engine/server for LLMs, powered by ASP.NET Core and LLamaSharp.

The engine's goal is to serve multiple requests with small models as quickly as possible. It was built with its primary purposes in mind — Serving, Classifying, and Generating Synthetic Data — within a minimal and extensible environment.

Why is it fast

LLaMbA introduces quick and customizable ways to sample, made possible by .NET's System.Numerics.Tensors and threading. The out-of-the-box sampling is arguably not as extensive as llama.cpp's, but it serves its purposes nicely and is considerably faster (up to ~10x, with the gap growing as model size shrinks).
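
To illustrate the idea, here is a minimal sketch — not LLaMbA's actual sampler — of vectorized sampling over a logits array using the System.Numerics.Tensors package's TensorPrimitives; the temperature handling and inverse-CDF draw are illustrative.

```csharp
// Minimal sketch (NOT LLaMbA's sampler): greedy and temperature sampling
// over a logits vector using System.Numerics.Tensors (NuGet package, 8.0+).
using System;
using System.Numerics.Tensors;

static class SamplerSketch
{
    // Greedy sampling: pick the highest-logit token.
    public static int Greedy(ReadOnlySpan<float> logits)
        => TensorPrimitives.IndexOfMax(logits);

    // Temperature sampling: scale logits, softmax, then draw from the distribution.
    public static int Sample(ReadOnlySpan<float> logits, float temperature, Random rng)
    {
        var probs = new float[logits.Length];
        TensorPrimitives.Divide(logits, temperature, probs); // vectorized scale
        TensorPrimitives.SoftMax(probs, probs);              // vectorized softmax

        // Inverse-CDF draw over the resulting probabilities.
        float r = (float)rng.NextDouble(), cumulative = 0f;
        for (int i = 0; i < probs.Length; i++)
        {
            cumulative += probs[i];
            if (r <= cumulative) return i;
        }
        return probs.Length - 1;
    }
}
```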

In addition, it hosts a Python tokenizer and utilizes llama.cpp's token grouping features to shrink the batch: tokens that share the same position across multiple sequences are reused, so the model sees fewer tokens overall. This is especially useful when running multiple classifications of the same prompt, where most tokens are identical and only the classification instruction changes.
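
The following is a conceptual sketch of that grouping (not LLaMbA's internals): a prompt shared by several classification requests is added to the batch once, tagged with every sequence id, so the model only evaluates those positions a single time; only the per-classification suffixes are added per sequence.

```csharp
// Conceptual sketch of llama.cpp-style token grouping across sequences.
using System.Collections.Generic;
using System.Linq;

record BatchEntry(int Token, int Position, int[] SequenceIds);

static class SharedBatchSketch
{
    public static List<BatchEntry> Build(int[] sharedPrompt, int[][] perSequenceSuffixes)
    {
        var batch = new List<BatchEntry>();
        var allSeqs = Enumerable.Range(0, perSequenceSuffixes.Length).ToArray();

        // Shared prompt tokens: one entry per position, reused by all sequences.
        for (int pos = 0; pos < sharedPrompt.Length; pos++)
            batch.Add(new BatchEntry(sharedPrompt[pos], pos, allSeqs));

        // Per-classification suffixes: only these differ, so only these multiply.
        for (int seq = 0; seq < perSequenceSuffixes.Length; seq++)
            for (int i = 0; i < perSequenceSuffixes[seq].Length; i++)
                batch.Add(new BatchEntry(perSequenceSuffixes[seq][i], sharedPrompt.Length + i, new[] { seq }));

        return batch;
    }
}
```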

What it isn't

While LLaMbA contains a basic Web UI for chatting with the LLM, it wasn't built for rich features or single-user session efficiency; it was built with ease of testing in mind. The primary use of the Web UI is testing any changes you make, such as custom samplers or systems.

It also isn't an all-in-one & one-for-all deliverable; the user is expected to get hands-on and adjust parts of the code to their needs.

Who it's intended for

Anyone can use LLaMbA for Synthetic Data generation locally as-is, but for more advanced purposes like Serving or Classifying, the primary target audience is developers, who should add safeguards (e.g. auth, limits for max_tokens, moderation) and other systems to complement the backend and take advantage of the high speeds.
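
As one hypothetical example of such a safeguard (not part of LLaMbA), an ASP.NET Core middleware can reject requests that lack an API key before they reach the engine; the header name and configuration key below are illustrative.

```csharp
// Hypothetical safeguard sketch: reject requests without a valid API key.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    // Assumption: the expected key is stored in configuration under "Llamba:ApiKey".
    var configuredKey = app.Configuration["Llamba:ApiKey"];
    if (!context.Request.Headers.TryGetValue("X-Api-Key", out var key) || key != configuredKey)
    {
        context.Response.StatusCode = StatusCodes.Status401Unauthorized;
        await context.Response.WriteAsync("Missing or invalid API key.");
        return;
    }
    await next();
});

app.Run();
```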

Developers are encouraged to experiment and customize the engine to their specs.

Requirements

  • CUDA 12 or the backend of your choice (CUDA11, CUDA12, Vulkan, OpenCL, Metal, CPU).
  • .NET 8 SDK. Necessary for building and running the project.
  • Python (+ packages). After installing Python, install the necessary packages:
pip install tokenizers uvicorn fastapi asyncio requests

Videos

The model used in the videos is LLama3.1-Instruct-8B-Q8, on a single RTX 4080, utilizing ~12GB of VRAM.

Batching Test (w/ flash attention)

About double the speed compared to using the llama.cpp sampler.
Llamba.Tests.mp4

WebUI - Made for Testing:

The Chat UI supports basic back-and-forth functionality and message editing/deleting.
Llamba.Chat.mp4
Batches sent with Completion mode are passed through without formatting, whereas Chat mode formats them to the model's prompt format.
Llamba.Batch.Short.mp4
It's easy and fast to guide a model into generating a specific JSON field from your specs.
Llamba.Batch.Json.mp4

General tips

Check out the General Guide and Example Usage for API usage examples and a quick code tour.

Context Size can be increased in Model.cs to further improve throughput. The default parameters are tuned for LLaMA3.1-8B-Q8 with ~12GB of VRAM.

Enabling Flash Attention will also increase generation throughput.
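
For orientation, this is the kind of parameter block typically adjusted in Model.cs. The property names assume a recent LLamaSharp version (ModelParams.ContextSize, GpuLayerCount, FlashAttention) and the path is a placeholder; verify both against the version the project references.

```csharp
// Illustrative sketch of model/context parameters (names may differ by LLamaSharp version).
using LLama.Common;

var parameters = new ModelParams("path/to/Llama-3.1-8B-Instruct-Q8_0.gguf")
{
    ContextSize    = 16384, // larger context -> more concurrent sequences per batch
    GpuLayerCount  = 99,    // offload all layers to the GPU when VRAM allows
    FlashAttention = true   // flash attention improves generation throughput
};
```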

Supported models

LLaMbA supports all language models currently supported by llama.cpp.
