
LLaMbA (Large Language Model Batching Application)

LLaMbA is a minimalistic cross-platform batching engine/server for LLMs, powered by ASP.NET Core and LLamaSharp.

The engine's goal is to serve multiple requests with small models as quickly as possible. It was built with its primary purposes in mind — Serving, Classifying, and Generating Synthetic Data — within a minimal and extensible environment.

Why is it fast

LLaMbA introduces quick and customizable ways to sample, made possible by .NET's System.Numerics.Tensors and threading. The out-of-the-box sampling is arguably not as extensive as llama.cpp's, but it serves its purposes nicely and is considerably faster (up to ~10x, with the gap growing as model size shrinks).
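
To illustrate the idea, here is a minimal sketch — not LLaMbA's actual sampler — of vectorized sampling over a logits array using the System.Numerics.Tensors package's TensorPrimitives; the temperature handling and inverse-CDF draw are illustrative.

```csharp
// Minimal sketch (NOT LLaMbA's sampler): greedy and temperature sampling
// over a logits vector using System.Numerics.Tensors (NuGet package, 8.0+).
using System;
using System.Numerics.Tensors;

static class SamplerSketch
{
    // Greedy sampling: pick the highest-logit token.
    public static int Greedy(ReadOnlySpan<float> logits)
        => TensorPrimitives.IndexOfMax(logits);

    // Temperature sampling: scale logits, softmax, then draw from the distribution.
    public static int Sample(ReadOnlySpan<float> logits, float temperature, Random rng)
    {
        var probs = new float[logits.Length];
        TensorPrimitives.Divide(logits, temperature, probs); // vectorized scale
        TensorPrimitives.SoftMax(probs, probs);              // vectorized softmax

        // Inverse-CDF draw over the resulting probabilities.
        float r = (float)rng.NextDouble(), cumulative = 0f;
        for (int i = 0; i < probs.Length; i++)
        {
            cumulative += probs[i];
            if (r <= cumulative) return i;
        }
        return probs.Length - 1;
    }
}
```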

In addition, it hosts a Python tokenizer and utilizes llama.cpp's token grouping features to shrink the batch: tokens that share the same position across multiple sequences are reused, so the model sees fewer tokens overall. This is especially useful when running multiple classifications of the same prompt, where most tokens are identical and only the classification instruction changes.
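
The following is a conceptual sketch of that grouping (not LLaMbA's internals): a prompt shared by several classification requests is added to the batch once, tagged with every sequence id, so the model only evaluates those positions a single time; only the per-classification suffixes are added per sequence.

```csharp
// Conceptual sketch of llama.cpp-style token grouping across sequences.
using System.Collections.Generic;
using System.Linq;

record BatchEntry(int Token, int Position, int[] SequenceIds);

static class SharedBatchSketch
{
    public static List<BatchEntry> Build(int[] sharedPrompt, int[][] perSequenceSuffixes)
    {
        var batch = new List<BatchEntry>();
        var allSeqs = Enumerable.Range(0, perSequenceSuffixes.Length).ToArray();

        // Shared prompt tokens: one entry per position, reused by all sequences.
        for (int pos = 0; pos < sharedPrompt.Length; pos++)
            batch.Add(new BatchEntry(sharedPrompt[pos], pos, allSeqs));

        // Per-classification suffixes: only these differ, so only these multiply.
        for (int seq = 0; seq < perSequenceSuffixes.Length; seq++)
            for (int i = 0; i < perSequenceSuffixes[seq].Length; i++)
                batch.Add(new BatchEntry(perSequenceSuffixes[seq][i], sharedPrompt.Length + i, new[] { seq }));

        return batch;
    }
}
```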

What it isn't

While LLaMbA contains a basic Web UI for chatting with the LLM, it wasn't built for rich features or single-user session efficiency; it was built with ease of testing in mind. The primary use of the Web UI is testing any changes you make, such as custom samplers or systems.

It also isn't an all-in-one & one-for-all deliverable; the user is expected to get hands-on and adjust parts of the code to their needs.

Who it's intended for

Anyone can use LLaMbA for Synthetic Data generation locally as-is, but for more advanced purposes like Serving or Classifying, the primary target audience is developers, who should add safeguards (e.g. auth, limits for max_tokens, moderation) and other systems to complement the backend and take advantage of the high speeds.
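
As one hypothetical example of such a safeguard (not part of LLaMbA), an ASP.NET Core middleware can reject requests that lack an API key before they reach the engine; the header name and configuration key below are illustrative.

```csharp
// Hypothetical safeguard sketch: reject requests without a valid API key.
var builder = WebApplication.CreateBuilder(args);
var app = builder.Build();

app.Use(async (context, next) =>
{
    // Assumption: the expected key is stored in configuration under "Llamba:ApiKey".
    var configuredKey = app.Configuration["Llamba:ApiKey"];
    if (!context.Request.Headers.TryGetValue("X-Api-Key", out var key) || key != configuredKey)
    {
        context.Response.StatusCode = StatusCodes.Status401Unauthorized;
        await context.Response.WriteAsync("Missing or invalid API key.");
        return;
    }
    await next();
});

app.Run();
```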

Developers are encouraged to experiment and customize the engine to their specs.

Requirements

  • CUDA 12 or the backend of your choice (CUDA11, CUDA12, Vulkan, OpenCL, Metal, CPU).
  • .NET 8 SDK. Necessary for building and running the project.
  • Python (+ packages). After installing Python, install the necessary packages:
pip install tokenizers uvicorn fastapi asyncio requests

Videos

The model used in the videos is LLama3.1-Instruct-8B-Q8, on a single RTX 4080, utilizing ~12GB of VRAM.

Batching Test (w/ flash attention)

About double the speed compared to using the llama.cpp sampler.
Llamba.Tests.mp4

WebUI - Made for Testing:

The Chat UI supports basic back-and-forth functionality and message editing/deleting.
Llamba.Chat.mp4
Batches sent with Completion mode are passed through without formatting, whereas Chat mode formats them to the model's prompt format.
Llamba.Batch.Short.mp4
It's easy and fast to guide a model into generating a specific JSON field from your specs.
Llamba.Batch.Json.mp4

General tips

Check out the General Guide and Example Usage for API usage examples and a quick code tour.

Context Size can be increased in Model.cs to further improve throughput. The default parameters are tuned for LLaMA3.1-8B-Q8 with ~12GB of VRAM.

Enabling Flash Attention will also increase generation throughput.
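
For orientation, this is the kind of parameter block typically adjusted in Model.cs. The property names assume a recent LLamaSharp version (ModelParams.ContextSize, GpuLayerCount, FlashAttention) and the path is a placeholder; verify both against the version the project references.

```csharp
// Illustrative sketch of model/context parameters (names may differ by LLamaSharp version).
using LLama.Common;

var parameters = new ModelParams("path/to/Llama-3.1-8B-Instruct-Q8_0.gguf")
{
    ContextSize    = 16384, // larger context -> more concurrent sequences per batch
    GpuLayerCount  = 99,    // offload all layers to the GPU when VRAM allows
    FlashAttention = true   // flash attention improves generation throughput
};
```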

Supported models

LLaMbA supports all language models currently supported by llama.cpp.
