By Habibullah Akbar.
Key features:
- Seamless integration with a vision encoder, along with selective RoPE for each image and text embedding sequence.
- Internal iteration, enabling deeper abstraction while keeping the same parameter count.
- GeGLU activation function, inspired by the Gemma 2 models (see the sketch after this list).
- Custom KV-caching, ensuring each internal iteration has an independent KV cache.
- BPE tokenizer based on KBBI (Kamus Besar Bahasa Indonesia).
- Grouped Query Attention.
- PyTorch Lightning implementation.
- DeepSpeed ZeRO-3 integration, automatically offloading memory overflow to CPU and NVMe.
- Example fine-tuning scripts with LoRA adapters, with and without quantization.
- BitNet implementation.
- Flash Attention implementation.
- Speech encoder.
- 2D and 3D RoPE.
- Diffusion Transformer for image detokenization.
- Influential-token extraction from attention heatmaps.
- Jupyter notebook examples for both training and fine-tuning.
- Dual license: open source for individuals, paid for commercial use.
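As a quick illustration, here is a minimal PyTorch sketch of the GeGLU feed-forward block mentioned above. This is not the repository's actual code; the module and layer names (`GeGLU`, `gate_proj`, `up_proj`, `down_proj`) and the exact GELU variant are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GeGLU(nn.Module):
    """GeGLU feed-forward block: a GELU-gated linear unit, as used in Gemma 2."""

    def __init__(self, dim: int, hidden_dim: int):
        super().__init__()
        self.gate_proj = nn.Linear(dim, hidden_dim, bias=False)  # gating branch
        self.up_proj = nn.Linear(dim, hidden_dim, bias=False)    # value branch
        self.down_proj = nn.Linear(hidden_dim, dim, bias=False)  # back to model dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # GeGLU(x) = (GELU(x W_gate) * (x W_up)) W_down
        return self.down_proj(F.gelu(self.gate_proj(x)) * self.up_proj(x))
```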
This is an iterable Transformer model that can rethink its internal cognitive process, guided by an internal confidence score, akin to a slow-thinking mechanism. Here is a simple explanation of how it works:
- An adjustable parameter controls the internal looping; its default value is 1.
- If the loss value is high, extra iterations are triggered, with the maximum number of iterations set to 10.
- An independent layer is trained to output a confidence score, supervised by the loss value from the main training process.
- At inference time, the model outputs both the next token and a confidence score, and the confidence score determines how many iterations the current step needs (see the sketch below).
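Below is a minimal, hypothetical PyTorch sketch of this confidence-guided internal loop, not the repository's actual implementation: the module name `ConfidenceGuidedLoop`, the `confidence_head` layer, and the 0.5 stopping threshold are illustrative assumptions.

```python
import torch
import torch.nn as nn

class ConfidenceGuidedLoop(nn.Module):
    """Sketch of the internal-iteration mechanism: the same block is re-applied
    up to `max_iterations` times, and a separately trained confidence head
    decides when to stop. All names and thresholds here are illustrative."""

    def __init__(self, block: nn.Module, dim: int, max_iterations: int = 10):
        super().__init__()
        self.block = block                        # shared transformer block (weights reused)
        self.max_iterations = max_iterations
        self.confidence_head = nn.Linear(dim, 1)  # independently trained confidence layer

    def forward(self, hidden: torch.Tensor, num_iterations: int = 1):
        confidence = hidden.new_ones(hidden.size(0), 1)
        for step in range(self.max_iterations):
            # Re-applying the same parameters deepens the computation without
            # increasing the parameter count; a full implementation would also
            # keep an independent KV cache per internal iteration.
            hidden = self.block(hidden)
            confidence = torch.sigmoid(self.confidence_head(hidden[:, -1]))
            # Stop once the requested iterations are done and the model is
            # confident enough about the current prediction.
            if step + 1 >= num_iterations and confidence.mean().item() > 0.5:
                break
        return hidden, confidence

# Usage with a stand-in block (for illustration only):
block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
loop = ConfidenceGuidedLoop(block, dim=64)
hidden, confidence = loop(torch.randn(2, 16, 64), num_iterations=2)
```

During training, the confidence head would be supervised by the main loss, so a low confidence score at inference time signals that another internal iteration is worthwhile.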
YouTube progress documentation playlist:
- First short brief (27 July 2024): https://youtu.be/NjK1BJyhrlI
Soon:
- Short-term memory injection.
- SageAttention implementation.
- Speech generation integration.
- Discrete Latent Representation.
- Grokfast.
- Mamba2 block (?).
- Kolmogorov-Arnold Network (KAN).
- Mixture of Experts block.
- Fast object detection integration, possibly YOLO or RT-DETR.
- OCR model integration.
- MInference.
- Pre-trained model integration, possibly Gemma 2, since it uses the same activation function.
- Citation to all of the papers used as references or inspirations.
LICENSE UPDATE: This software is dual-licensed under the terms of the GNU Affero General Public License (AGPL) and a commercial license. For commercial use, please contact Habibullah Akbar at akbar2habibullah@gmail.com to obtain a commercial license. Commercial use is defined as any use of the software for financial gain, including, but not limited to, selling, licensing, or distributing the software as part of a product or service.