This repository contains a high-performance CUDA implementation of BitNet, a novel neural network architecture that uses extreme quantization (1.58-bit weights) for efficient inference and training.
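The core idea behind the 1.58-bit format is ternary weights: each weight is quantized to one of {-1, 0, +1} together with a per-tensor scale. The sketch below illustrates the absmean quantization scheme described in the BitNet b1.58 paper, on the host for clarity; it is a simplified illustration, not this repository's optimized device kernel:

```cpp
#include <math.h>
#include <stdio.h>

// Simplified host-side sketch of ternary (1.58-bit) weight quantization,
// following the absmean scheme from the BitNet b1.58 paper. Illustrative
// only; the repository's kernels are optimized device-side implementations.
void quantize_ternary(const float* w, signed char* q, float* scale, int n) {
    // Scale gamma = mean absolute value of the weights.
    float gamma = 0.0f;
    for (int i = 0; i < n; ++i) gamma += fabsf(w[i]);
    gamma /= n;
    *scale = gamma;

    // Round w / gamma to the nearest value and clip to {-1, 0, +1}.
    for (int i = 0; i < n; ++i) {
        float v = roundf(w[i] / (gamma + 1e-6f));
        q[i] = (signed char)fmaxf(-1.0f, fminf(1.0f, v));
    }
}

int main() {
    float w[6] = {0.8f, -0.05f, -1.2f, 0.3f, -0.6f, 0.02f};
    signed char q[6];
    float scale;
    quantize_ternary(w, q, &scale, 6);
    for (int i = 0; i < 6; ++i) printf("%d ", (int)q[i]);
    printf("(scale = %.3f)\n", scale);  // prints: 1 0 -1 1 -1 0 (scale = 0.495)
    return 0;
}
```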
## Table of Contents

- Prerequisites
- Installation
- Project Structure
- Building the Project
- Running Tests
- Profiling
- PyTorch Integration
- Usage
- Multi-GPU Support
- Configuration Options
- Contributing
- License
## Prerequisites

- CUDA Toolkit 11.0 or later
- CMake 3.18 or later
- GCC 7.0 or later
- Python 3.6 or later (for PyTorch integration)
- PyTorch 1.7.0 or later (for PyTorch integration)
- NVIDIA GPU with compute capability 7.0 or higher
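To verify that your GPU meets the compute capability requirement, a small standalone program can query the CUDA runtime. This is a convenience sketch (the file name `check_gpu.cu` is hypothetical, not part of this repository):

```cpp
// check_gpu.cu -- hypothetical helper, not part of this repository.
// Lists all visible GPUs and flags any below compute capability 7.0.
#include <stdio.h>
#include <cuda_runtime.h>

int main() {
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int i = 0; i < count; ++i) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("GPU %d: %s (compute %d.%d) %s\n", i, prop.name,
               prop.major, prop.minor,
               prop.major >= 7 ? "OK" : "below required 7.0");
    }
    return 0;
}
```

Compile and run with `nvcc check_gpu.cu -o check_gpu && ./check_gpu`.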
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/yourusername/bitnet-cuda.git
   cd bitnet-cuda
   ```

2. Install the required Python packages:

   ```bash
   pip install -r requirements.txt
   ```

3. Install the CUDA Toolkit by following the NVIDIA CUDA Installation Guide for your operating system.
## Project Structure

```
bitnet-cuda/
├── include/
│   ├── bitnet_cuda.h
│   ├── memory_manager.h
│   ├── auto_tuning.h
│   ├── bitnet_types.h
│   ├── memory_management.h
│   ├── multi_gpu.h
│   ├── pytorch_extension.h
│   ├── utils.h
│   └── kernels.h
├── src/
│   ├── bitnet_cuda.cu
│   ├── memory_management.cu
│   ├── multi_gpu.cu
│   ├── auto_tuner.cu
│   ├── utils.cu
│   ├── kernels.cu
│   └── kernels/
│       ├── matmul_kernels.cu
│       ├── normalization_kernels.cu
│       ├── linear_kernels.cu
│       ├── persistent_kernels.cu
│       ├── quantization_kernels.cu
│       └── activation_kernels.cu
├── tests/
│   ├── CMakeLists.txt
│   ├── test_bitnet_cuda.cu
│   ├── test_memory_manager.cu
│   └── test_kernels.cu
├── pytorch_extension/
│   ├── setup.py
│   └── bitnet_cuda.cpp
├── benchmark/
├── examples/
├── docs/
├── TestGuide.md
├── CMakeLists.txt
├── requirements.txt
└── README.md
```
## Building the Project

1. Create a build directory:

   ```bash
   mkdir build && cd build
   ```

2. Configure the project with CMake:

   ```bash
   cmake ..
   ```

3. Build the project:

   ```bash
   make -j$(nproc)
   ```

This builds the main BitNet CUDA library and the test executables.
## Running Tests

After building the project, run the full test suite with CTest:

```bash
ctest --verbose
```

Or run individual test executables:

```bash
./tests/test_bitnet_cuda
./tests/test_memory_manager
./tests/test_kernels
```
## Profiling

To profile the BitNet CUDA implementation, use NVIDIA's Nsight Compute and Nsight Systems tools.

1. Profile with Nsight Compute:

   ```bash
   ncu --set full ./tests/test_bitnet_cuda
   ```

2. Profile with Nsight Systems:

   ```bash
   nsys profile --stats=true ./tests/test_bitnet_cuda
   ```

Analyze the results using the Nsight Compute and Nsight Systems GUI tools.
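For quick, profiler-free timing of individual kernels in your own experiments, CUDA events can be used directly. The sketch below is illustrative only; `my_kernel` is a hypothetical placeholder, not a kernel from this repository:

```cpp
#include <stdio.h>
#include <cuda_runtime.h>

// Hypothetical placeholder; substitute any kernel launch you want to time.
__global__ void my_kernel() {}

int main() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    my_kernel<<<1, 1>>>();       // kernel launch being timed
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return 0;
}
```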
## PyTorch Integration

To build the PyTorch extension:

1. Navigate to the `pytorch_extension` directory:

   ```bash
   cd pytorch_extension
   ```

2. Build and install the extension:

   ```bash
   python setup.py install
   ```

3. Test the PyTorch extension:

   ```bash
   python -m unittest discover tests
   ```
## Usage

Here's a basic example of how to use the BitNet CUDA implementation:

```cpp
#include "bitnet_cuda.h"

int main() {
    // Example dimensions (seq_length is illustrative; the others
    // match the benchmark defaults listed under Configuration Options)
    const int batch_size = 32;
    const int seq_length = 512;
    const int hidden_size = 768;
    const int num_layers = 12;

    BitNetCUDA bitnet(2);  // Create a BitNetCUDA instance using 2 GPUs

    // Allocate model parameters
    bitnet.allocate_model_parameters(batch_size, seq_length, hidden_size, num_layers);
    // Tune kernels (optional)
    bitnet.tune_kernels();
    // Create the computation graph
    bitnet.create_computation_graph();
    // Run a forward pass
    bitnet.run_computation_graph();

    return 0;
}
```
For PyTorch integration:

```python
import torch
from bitnet_cuda import BitNetCUDAFunction

# Example dimensions
batch_size, seq_length, hidden_size = 32, 512, 768

# Create an input tensor on the GPU
input_tensor = torch.randn(batch_size, seq_length, hidden_size, device='cuda')

# Create the BitNet layer
bitnet_layer = BitNetCUDAFunction.apply

# Run a forward pass
output = bitnet_layer(input_tensor)
```
## Multi-GPU Support

BitNet CUDA supports multi-GPU execution for large models and datasets. To use multiple GPUs:

1. Specify the number of GPUs when creating the BitNetCUDA instance (see the device-count sketch after this list):

   ```cpp
   BitNetCUDA bitnet(num_gpus);
   ```

2. The implementation will automatically distribute the workload across the available GPUs.

3. For PyTorch integration, use `nn.DataParallel` or `nn.DistributedDataParallel` as usual.
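Before constructing the instance, it can be useful to verify how many GPUs are actually visible to the process. A minimal sketch using the CUDA runtime API (the clamp-to-available policy here is an assumption for illustration, not the library's documented behavior):

```cpp
#include <algorithm>
#include <stdio.h>
#include <cuda_runtime.h>
#include "bitnet_cuda.h"

int main() {
    int available = 0;
    cudaGetDeviceCount(&available);  // GPUs visible to this process

    int requested = 2;
    // Assumption: clamp to what is available rather than failing outright.
    int num_gpus = std::min(requested, available);
    printf("Using %d of %d available GPU(s)\n", num_gpus, available);

    BitNetCUDA bitnet(num_gpus);
    return 0;
}
```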
## Configuration Options

BitNet CUDA supports various configuration options:

- `--num-gpus`: Number of GPUs to use (default: 1)
- `--batch-size`: Batch size for training/inference (default: 32)
- `--hidden-size`: Hidden size of the model (default: 768)
- `--num-layers`: Number of layers in the model (default: 12)
- `--quantization-bits`: Number of bits for quantization (default: 1.58)
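For example, to benchmark a larger model on two GPUs (the specific values here are illustrative):

```bash
./bitnet_benchmark --num-gpus 2 --batch-size 64 --hidden-size 1024 --num-layers 24
```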
For a complete list of options, run:

```bash
./bitnet_benchmark --help
```
## Contributing

- Fork the repository
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request

## License

This project is licensed under the MIT License - see the LICENSE file for details.