This section covers the tokenizer, architecture, and training details of our model.
The motivation is to build a tokenizer that achieves a higher compression rate on Chinese, English, and code data, and covers 100% of Chinese characters. Since the InternLM tokenizer has the best compression rate on our test set, we modify it with the following improvements:
- Train our own tokenizer that covers 100% of common Chinese characters
- Calibrate the scores of our tokenizer based on the InternLM tokenizer
- Supplement tokens that appear in our trained tokenizer but not in the InternLM tokenizer, such as some common Chinese characters and user-defined symbols
- Convert it into the Llama tokenizer format
You can find more details here about how we build our tokenizer. Our tokenizer is available here; it has 105,789 tokens and covers 100% of Chinese characters.
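As a quick sanity check, the tokenizer can be loaded directly with HuggingFace transformers; the snippet below is illustrative and uses the same repository id as the usage example at the end of this document.

```python
# Illustrative quick check: load our tokenizer from the HuggingFace Hub and
# inspect its size and how it splits a Chinese sentence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('DataHammer/hammerllm-1.4b-222k', use_fast=False)
print(len(tokenizer))  # expected: 105789
# common Chinese characters should map to whole tokens rather than raw bytes
print(tokenizer.tokenize('北京理工大学是一所位于北京的大学。'))
```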
To demonstrate the effectiveness of our tokenizer, we compare it with the tokenizers of several well-known open-source LLMs using the following metrics:
- Compression Rate: we compare two kinds of compression rate against the tokenizers of several open-source LLMs (see the sketch after this list):
  - Byte-per-token compression rate
  - Compared compression rate, which measures the advantage over the base Llama-2-7B tokenizer
Please refer to this Python script for more details. We evaluate tokenizers on Chinese, English, and code test sets to compute the compression rate; the data sources are listed as follows:
- Chinese: Skywork/ChineseDomainModelingEval
- English: test set of EleutherAI/pile
- Code: split from Pile-GitHub
These test data are publicly available at this link.
- Chinese Character Coverage: a good Chinese LLM should cover as many Chinese characters as possible. In our work, we leverage vocab-coverage to compute the coverage of Chinese characters, which includes:
  - First-level Chinese Character Coverage (FCC): 3,500 widely used Chinese characters
  - Second-level Chinese Character Coverage (SCC): 3,000 Chinese characters
  - Third-level Chinese Character Coverage (TCC): 1,605 uncommon Chinese characters
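As a rough illustration of how the two compression metrics relate, here is a minimal sketch assuming HuggingFace tokenizers and a list of plain-text documents; the Python script referenced above is the authoritative implementation.

```python
# Minimal sketch of the two compression metrics; the official numbers are
# produced by the evaluation script referenced above.
from transformers import AutoTokenizer

def byte_per_token(tokenizer, texts):
    # UTF-8 bytes of the corpus divided by the number of tokens it is split into
    # (higher means better compression).
    n_bytes = sum(len(t.encode('utf-8')) for t in texts)
    n_tokens = sum(len(tokenizer(t, add_special_tokens=False).input_ids) for t in texts)
    return n_bytes / n_tokens

texts = ['北京理工大学是一所位于北京的大学。', 'def add(a, b):\n    return a + b']
ours = AutoTokenizer.from_pretrained('DataHammer/hammerllm-1.4b-222k', use_fast=False)
llama2 = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')  # baseline (gated repo; any Llama-2 tokenizer works)

bpt_ours, bpt_llama2 = byte_per_token(ours, texts), byte_per_token(llama2, texts)
# Compared compression = Llama-2-7B byte-per-token / candidate byte-per-token,
# i.e. the fraction of tokens needed relative to Llama-2 (lower is better);
# e.g. 2.2755 / 4.3143 ≈ 0.5274 in the table below.
print(bpt_ours, bpt_llama2 / bpt_ours)
```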
The experimental results are shown as follows:
| Tokenizer | FCC | SCC | TCC | Byte per Token | Compared Compression |
|---|---|---|---|---|---|
| Chatglm-6b | 99.97% | 57.47% | 2.99% | 4.2911 | 0.5303 |
| Chatglm2/3-6b | 100.00% | 77.83% | 13.89% | 4.0329 | 0.5642 |
| Baichuan2-7b/14b | 100.00% | 99.8% | 86.48% | 4.1827 | 0.5440 |
| Internlm-7b/20b | 100.00% | 65.93% | 5.67% | 4.3133 | 0.5276 |
| Qwen-7b/14b/72b | 100.00% | 100.00% | 100.00% | 4.1326 | 0.5506 |
| Llama-2-7b/13b/70b | 17.29% | 0.13% | 0.00% | 2.2755 | 1.00 |
| Ours | 100.00% | 100.00% | 100.00% | 4.3143 | 0.5274 |
The experimental results demonstrate the advantages of our tokenizer over the tokenizers of existing popular LLMs in both compression rate (on Chinese, English, and code) and Chinese character coverage.
Our model leverages an advanced architecture designed to maximize efficiency and effectiveness in processing and generating text. Here are the key specifications:
| Setting | Value |
|---|---|
| parameters | 1.4B |
| attention | Grouped-Query Attention (GQA) |
| num layers | 22 |
| num attention heads | 32 |
| query groups | 4 |
| hidden size | 2048 |
| intermediate size | 5632 |
| activation func | SiLU |
| max sequence length | 2048 |
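For reference, these settings correspond to a Llama-style configuration in HuggingFace transformers; the sketch below is illustrative, and the config files under configs/hammerllm in this repository are authoritative.

```python
# A sketch of the architecture expressed as a Llama-style config; the files in
# configs/hammerllm are the authoritative source.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=105789,             # size of our tokenizer
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=22,
    num_attention_heads=32,
    num_key_value_heads=4,         # 4 query groups -> grouped-query attention
    hidden_act='silu',
    max_position_embeddings=2048,  # max sequence length
)
print(config)
```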
Our training process was meticulously planned and executed to ensure the model's robustness and efficiency:
| Setting | Value |
|---|---|
| block size | 2048 |
| per-device batch size | 8 |
| gradient accumulation steps | 16 |
| num devices | 8 |
| total batch size | 2M tokens |
| max learning rate | 5e-4 |
| min learning rate | 5e-5 |
| warmup steps | 2000 |
| learning rate schedule | cosine |
We have modified the learning rate schedule, and more details can be found here.
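For clarity, the 2M-token total batch size follows directly from the other settings in the table:

```python
# Total batch size per optimizer step, derived from the settings above.
num_devices = 8
per_device_batch_size = 8
gradient_accumulation_steps = 16
block_size = 2048

tokens_per_step = num_devices * per_device_batch_size * gradient_accumulation_steps * block_size
print(tokens_per_step)  # 2,097,152 tokens ≈ 2M
```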
Our model was trained on a meticulously curated selection of Chinese, English and Coding datasets, designed to foster wide-ranging language understanding:
| Dataset | Split | Tokens (Billion) | Domain |
|---|---|---|---|
| ChineseWebText | Chinese | 142 | Chinese |
| RefinedWeb | English | 128 | English |
| Pile-arXiv | English | 38 | English |
| Pile-Wiki | English | 12 | English |
| Pile-GitHub | Code | 30 | Coding |
With the help of Flash Attention 2 and torch.compile, we achieve a high training throughput of 16k tokens per GPU per second. The settings for computing the throughput are:
- ZeRO-1
- block_size: 2048
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 16
| Settings | Tokens per GPU per second |
|---|---|
| None | CUDA OOM |
| Flash Attention 2 | 13k |
| torch.compile | CUDA OOM |
| Flash Attention 2 + torch.compile | 16k |
To cover more Chinese characters, our tokenizer is much larger than TinyLlama's (105,789 vs. 32,000 tokens), which leads to a lower throughput than TinyLlama's (16k vs. 24k). However, our throughput is comparable to TinyLlama's when the tokenizer size is the same (with per_device_train_batch_size set to 20 for this comparison):
| Settings | Tokens per GPU per second |
|---|---|
| Flash Attention 2 + torch.compile | 24k |
Unlike TinyLlama, which leverages some complex operator fusion, we achieve this throughput solely with torch.compile and Flash Attention 2.
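For reference, here is a hedged sketch of how these two optimizations are typically enabled with HuggingFace transformers; our train.py exposes them via the --use_flash_attention_2 and --use_torch_compile flags shown in the script below, and the exact wiring there may differ.

```python
# Sketch only: how Flash Attention 2 and torch.compile are commonly enabled;
# requires a recent transformers version and the flash-attn package.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('configs/hammerllm')   # config path from train.sh
model = AutoModelForCausalLM.from_config(
    config,
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',  # Flash Attention 2 kernels
)
model = torch.compile(model)  # graph capture + kernel fusion via TorchInductor
```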
Our pre-training runs on an 8x80GB A100 server with 1TB of CPU memory. You can train your model with fewer GPUs and less memory by using a smaller batch size.
We have prepared a Docker environment for pre-training, which already incorporates Flash Attention 2 and torch.compile for efficient pre-training. The Dockerfile is here.
Please refer to our data preparation guide for more details on how we pre-process the pre-training dataset, such as dataset reformatting and sufficient shuffling.
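If you are preparing your own data, the sketch below shows the general idea of tokenizing, packing into 2048-token blocks, and shuffling; it is not our actual preprocessing pipeline, and the file paths and the text column name are placeholders.

```python
# Minimal sketch of producing a tokenized, block-packed, shuffled dataset for
# train.py; see the data preparation guide for the actual pipeline.
from datasets import load_dataset
from transformers import AutoTokenizer

block_size = 2048
tokenizer = AutoTokenizer.from_pretrained('DataHammer/hammerllm-1.4b-222k', use_fast=False)
raw = load_dataset('json', data_files='/path/to/raw/corpus.jsonl', split='train')

def tokenize(batch):
    # Tokenize each document and append an EOS token as a document separator.
    return {'input_ids': [tokenizer(t, add_special_tokens=False).input_ids + [tokenizer.eos_token_id]
                          for t in batch['text']]}

def group(batch):
    # Concatenate documents and cut the token stream into fixed-size blocks.
    ids = [tok for doc in batch['input_ids'] for tok in doc]
    total = (len(ids) // block_size) * block_size
    return {'input_ids': [ids[i:i + block_size] for i in range(0, total, block_size)]}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names, num_proc=100)
packed = tokenized.map(group, batched=True, num_proc=100)
packed = packed.shuffle(seed=3407)  # sufficient shuffle; seed matches DATA_SEED in train.sh
packed.save_to_disk('/path/to/your/tokenized/train/dataset')
```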
Start Training (train.sh)
Once you have prepared the running environment and the data, you could launch the pre-training process by:
```bash
set -ex
export WANDB_PROJECT=hammerllm
BASE_DIR="$PWD"
DATE=$(TZ=Asia/Shanghai date +'%Y%m%d%H%M%S')
CONFIG_PATH=${BASE_DIR}/configs/hammerllm
RUN_NAME=hammerllm_torch_compile_flash_attn_2
OUTPUT_DIR=${BASE_DIR}/checkpoint/${RUN_NAME}
DATA_SEED=3407
MODEL_SEED=3407
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export TOKENIZERS_PARALLELISM=false
export WANDB_MODE=online
if [ ! -d ${OUTPUT_DIR} ]
then
mkdir -p ${OUTPUT_DIR}
fi
echo "Setting checkpoint directory to ${OUTPUT_DIR}"
MASTER_PORT=$(shuf -n 1 -i 60000-65535)
torchrun --nproc_per_node=8 --master_port ${MASTER_PORT} train.py \
--model_name_or_path ${CONFIG_PATH} \
--use_flash_attention_2 \
--use_torch_compile \
--train_file /path/to/your/tokenized/train/dataset \
--validation_files /path/to/your/tokenized/validation/dataset_1 /path/to/your/tokenized/validation/dataset_2 ... \
--preprocessing_num_workers 100 \
--block_size 2048 \
--do_train \
--do_eval \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 16 \
--logging_steps 10 \
--max_steps 1000000 \
--warmup_steps 2000 \
--eval_steps 500 \
--save_steps 500 \
--evaluation_strategy steps \
--save_strategy steps \
--greater_is_better false \
--load_best_model_at_end false \
--ddp_find_unused_parameters false \
--remove_unused_columns false \
--save_total_limit 50 \
--learning_rate 5e-4 \
--lr_scheduler_type cosine \
--output_dir ${OUTPUT_DIR} \
--report_to wandb \
--run_name ${RUN_NAME} \
--bf16 \
--seed ${MODEL_SEED} \
--data_seed ${DATA_SEED} \
--deepspeed ${BASE_DIR}/configs/zero_1.json
```
This is the launching shell script for pre-training; details of our pre-training codebase can be found here.
The saved checkpoints contain some unusual key names introduced by torch.compile (typically an `_orig_mod.` prefix), which prevents the model parameters from being loaded directly. We provide a script here that calibrates these key names:
```bash
python convert_checkpoint.py --input-path <path of saved checkpoint> --output-path <path of converted transformers checkpoint>
```
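Under the hood, the conversion amounts to stripping the prefix that torch.compile adds to parameter names, roughly as sketched below; the provided convert_checkpoint.py is the authoritative implementation, and the paths here are placeholders.

```python
# Sketch of the key-name calibration: torch.compile wraps the model in an
# OptimizedModule, so saved state_dict keys carry an "_orig_mod." prefix that a
# plain from_pretrained load does not expect.
import torch

def strip_compile_prefix(state_dict):
    return {k.removeprefix('_orig_mod.'): v for k, v in state_dict.items()}

state = torch.load('/path/of/saved/checkpoint/pytorch_model.bin', map_location='cpu')
torch.save(strip_compile_prefix(state), '/path/of/converted/checkpoint/pytorch_model.bin')
```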
Here is a code snippet showing how to play with our model using HuggingFace transformers:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = 'DataHammer/hammerllm-1.4b-222k'
text = '北京理工大学是'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
# if your device does not support bfloat16, you can remove the torch_dtype argument
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()
input_ids = tokenizer(text, return_tensors='pt').input_ids
output = model.generate(
input_ids=input_ids.cuda(),
max_length=min(input_ids.shape[1] + 100, 1024),
do_sample=True,
top_p=0.95
).tolist()
generation = tokenizer.decode(output[0])
print(generation)
```