This section covers the tokenizer, architecture, and training details of our model.
The motivation is to build a tokenizer that achieves a higher compression rate on Chinese, English, and code data, and covers 100% of Chinese characters. Since the InternLM tokenizer has the best compression rate on our test set, we modify it with the following improvements:
- Train our own tokenizer that covers 100% of common Chinese characters
- Calibrate the scores of our tokenizer based on the InternLM tokenizer
- Supplement tokens that appear in our trained tokenizer but not in the InternLM tokenizer, such as some common Chinese characters and user-defined symbols
- Convert it into the Llama tokenizer format
You can find more details here about how we build our tokenizer. Our tokenizer is available here; it has 105,789 tokens and covers 100% of Chinese characters.
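As a quick sanity check, the tokenizer can be loaded directly with HuggingFace transformers; the snippet below is illustrative and uses the same repository id as the usage example at the end of this document.

```python
# Illustrative quick check: load our tokenizer from the HuggingFace Hub and
# inspect its size and how it splits a Chinese sentence.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('DataHammer/hammerllm-1.4b-222k', use_fast=False)
print(len(tokenizer))  # expected: 105789
# common Chinese characters should map to whole tokens rather than raw bytes
print(tokenizer.tokenize('北京理工大学是一所位于北京的大学。'))
```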
To demonstrate the effectiveness of our tokenizer, we compare it with the tokenizers of several well-known open-source LLMs using the following metrics:
- Compression Rate: we compare two kinds of compression rate against the tokenizers of several open-source LLMs (see the sketch after this list):
  - Byte-per-token compression rate
  - Compared compression rate, which measures the advantage over the base Llama-2-7B tokenizer
Please refer to this Python script for more details. We evaluate tokenizers on Chinese, English, and code test sets to compute the compression rate; the data sources are listed as follows:
- Chinese: Skywork/ChineseDomainModelingEval
- English: test set of EleutherAI/pile
- Code: split from Pile-GitHub
These test data are publicly available at this link.
- Chinese Character Coverage: a good Chinese LLM should cover as many Chinese characters as possible. In our work, we leverage vocab-coverage to compute the coverage of Chinese characters, which includes:
  - First-level Chinese Character Coverage (FCC): 3,500 widely used Chinese characters
  - Second-level Chinese Character Coverage (SCC): 3,000 Chinese characters
  - Third-level Chinese Character Coverage (TCC): 1,605 uncommon Chinese characters
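As a rough illustration of how the two compression metrics relate, here is a minimal sketch assuming HuggingFace tokenizers and a list of plain-text documents; the Python script referenced above is the authoritative implementation.

```python
# Minimal sketch of the two compression metrics; the official numbers are
# produced by the evaluation script referenced above.
from transformers import AutoTokenizer

def byte_per_token(tokenizer, texts):
    # UTF-8 bytes of the corpus divided by the number of tokens it is split into
    # (higher means better compression).
    n_bytes = sum(len(t.encode('utf-8')) for t in texts)
    n_tokens = sum(len(tokenizer(t, add_special_tokens=False).input_ids) for t in texts)
    return n_bytes / n_tokens

texts = ['北京理工大学是一所位于北京的大学。', 'def add(a, b):\n    return a + b']
ours = AutoTokenizer.from_pretrained('DataHammer/hammerllm-1.4b-222k', use_fast=False)
llama2 = AutoTokenizer.from_pretrained('meta-llama/Llama-2-7b-hf')  # baseline (gated repo; any Llama-2 tokenizer works)

bpt_ours, bpt_llama2 = byte_per_token(ours, texts), byte_per_token(llama2, texts)
# Compared compression = Llama-2-7B byte-per-token / candidate byte-per-token,
# i.e. the fraction of tokens needed relative to Llama-2 (lower is better);
# e.g. 2.2755 / 4.3143 ≈ 0.5274 in the table below.
print(bpt_ours, bpt_llama2 / bpt_ours)
```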
The experimental results are shown as follows:
| Tokenizer | FCC | SCC | TCC | Byte per Token | Compared Compression |
|---|---|---|---|---|---|
| Chatglm-6b | 99.97% | 57.47% | 2.99% | 4.2911 | 0.5303 |
| Chatglm2/3-6b | 100.00% | 77.83% | 13.89% | 4.0329 | 0.5642 |
| Baichuan2-7b/14b | 100.00% | 99.8% | 86.48% | 4.1827 | 0.5440 |
| Internlm-7b/20b | 100.00% | 65.93% | 5.67% | 4.3133 | 0.5276 |
| Qwen-7b/14b/72b | 100.00% | 100.00% | 100.00% | 4.1326 | 0.5506 |
| Llama-2-7b/13b/70b | 17.29% | 0.13% | 0.00% | 2.2755 | 1.00 |
| Ours | 100.00% | 100.00% | 100.00% | 4.3143 | 0.5274 |
The experimental results demonstrate the advantages of our tokenizer over the tokenizers of existing popular LLMs in both compression rate (on Chinese, English, and code) and Chinese character coverage.
Our model leverages an advanced architecture designed to maximize efficiency and effectiveness in processing and generating text. Here are the key specifications:
| Setting | Value |
|---|---|
| parameters | 1.4B |
| attention | Grouped-Query Attention (GQA) |
| num layers | 22 |
| num attention heads | 32 |
| query groups | 4 |
| hidden size | 2048 |
| intermediate size | 5632 |
| activation func | SiLU |
| max sequence length | 2048 |
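For reference, these settings correspond to a Llama-style configuration in HuggingFace transformers; the sketch below is illustrative, and the config files under configs/hammerllm in this repository are authoritative.

```python
# A sketch of the architecture expressed as a Llama-style config; the files in
# configs/hammerllm are the authoritative source.
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=105789,             # size of our tokenizer
    hidden_size=2048,
    intermediate_size=5632,
    num_hidden_layers=22,
    num_attention_heads=32,
    num_key_value_heads=4,         # 4 query groups -> grouped-query attention
    hidden_act='silu',
    max_position_embeddings=2048,  # max sequence length
)
print(config)
```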
Our training process was meticulously planned and executed to ensure the model's robustness and efficiency:
| Setting | Value |
|---|---|
| block size | 2048 |
| per-device batch size | 8 |
| gradient accumulation steps | 16 |
| num devices | 8 |
| total batch size | 2M tokens |
| max learning rate | 5e-4 |
| min learning rate | 5e-5 |
| warmup steps | 2000 |
| learning rate schedule | cosine |
We have modified the learning rate schedule, and more details can be found here.
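For clarity, the 2M-token total batch size follows directly from the other settings in the table:

```python
# Total batch size per optimizer step, derived from the settings above.
num_devices = 8
per_device_batch_size = 8
gradient_accumulation_steps = 16
block_size = 2048

tokens_per_step = num_devices * per_device_batch_size * gradient_accumulation_steps * block_size
print(tokens_per_step)  # 2,097,152 tokens ≈ 2M
```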
Our model was trained on a meticulously curated selection of Chinese, English and Coding datasets, designed to foster wide-ranging language understanding:
| Dataset | Split | Tokens (Billion) | Domain |
|---|---|---|---|
| ChineseWebText | Chinese | 142 | Chinese |
| RefinedWeb | English | 128 | English |
| Pile-arXiv | English | 38 | English |
| Pile-Wiki | English | 12 | English |
| Pile-GitHub | Code | 30 | Coding |
With the help of Flash Attention 2 and torch.compile, we achieve a high training throughput of 16k tokens per GPU per second. The settings for computing the throughput are:
- ZeRO-1
- block_size: 2048
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 16
| Settings | Tokens per GPU per second |
|---|---|
| None | CUDA OOM |
| Flash Attention 2 | 13k |
| torch.compile | CUDA OOM |
| Flash Attention 2 + torch.compile | 16k |
To cover more Chinese characters, our tokenizer is much larger than TinyLlama's (105,789 vs. 32,000 tokens), which leads to a lower throughput than TinyLlama's (16k vs. 24k). However, our throughput is comparable to TinyLlama's when the tokenizer size is the same (with per_device_train_batch_size set to 20 for this comparison):
| Settings | Tokens per GPU per second |
|---|---|
| Flash Attention 2 + torch.compile | 24k |
Unlike TinyLlama, which leverages some complex operator fusion, we achieve this throughput solely with torch.compile and Flash Attention 2.
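For reference, here is a hedged sketch of how these two optimizations are typically enabled with HuggingFace transformers; our train.py exposes them via the --use_flash_attention_2 and --use_torch_compile flags shown in the script below, and the exact wiring there may differ.

```python
# Sketch only: how Flash Attention 2 and torch.compile are commonly enabled;
# requires a recent transformers version and the flash-attn package.
import torch
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained('configs/hammerllm')   # config path from train.sh
model = AutoModelForCausalLM.from_config(
    config,
    torch_dtype=torch.bfloat16,
    attn_implementation='flash_attention_2',  # Flash Attention 2 kernels
)
model = torch.compile(model)  # graph capture + kernel fusion via TorchInductor
```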
Our pre-training runs on an 8x80GB A100 server with 1TB of CPU memory. You can train your model with fewer GPUs and less memory by using a smaller batch size.
We have prepared a Docker environment for pre-training, which already incorporates Flash Attention 2 and torch.compile for efficient pre-training. The Dockerfile is here.
Please refer to our data preparation guide for more details on how we pre-process the pre-training dataset, such as dataset reformatting and sufficient shuffling.
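If you are preparing your own data, the sketch below shows the general idea of tokenizing, packing into 2048-token blocks, and shuffling; it is not our actual preprocessing pipeline, and the file paths and the text column name are placeholders.

```python
# Minimal sketch of producing a tokenized, block-packed, shuffled dataset for
# train.py; see the data preparation guide for the actual pipeline.
from datasets import load_dataset
from transformers import AutoTokenizer

block_size = 2048
tokenizer = AutoTokenizer.from_pretrained('DataHammer/hammerllm-1.4b-222k', use_fast=False)
raw = load_dataset('json', data_files='/path/to/raw/corpus.jsonl', split='train')

def tokenize(batch):
    # Tokenize each document and append an EOS token as a document separator.
    return {'input_ids': [tokenizer(t, add_special_tokens=False).input_ids + [tokenizer.eos_token_id]
                          for t in batch['text']]}

def group(batch):
    # Concatenate documents and cut the token stream into fixed-size blocks.
    ids = [tok for doc in batch['input_ids'] for tok in doc]
    total = (len(ids) // block_size) * block_size
    return {'input_ids': [ids[i:i + block_size] for i in range(0, total, block_size)]}

tokenized = raw.map(tokenize, batched=True, remove_columns=raw.column_names, num_proc=100)
packed = tokenized.map(group, batched=True, num_proc=100)
packed = packed.shuffle(seed=3407)  # sufficient shuffle; seed matches DATA_SEED in train.sh
packed.save_to_disk('/path/to/your/tokenized/train/dataset')
```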
Start Training (train.sh)
Once you have prepared the running environment and the data, you could launch the pre-training process by:
```bash
set -ex
export WANDB_PROJECT=hammerllm
BASE_DIR="$PWD"
DATE=$(TZ=Asia/Shanghai date +'%Y%m%d%H%M%S')
CONFIG_PATH=${BASE_DIR}/configs/hammerllm
RUN_NAME=hammerllm_torch_compile_flash_attn_2
OUTPUT_DIR=${BASE_DIR}/checkpoint/${RUN_NAME}
DATA_SEED=3407
MODEL_SEED=3407
export CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
export TOKENIZERS_PARALLELISM=false
export WANDB_MODE=online
if [ ! -d ${OUTPUT_DIR} ]
then
mkdir -p ${OUTPUT_DIR}
fi
echo "Setting checkpoint directory to ${OUTPUT_DIR}"
MASTER_PORT=$(shuf -n 1 -i 60000-65535)
torchrun --nproc_per_node=8 --master_port ${MASTER_PORT} train.py \
--model_name_or_path ${CONFIG_PATH} \
--use_flash_attention_2 \
--use_torch_compile \
--train_file /path/to/your/tokenized/train/dataset \
--validation_files /path/to/your/tokenized/validation/dataset_1 /path/to/your/tokenized/validation/dataset_2 ... \
--preprocessing_num_workers 100 \
--block_size 2048 \
--do_train \
--do_eval \
--per_device_train_batch_size 8 \
--per_device_eval_batch_size 8 \
--gradient_accumulation_steps 16 \
--logging_steps 10 \
--max_steps 1000000 \
--warmup_steps 2000 \
--eval_steps 500 \
--save_steps 500 \
--evaluation_strategy steps \
--save_strategy steps \
--greater_is_better false \
--load_best_model_at_end false \
--ddp_find_unused_parameters false \
--remove_unused_columns false \
--save_total_limit 50 \
--learning_rate 5e-4 \
--lr_scheduler_type cosine \
--output_dir ${OUTPUT_DIR} \
--report_to wandb \
--run_name ${RUN_NAME} \
--bf16 \
--seed ${MODEL_SEED} \
--data_seed ${DATA_SEED} \
--deepspeed ${BASE_DIR}/configs/zero_1.json
```
This is the launching shell script for pre-training; details of our pre-training codebase can be found here.
The saved checkpoints contain some unusual key names introduced by torch.compile (typically an `_orig_mod.` prefix), which prevents the model parameters from being loaded directly. We provide a script here that calibrates these key names:
```bash
python convert_checkpoint.py --input-path <path of saved checkpoint> --output-path <path of converted transformers checkpoint>
```
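Under the hood, the conversion amounts to stripping the prefix that torch.compile adds to parameter names, roughly as sketched below; the provided convert_checkpoint.py is the authoritative implementation, and the paths here are placeholders.

```python
# Sketch of the key-name calibration: torch.compile wraps the model in an
# OptimizedModule, so saved state_dict keys carry an "_orig_mod." prefix that a
# plain from_pretrained load does not expect.
import torch

def strip_compile_prefix(state_dict):
    return {k.removeprefix('_orig_mod.'): v for k, v in state_dict.items()}

state = torch.load('/path/of/saved/checkpoint/pytorch_model.bin', map_location='cpu')
torch.save(strip_compile_prefix(state), '/path/of/converted/checkpoint/pytorch_model.bin')
```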
Here is a code snippet showing how to play with our model using HuggingFace transformers:
```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = 'DataHammer/hammerllm-1.4b-222k'
text = '北京理工大学是'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
# if your device does not support bfloat16, you can remove the torch_dtype argument
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).cuda()
input_ids = tokenizer(text, return_tensors='pt').input_ids
output = model.generate(
input_ids=input_ids.cuda(),
max_length=min(input_ids.shape[1] + 100, 1024),
do_sample=True,
top_p=0.95
).tolist()
generation = tokenizer.decode(output[0])
print(generation)
```