English | 中文
Welcome to the pre-training repository for our small-size Large Language Model (sLLM) with 1.4 billion parameters, leveraged on the Llama 2 architecture. Our sLLM exhibits advantages in following features:
- Llama-2 Compatible LLM for Chinese and English🦙
- Simple yet Efficient Training Codebase🚀
- Complete Open-Source: model weights, environment, codebase, and hyper-parameters🔗
- Comparable performance with advanced sLLMs with the similar model scale🥇
- Tokenizer with highest compression rate and 100% Chinese character coverage🏆
- [2024.4.2] 🎉🎉🎉Release all the resources about our HammerLLM🔨, including model weights, docker environment, and training codebase
- Continue to optimize our models on more Chinese, English, and Code tokens
- Pretrain on more Code data
- HammerLLM🔨
Here is a code snippet to show you how to play with our model with HuggingFace transformers
:
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
model_name = 'DataHammer/hammerllm-1.4b-222k'
text = '北京理工大学是'
tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=False)
# if your device donot support the bfloat16, you could remove it
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
input_ids = tokenizer(text, return_tensors='pt').input_ids
output = model.generate(
input_ids=input_ids.cuda(),
max_length=min(int(len(input_ids) + 100), 1024),
do_sample=True,
top_p=0.95
).tolist()
generation = tokenizer.decode(output[0])
print(generation)
Here, we show some cases about text completion by our model on Chinese, English, and Code setting.
Prompt: 北京理工大学是
Generation: <s> 北京理工大学是教育部直属全国重点大学,是我国“211工程”和“985工程”重点建设的教育部直属理工类全国重点大学,直属于工业 和信息化部,是全国首批正式获得博士学位和博士后研究资格的高校之一,首批获得公安专业硕士、公安专业博士和司法专业博士学位的22所高校之 一,是博士学位授予单位、法学(一级学科)和公安学(一级学科)博士授权高校,是一所以工为主、理工结合、多学科协调发展的全国重点大学。
北京理工大学是中央部门所属高校,中央共建高校,教育部直属高校。
一、大学院(院系)
北京理工大学现有48个博士学位授权一级学科,涵盖17个学科门类。
二、专业院系
北京理工大学设有11个一级学科博士后科研流动站
三、研究生院
北京理工大学设有29个研究所,其中国家重点实验室2个; 博士后科研流动站18个;博士后科研工作站3个。
北京理工大学研究生院招生专业:</s>
Prompt: 小米公司是一家
Generation: <s> 小米公司是一家专注于智能硬件和电子产品研发的互联网公司,公司产品包括智能手机、笔记本电脑、路由器等。小米公司一直专 注于技术创新和卓越的体验,致力于为用户提供更好的产品。
小米公司拥有一支经验丰富的研发团队,在产品开发、用户体验等方面积累了丰富的经验。公司研发团队中的大部分成员来自微软、谷歌等知名科技 公司,对互联网和移动互联网技术有着深刻理解。此外,小米公司还拥有专业的开发团队和良好的技术支持。
小米公司的核心产品之一是小米手机。小米公司的目标是让用户可以在手机中轻松分享图片、视频、音乐和音乐文件等。小米公司提供了一种简单易 用的移动媒体解决方案,让用户可以利用智能手机来轻松享受音乐、照片、视频和游戏等移动媒体内容。
小米手机还支持多种功能,包括智能语音助手、远程控制、远程定位、移动支付等。小米手机还支持各种不同的应用和服务,包括社交媒体、游戏、 视频分享、本地化服务等。
小米公司还拥有丰富的硬件产品线,包括智能手机、笔记本电脑、平板电脑、电视等。小米公司还推出了许多极具竞争力的产品,如智能手环、智能 手表等。</s>
Prompt: Google is
Generation: <s> Google is to deliver its latest Pixel phone to AT&T, instead of the original, cheaper Pixel 3a.
AT&T US is expected to unveil the Pixel 5a on the 8th of February, or in the evening of Friday the 12th next year.
However, it appears Google has announced the Pixel 5a as the second Pixel phone to be delivered to AT&T, rather than the original Pixel 3a.
According to a blog post published by Pixel US on 26th October, the Pixel 5a will be available on the 9th of March with a price of $599.99.
Google has also promised the Pixel 5a will be the first Pixel phone sold in the U.S., US.
In the announcement made the company is also hinting that the Pixel 5a will be available on Sprint's network, meaning the Pixel 5a won't be delivered to T-Mobile US.
A couple of months ago, the Google Pixel 5a, then known as the Pixel 4a, was revealed to the world on the 22nd of October alongside an invite for beta program for the Pixel 5a.
The Pixel 5a is believed to be the largest smartphone Pixel has ever produced.</s>
Prompt: This is a code snippet for sorting a list of number in Python language with python in-built function `sorted`:
"""python
number_list = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
Generation: <s> This is a code snippet for sorting a list of number in Python language with python in-built function `sorted`:
"""python
number_list = [10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0]
print(sorted(number_list))
"""
[`sorted`](https://docs.python.org/2/library/ordereddict.html) is a function that helps you sort a list of values in Python language.
For the list of items like number, this function is also used to sort the list.
</s>
We publicly release our checkpoints on Huggingface. Welcome to test and utilize our models!
Epoch | Step | Pretrained tokens | Sequence Length | HuggingFace checkpoint |
---|---|---|---|---|
1 | 25k | 50B | 2048 | DataHammer/hammerllm-1.4b-25k |
1 | 50k | 100B | 2048 | DataHammer/hammerllm-1.4b-50k |
1 | 75k | 150B | 2048 | DataHammer/hammerllm-1.4b-75k |
1 | 100k | 200B | 2048 | DataHammer/hammerllm-1.4b-100k |
1 | 125k | 250B | 2048 | DataHammer/hammerllm-1.4b-125k |
1 | 150k | 300B | 2048 | DataHammer/hammerllm-1.4b-150k |
1 | 172.5k | 345B | 2048 | DataHammer/hammerllm-1.4b-172k |
2 | 197k | 395B | 2048 | DataHammer/hammerllm-1.4b-197k |
2 | 222k | 445B | 2048 | DataHammer/hammerllm-1.4b-222k |
For base models that are not instruct-tuned, it is important to show their perplexity on the latest fresh test dataset, which reflects their prediction capability without training data leakage. To achieve this goal, we compare the perplexity of our model and some famous sLLMs on famous Chinese and English latest test set.
Our perplexity calculation follows the settings of Skywork, please refer to this link for more details about their implementation.
Our model's performance on perplexity (ppl) testing is second only to the Qwen-1.5-1.8B model, outperforming Gemma-2B, InterLM2-1.8B, and MiniCPM-2B (350k). Given the fact that the our model size is smaller (1.4B < 1.8B), these experimental results are surprising.
Chinese Corpus (Skywork-ChineseDomainModelingEval)
Tokens | Tech | Movie | Government | Game | Finance | General | Avg. | |
---|---|---|---|---|---|---|---|---|
Gemma-2B | 6T | 71.92 | 271.95 | 93.16 | 122.59 | 46.53 | 74.93 | 92.92 |
InterLM2-1.8B | - | 19.20 | 36.88 | 7.18 | 29.67 | 9.51 | 11.81 | 16.03 |
MiniCPM-2B(350k) | - | 17.32 | 36.44 | 7.00 | 26.59 | 7.92 | 10.26 | 14.56 |
Qwen-1.5(1.8B) | - | 16.88 | 33.39 | 6.63 | 25.44 | 7.43 | 9.77 | 13.80 |
Qwen-1.5(0.5B) | - | 20.70 | 41.59 | 8.24 | 31.64 | 9.41 | 11.98 | 17.13 |
Ours | 50B | 19.59 | 44.58 | 7.48 | 33.55 | 8.10 | 11.31 | 16.49 |
Ours | 100B | 18.23 | 40.26 | 7.11 | 30.80 | 7.42 | 10.47 | 15.23 |
Ours | 150B | 17.59 | 38.86 | 6.91 | 29.76 | 7.19 | 10.20 | 14.75 |
Ours | 200B | 17.24 | 38.03 | 6.84 | 29.01 | 7.05 | 10.04 | 14.48 |
Ours | 250B | 17.08 | 37.53 | 6.81 | 28.63 | 6.97 | 9.91 | 14.32 |
Ours | 300B | 16.92 | 37.17 | 6.82 | 28.37 | 6.89 | 9.82 | 14.21 |
Ours | 345B | 16.79 | 36.89 | 6.73 | 28.11 | 6.84 | 9.75 | 14.09 |
Ours(epoch-2) | 395B | 16.71 | 36.64 | 6.60 | 28.01 | 6.85 | 9.72 | 14.00 |
Ours(epoch-2) | 445B | 16.67 | 36.43 | 6.68 | 27.86 | 6.75 | 9.65 | 13.95 |
English Corpus (RealTimeData-Latest Collection)
Tokens | arxiv-latest | bbc-latest | github-latest | wikitext-latest | Avg. | |
---|---|---|---|---|---|---|
TinyLlama-1.1B | 3T | 8.21 | 8.27 | 5.90 | 7.38 | 7.37 |
Gemma-2B | 6T | 28.96 | 37.83 | 22.10 | 28.11 | 28.72 |
Phi-1.5(1.3B) | 30B * 5 | 13.08 | 14.07 | 10.42 | 15.16 | 13.06 |
InterLM2-1.8B | - | 7.84 | 8.75 | 6.87 | 8.52 | 7.96 |
MiniCPM-2B(350k) | - | 10.81 | 9.15 | nan | 8.08 | 9.28 |
Qwen-1.5(1.8B) | - | 8.70 | 8.92 | nan | 8.75 | 8.79 |
Qwen-1.5(0.5B) | - | 10.42 | 10.93 | nan | 10.58 | 10.64 |
Ours | 50B | 10.67 | 13.19 | 9.14 | 10.47 | 10.77 |
Ours | 100B | 9.87 | 11.97 | 8.46 | 9.71 | 9.93 |
Ours | 150B | 9.57 | 11.63 | 8.16 | 9.38 | 9.61 |
Ours | 200B | 9.41 | 11.43 | 8.06 | 9.21 | 9.45 |
Ours | 250B | 9.39 | 11.21 | 7.96 | 9.13 | 9.35 |
Ours | 300B | 9.79 | 11.24 | 7.90 | 9.03 | 9.41 |
Ours | 345B | 9.59 | 11.11 | 7.86 | 9.04 | 9.33 |
Ours(epoch-2) | 395B | 9.65 | 11.06 | 7.86 | 9.01 | 9.32 |
Ours(epoch-2) | 445B | 9.34 | 11.07 | 7.86 | 9.04 | 9.26 |
NOTE: Average PPL of Qwen-1.5 and MiniCPM models have excluded GitHub dataset for NaN value.
To reveal the effectiveness of our tokenizer, we compare it with some famous open-source LLMs' tokenizers by using following metrics:
-
Compression Rate: We compare two kinds of compression rate of tokenizer with tokenizers of some open-source LLMs:
- Byte per token compression rate
- Compared compression that measures the advantage over base Llama-2-7B tokenizer.
-
Chinese Character Coverage: a good Chinese LLM should cover more characters. In our work, we leverage vocab-coverage for computing the coverage of Chinese characters, which includes:
- First-level Chinese character Coverage (FCC) contains 3500 widely-used Chinese characters.
- Second-level Chinese character Coverage (SCC) contains 3000 Chinese characters.
- Third-level Chinese character Coverage (TCC) contains 1605 uncommon Chinese characters.
The experimental results are shown as follows:
Tokenizer | FCC | SCC | TCC | Byte per Token |
Comparied Compression |
---|---|---|---|---|---|
Chatglm-6b | 99.97% | 57.47% | 2.99% | 4.2911 | 0.5303 |
Chatglm2/3-6b | 100.00% | 77.83% | 13.89% | 4.0329 | 0.5642 |
Baichuan2-7b/14b | 100.00% | 99.8% | 86.48% | 4.1827 | 0.5440 |
Internlm-7b/20b | 100.00% | 65.93% | 5.67% | 4.3133 | 0.5276 |
Qwen-7b/14b/72b | 100.00% | 100.00% | 100.00% | 4.1326 | 0.5506 |
Llama-2-7b/13b/70b | 17.29% | 0.13% | 0.00% | 2.2755 | 1.00 |
Ours | 100.00% | 100.00% | 100.00% | 4.3143 | 0.5274 |
The experimental results reveal the advantages of our tokenizer over existing popular LLMs' tokenizers in compression rate (on Chinese, English, and Code), and the Chinese character coverage. Please refer to REPRODUCE.md for more details about this experiment.
With the help of these two stragegies, we could achieve a high throughput of 16k tokens per GPU per second for training:
The setting for computing the throughput is:
- ZeRO-1
- block_size: 2048
- per_device_train_batch_size: 8
- gradient_accumulation_steps: 16
Settings | tokens per GPU per second |
---|---|
None | CUDA OOM |
Flash Attention 2 | 13k |
torch.compile | CUDA OOM |
Flash Attention 2 + torch.compile | 16k |
To cover more Chinese characters, our tokenizer is much larger than TinyLlama (105789 > 32000), leading to the lower throughput than TinyLlama (16k < 24k).
However, our throughput is comparable with TinyLlama, when the size of the tokenizer is the same (per_device_train_batch_size is set to 20 for this).
Settings | tokens per GPU per second |
---|---|
Flash Attention 2 + torch.compile | 24k |
Unlike TinyLlama that leverages some complex the operations fusion, we achieve this throughput solely based on torch.compile
and flash attention 2
.
So far, we have optimized our model on over 400B Chinese, English, and Code tokens. Perplexity results in Chinese and English test set demonstrate that our model has not yet converged, hence we are currently continuing to optimize it.
We will periodically upload and make our checkpoints publicly available for analysis and research by the open-source community.
The trend of loss changes during the training process is shown in the following figure:
The trends of standardized perplexity variation based on Skywork series models on the Skywork-ChineseDomainModelingEval and RealTimeData-Latest Collection datasets are:
Chinese PPL |
English PPL |
One of contriution is that we have publicly release the steps to train our model from scratch. You can refer to this file for more details about our work.
@software{Ziao_HammerLLM_2024,
author = {Ziao, Ma and Tian, Lan and Yizhe, Yang and Yong, Hu},
month = apr,
title = {{HammerLLM}},
url = {https://github.com/Academic-Hammer/HammerLLM},
version = {1.0.0},
year = {2024}
}
You can contact us through the email address:
This project is released under the MIT license.
We hope this documentation is helpful for understanding and utilizing our LLM model effectively. For further questions or support, please open an issue on this repository.