fix: Qwen v1.5 Tokenizer bug #107

chenghuaWang · 2024-07-31T09:30:29Z

Two packages, re2 and absl, have been integrated into the Qwen1.5 Demo, and the compilation has passed on the ARM platform.
A feature has been added to BPE where it does not split word when encountering a Special Token.

chenghuaWang · 2024-08-01T11:30:43Z

I tested this giant UnicodeData.cpp on 3 Compilers

Clang 17.0.6 Self-build. Debug SUCCESS. Release FAILED to built mllm due to UnicodeData.cpp's const data is too large.
GCC 12.3.0 Pre-build. Both SUCCESS
Clang 18 provided by NDK. Both SUCCESS.

chenghuaWang added 3 commits July 31, 2024 17:24

fix: qwen tokenizer bugs. Bpe Special Token method.

1ef3e5f

fix: typo

ab350f9

fix: set qwen 1.8B as default

f234229

yirongjie self-requested a review August 1, 2024 04:28

fix: remove re2 and abseil-cpp. Using Tokenizer refed from llama.cpp

33ca712

chenghuaWang force-pushed the main branch from 7285b2d to 33ca712 Compare August 1, 2024 11:24

chenghuaWang and others added 2 commits August 1, 2024 19:25

Merge branch 'UbiquitousLearning:main' into main

080da9b

fix: typo

69bdbfa

fix: typo

ddf3dfb

yirongjie approved these changes Aug 2, 2024

View reviewed changes

yirongjie merged commit 1b1c3cc into UbiquitousLearning:main Aug 2, 2024
1 check passed

yirongjie changed the title ~~Fix: Qwen v1.5 Tokenizer bug~~ fix: Qwen v1.5 Tokenizer bug Aug 16, 2024

Provide feedback