Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Qwen v1.5 Tokenizer bug #107

Merged
merged 7 commits into from
Aug 2, 2024
Merged

Conversation

chenghuaWang
Copy link
Contributor

  1. Two packages, re2 and absl, have been integrated into the Qwen1.5 Demo, and the compilation has passed on the ARM platform.
  2. A feature has been added to BPE where it does not split word when encountering a Special Token.

@yirongjie yirongjie self-requested a review August 1, 2024 04:28
@chenghuaWang
Copy link
Contributor Author

chenghuaWang commented Aug 1, 2024

I tested this giant UnicodeData.cpp on 3 Compilers

  1. Clang 17.0.6 Self-build. Debug SUCCESS. Release FAILED to built mllm due to UnicodeData.cpp's const data is too large.
  2. GCC 12.3.0 Pre-build. Both SUCCESS
  3. Clang 18 provided by NDK. Both SUCCESS.

@yirongjie yirongjie merged commit 1b1c3cc into UbiquitousLearning:main Aug 2, 2024
1 check passed
@yirongjie yirongjie changed the title Fix: Qwen v1.5 Tokenizer bug fix: Qwen v1.5 Tokenizer bug Aug 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants