chore(deps): bump tokenizers from 0.19.1 to 0.20.1 #339

dependabot · 2024-10-10T18:01:26Z

Bumps tokenizers from 0.19.1 to 0.20.1.

Release notes

Release v0.20.1

What's Changed

The long-awaited fix for the offset issue with Llama has landed 🥳

Update README.md by @ArthurZucker in huggingface/tokenizers#1608

fix benchmark file link by @152334H in huggingface/tokenizers#1610

Bump actions/download-artifact from 3 to 4.1.7 in /.github/workflows by @dependabot in huggingface/tokenizers#1626

[ignore_merges] Fix offsets by @ArthurZucker in huggingface/tokenizers#1640

Bump body-parser and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1629

Bump serve-static and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1630

Bump send and express in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1631

Bump webpack from 5.76.0 to 5.95.0 in /tokenizers/examples/unstable_wasm/www by @dependabot in huggingface/tokenizers#1641

Fix documentation build by @ArthurZucker in huggingface/tokenizers#1642

style: simplify string formatting for readability by @hamirmahal in huggingface/tokenizers#1632

New Contributors

@152334H made their first contribution in huggingface/tokenizers#1610

@hamirmahal made their first contribution in huggingface/tokenizers#1632

Full Changelog: huggingface/tokenizers@v0.20.0...v0.20.1

Release v0.20.0: faster encode, better python support

Release v0.20.0

This release focuses on performance and user experience.

Performance:

First off, we did a bit of benchmarking and found some room for improvement. A few minor changes (mostly #1587) make encoding faster; we measured this on Llama3 running on a g6 instance on AWS with https://github.com/huggingface/tokenizers/blob/main/bindings/python/benches/test_tiktoken.py.

Python API

We shipped better deserialization errors in general, and support for __str__ and __repr__ on all objects. This makes debugging much easier, for example:
>>> from tokenizers import Tokenizer
>>> tokenizer = Tokenizer.from_pretrained("bert-base-uncased")
>>> print(tokenizer)
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, ...}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, ...}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, "[unused2]":3, "[unused3]":4, ...}))
>>> tokenizer
Tokenizer(version="1.0", truncation=None, padding=None, added_tokens=[{"id":0, "content":"[PAD]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":100, "content":"[UNK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":101, "content":"[CLS]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":102, "content":"[SEP]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}, {"id":103, "content":"[MASK]", "single_word":False, "lstrip":False, "rstrip":False, "normalized":False, "special":True}], normalizer=BertNormalizer(clean_text=True, handle_chinese_chars=True, strip_accents=None, lowercase=True), pre_tokenizer=BertPreTokenizer(), post_processor=TemplateProcessing(single=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0)], pair=[SpecialToken(id="[CLS]", type_id=0), Sequence(id=A, type_id=0), SpecialToken(id="[SEP]", type_id=0), Sequence(id=B, type_id=1), SpecialToken(id="[SEP]", type_id=1)], special_tokens={"[CLS]":SpecialToken(id="[CLS]", ids=[101], tokens=["[CLS]"]), "[SEP]":SpecialToken(id="[SEP]", ids=[102], tokens=["[SEP]"])}), decoder=WordPiece(prefix="##", cleanup=True), model=WordPiece(unk_token="[UNK]", continuing_subword_prefix="##", max_input_chars_per_word=100, vocab={"[PAD]":0, "[unused0]":1, "[unused1]":2, ...}))
pre_tokenizers.Sequence and normalizers.Sequence are also more accessible now; their items can be read by index and modified:
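from tokenizers import normalizers
norm = normalizers.Sequence([normalizers.Strip(), normalizers.BertNormalizer()])
norm[0]  # read the first item of the sequence (the Strip normalizer)
norm[1].lowercase = False  # set an attribute on the second item (the BertNormalizer)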

... (truncated)

Commits

d98298a 0.20.1
de305f2 update to ubuntu-22.04
1053470 use --interpreter ${{ matrix.interpreter || '3.7 3.8 3.9 3.10 3.11 3.12 pypy3...
f7c33eb add Cargo
eca17be v 0.20.1-rc1
557fde7 style: simplify string formatting for readability (#1632)
3d51a16 Fix documentation build (#1642)
294ab86 Bump webpack in /tokenizers/examples/unstable_wasm/www (#1641)
2b97a5e Bump send and express in /tokenizers/examples/unstable_wasm/www (#1631)
077678d Bump serve-static and express in /tokenizers/examples/unstable_wasm/www (#1630)
Additional commits viewable in compare view

Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting @dependabot rebase.

Dependabot commands and options

You can trigger Dependabot actions by commenting on this PR:

@dependabot rebase will rebase this PR
@dependabot recreate will recreate this PR, overwriting any edits that have been made to it
@dependabot merge will merge this PR after your CI passes on it
@dependabot squash and merge will squash and merge this PR after your CI passes on it
@dependabot cancel merge will cancel a previously requested merge and block automerging
@dependabot reopen will reopen this PR if it is closed
@dependabot close will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually
@dependabot show <dependency name> ignore conditions will show all of the ignore conditions of the specified dependency
@dependabot ignore this major version will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this minor version will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself)
@dependabot ignore this dependency will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself)

Bumps [tokenizers](https://github.com/huggingface/tokenizers) from 0.19.1 to 0.20.1.
- [Release notes](https://github.com/huggingface/tokenizers/releases)
- [Changelog](https://github.com/huggingface/tokenizers/blob/main/RELEASE.md)
- [Commits](huggingface/tokenizers@v0.19.1...v0.20.1)

---
updated-dependencies:
- dependency-name: tokenizers
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
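
The `update-type: version-update:semver-minor` value in the commit metadata follows from comparing the two version strings. A minimal sketch of that classification, assuming plain MAJOR.MINOR.PATCH versions; `parse_version` and `classify_update` are hypothetical helpers written for this note, not Dependabot's implementation:

```python
def parse_version(text: str) -> tuple[int, int, int]:
    """Split a plain MAJOR.MINOR.PATCH string into three integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return major, minor, patch


def classify_update(old: str, new: str) -> str:
    """Name the most significant version component that differs."""
    old_parts, new_parts = parse_version(old), parse_version(new)
    labels = ("major", "minor", "patch")
    for label, before, after in zip(labels, old_parts, new_parts):
        if before != after:
            return f"version-update:semver-{label}"
    return "no-change"


# The bump in this PR changes the second component, so it is a minor update.
print(classify_update("0.19.1", "0.20.1"))  # prints "version-update:semver-minor"
```

Pre-release suffixes such as `0.20.1-rc1` are out of scope for this sketch and would need a real version parser.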

dependabot bot added the dependencies label (Pull requests that update a dependency file) Oct 10, 2024

dependabot bot mentioned this pull request Oct 10, 2024

chore(deps): bump tokenizers from 0.19.1 to 0.20.0 #335

Closed

dependabot bot commented on behalf of github Oct 10, 2024
