tokenizers 0.22.1


pip install tokenizers

  Latest version

Released: Sep 19, 2025


Meta
Author: Nicolas Patry, Anthony Moi
Requires Python: >=3.9

Classifiers

Development Status
  • 5 - Production/Stable

Intended Audience
  • Developers
  • Education
  • Science/Research

License
  • OSI Approved :: Apache Software License

Operating System
  • OS Independent

Programming Language
  • Python :: 3
  • Python :: 3.9
  • Python :: 3.10
  • Python :: 3.11
  • Python :: 3.12
  • Python :: 3.13
  • Python :: 3 :: Only

Topic
  • Scientific/Engineering :: Artificial Intelligence





Tokenizers

Provides an implementation of today's most used tokenizers, with a focus on performance and versatility.

Bindings over the Rust implementation. If you are interested in the high-level design, you can check it out in the main repository (https://github.com/huggingface/tokenizers).

Otherwise, let's dive in!

Main features:

  • Train new vocabularies and tokenize using 4 pre-made tokenizers (Bert WordPiece and the 3 most common BPE versions).
  • Extremely fast (both training and tokenization), thanks to the Rust implementation. Takes less than 20 seconds to tokenize a GB of text on a server's CPU.
  • Easy to use, but also extremely versatile.
  • Designed for research and production.
  • Normalization comes with alignment tracking: it's always possible to get the part of the original sentence that corresponds to a given token.
  • Does all the pre-processing: truncate, pad, add the special tokens your model needs (see the short sketch after this list).
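
As a quick illustration of the last two points, here is a minimal sketch, assuming the "bert-base-cased" tokenizer is available on the Hub, combining offset tracking with truncation and padding:

from tokenizers import Tokenizer

# Load any tokenizer; "bert-base-cased" is just an example
tokenizer = Tokenizer.from_pretrained("bert-base-cased")

# Truncate and pad every encoding to 12 tokens
# (the padding defaults, "[PAD]" with id 0, happen to match BERT's vocabulary)
tokenizer.enable_truncation(max_length=12)
tokenizer.enable_padding(length=12)

encoding = tokenizer.encode("I can feel the magic, can you?")
print(encoding.tokens)   # tokens, including the special ones added for the model
print(encoding.offsets)  # (start, end) character offsets back into the original sentence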

Installation

With pip:

pip install tokenizers

From sources:

To use this method, you need to have Rust installed:

# Install with:
curl https://sh.rustup.rs -sSf | sh -s -- -y
export PATH="$HOME/.cargo/bin:$PATH"

Once Rust is installed, you can compile it by doing the following:

git clone https://github.com/huggingface/tokenizers
cd tokenizers/bindings/python

# Create a virtual env (you can use yours as well)
python -m venv .env
source .env/bin/activate

# Install `tokenizers` in the current virtual env
pip install -e .
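
You can then check that the build is picked up by the virtual env, for instance with:

python -c "import tokenizers; print(tokenizers.__version__)"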

Load a pretrained tokenizer from the Hub

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("bert-base-cased")
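
Once loaded, encoding and decoding work like with any other tokenizer (a short usage sketch):

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)
print(encoded.ids)

# Decoding maps the ids back to a string, skipping special tokens by default
print(tokenizer.decode(encoded.ids))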

Using the provided Tokenizers

We provide some pre-built tokenizers to cover the most common cases. You can easily load one of them from a vocab.json and a merges.txt file:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
vocab = "./path/to/vocab.json"
merges = "./path/to/merges.txt"
tokenizer = CharBPETokenizer(vocab, merges)

# And then encode:
encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.ids)
print(encoded.tokens)

And you can train them just as simply:

from tokenizers import CharBPETokenizer

# Initialize a tokenizer
tokenizer = CharBPETokenizer()

# Then train it!
tokenizer.train([ "./path/to/files/1.txt", "./path/to/files/2.txt" ])

# Now, let's use it:
encoded = tokenizer.encode("I can feel the magic, can you?")

# And finally save it somewhere
tokenizer.save("./path/to/directory/my-bpe.tokenizer.json")
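
The saved file is a plain JSON serialization of the full tokenizer, so it can be reloaded later with Tokenizer.from_file:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("./path/to/directory/my-bpe.tokenizer.json")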

Provided Tokenizers

  • CharBPETokenizer: The original BPE
  • ByteLevelBPETokenizer: The byte level version of the BPE
  • SentencePieceBPETokenizer: A BPE implementation compatible with the one used by SentencePiece
  • BertWordPieceTokenizer: The famous Bert tokenizer, using WordPiece

All of these can be used and trained as explained above!
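
As an example, the Bert tokenizer follows the same pattern; this sketch assumes you have a WordPiece vocab.txt at hand (e.g. the one shipped with a BERT checkpoint):

from tokenizers import BertWordPieceTokenizer

# Load an existing WordPiece vocabulary
tokenizer = BertWordPieceTokenizer("./path/to/vocab.txt")

encoded = tokenizer.encode("I can feel the magic, can you?")
print(encoded.tokens)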

Build your own

Whenever these provided tokenizers don't give you enough freedom, you can build your own tokenizer by putting together all the different parts you need. You can check how we implemented the provided tokenizers and adapt them easily to your own needs.

Building a byte-level BPE

Here is an example showing how to build your own byte-level BPE by putting all the different pieces together, and then saving it to a single file:

from tokenizers import Tokenizer, models, pre_tokenizers, decoders, trainers, processors

# Initialize a tokenizer
tokenizer = Tokenizer(models.BPE())

# Customize pre-tokenization and decoding
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=True)
tokenizer.decoder = decoders.ByteLevel()
tokenizer.post_processor = processors.ByteLevel(trim_offsets=True)

# And then train
trainer = trainers.BpeTrainer(
    vocab_size=20000,
    min_frequency=2,
    initial_alphabet=pre_tokenizers.ByteLevel.alphabet()
)
tokenizer.train([
    "./path/to/dataset/1.txt",
    "./path/to/dataset/2.txt",
    "./path/to/dataset/3.txt"
], trainer=trainer)

# And Save it
tokenizer.save("byte-level-bpe.tokenizer.json", pretty=True)

Now, when you want to use this tokenizer, it is as simple as:

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("byte-level-bpe.tokenizer.json")

encoded = tokenizer.encode("I can feel the magic, can you?")
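
And since the decoder was configured and saved as part of the tokenizer, decoding back to text is a one-liner:

print(tokenizer.decode(encoded.ids))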

Release history

0.22.1 Sep 19, 2025
0.22.1rc0 Sep 19, 2025
0.22.0 Aug 29, 2025
0.22.0rc0 Aug 29, 2025
0.21.4 Jul 28, 2025
0.21.2 Jun 24, 2025
0.21.2rc0 Jun 24, 2025
0.21.1 Mar 13, 2025
0.21.1rc0 Mar 12, 2025
0.21.0 Nov 27, 2024
0.21.0rc0 Nov 27, 2024
0.20.4 Nov 26, 2024
0.20.4rc0 Nov 26, 2024
0.20.3 Nov 05, 2024
0.20.3rc0 Nov 15, 2024
0.20.2 Nov 04, 2024
0.20.1 Oct 10, 2024
0.20.1rc1 Oct 10, 2024
0.20.0 Aug 08, 2024
0.19.1 Apr 17, 2024
0.19.0 Apr 17, 2024
0.15.2 Feb 12, 2024
0.15.1 Jan 22, 2024
0.15.0 Nov 14, 2023
0.14.1 Oct 06, 2023
0.14.0 Sep 07, 2023
0.13.3 Apr 05, 2023
0.13.2 Nov 07, 2022
0.13.1 Oct 06, 2022
0.13.0 Sep 21, 2022
0.12.1 Apr 13, 2022
0.12.0 Mar 31, 2022
0.11.6 Feb 28, 2022
0.11.5 Feb 16, 2022
0.11.4 Jan 17, 2022
0.11.3 Jan 17, 2022
0.11.2 Jan 04, 2022
0.11.1 Dec 28, 2021
0.11.0 Dec 23, 2021
0.10.3 May 24, 2021
0.10.2 Apr 05, 2021
0.10.1 Feb 08, 2021
0.10.0 Jan 12, 2021
0.10.0rc1 Dec 08, 2020
0.9.4 Nov 10, 2020
0.9.3 Oct 26, 2020
0.9.2 Oct 15, 2020
0.9.1 Oct 13, 2020
0.9.0 Oct 09, 2020
0.9.0rc2 Oct 06, 2020
0.9.0rc1 Sep 29, 2020
0.9.0.dev4 Sep 24, 2020
0.9.0.dev3 Sep 18, 2020
0.9.0.dev2 Sep 14, 2020
0.9.0.dev1 Sep 14, 2020
0.9.0.dev0 Aug 21, 2020
0.8.1 Jul 20, 2020
0.8.1rc2 Jul 17, 2020
0.8.1rc1 Jul 06, 2020
0.8.0 Jun 26, 2020
0.8.0rc4 Jun 26, 2020
0.8.0rc3 Jun 22, 2020
0.8.0rc2 Jun 19, 2020
0.8.0rc1 Jun 11, 2020
0.8.0.dev2 Jun 03, 2020
0.8.0.dev1 May 27, 2020
0.8.0.dev0 May 21, 2020
0.7.0 Apr 17, 2020
0.7.0rc7 Apr 16, 2020
0.7.0rc6 Apr 16, 2020
0.7.0rc5 Apr 09, 2020
0.7.0rc4 Apr 08, 2020
0.7.0rc3 Mar 31, 2020
0.7.0rc2 Mar 27, 2020
0.6.0 Mar 02, 2020
0.5.2 Feb 24, 2020
0.5.1 Feb 24, 2020
0.5.0 Feb 19, 2020
0.4.2 Feb 11, 2020
0.4.1 Feb 11, 2020
0.4.0 Feb 10, 2020
0.3.0 Feb 05, 2020
0.2.1 Jan 22, 2020
0.2.0 Jan 20, 2020
0.1.1 Jan 12, 2020
0.1.0 Jan 10, 2020
0.0.13 Jan 08, 2020
0.0.12 Jan 07, 2020
0.0.11 Dec 27, 2019
0.0.10 Dec 26, 2019
0.0.9 Dec 23, 2019
0.0.8 Dec 20, 2019
0.0.7 Dec 17, 2019
0.0.6 Dec 17, 2019
0.0.5 Dec 13, 2019
0.0.4 Dec 10, 2019
0.0.3 Dec 03, 2019
0.0.2 Dec 02, 2019
0.0.1 Nov 01, 2019
Dependencies:
  • huggingface-hub (>=0.16.4, <2.0)