tokie 0.0.9


pip install tokie

  Latest version

Released: May 05, 2026


Meta
Requires Python: >=3.10

Classifiers

Programming Language
  • Rust
  • Python :: Implementation :: CPython
  • Python :: 3
  • Python :: 3.10
  • Python :: 3.11
  • Python :: 3.12
  • Python :: 3.13

Topic
  • Scientific/Engineering :: Artificial Intelligence
  • Text Processing
  • Text Processing :: Linguistic

Intended Audience
  • Developers
  • Science/Research

License
  • OSI Approved :: MIT License
  • OSI Approved :: Apache Software License

Development Status
  • 4 - Beta

Operating System
  • OS Independent

Typing
  • Typed

tokie

10-136x faster tokenization, 10x smaller model files, 100% accurate

GitHub · crates.io · HuggingFace


tokie is a fast, correct tokenizer library built in Rust with Python bindings. Drop-in replacement for HuggingFace tokenizers — supports BPE (GPT-2, tiktoken, SentencePiece), WordPiece (BERT), and Unigram encoders.

Installation

pip install tokie

Quick Start

import tokie

# Load from HuggingFace Hub (tries .tkz first, falls back to tokenizer.json)
tokenizer = tokie.Tokenizer.from_pretrained("bert-base-uncased")

# Encode — callable syntax or .encode()
encoding = tokenizer("Hello, world!")
print(encoding.ids)               # [101, 7592, 1010, 2088, 999, 102]
print(encoding.tokens)            # ['[CLS]', 'hello', ',', 'world', '!', '[SEP]']
print(encoding.attention_mask)    # [1, 1, 1, 1, 1, 1]
print(encoding.special_tokens_mask)  # [1, 0, 0, 0, 0, 1]

# Decode
text = tokenizer.decode(encoding.ids)  # "hello , world !"

# Token count (fast, no Encoding overhead)
count = tokenizer.count_tokens("Hello, world!")

# Batch encode (parallel across all cores)
encodings = tokenizer.encode_batch(["Hello!", "World"], add_special_tokens=True)

Padding & Truncation

# Truncate to max length (special tokens preserved)
tokenizer.enable_truncation(max_length=32)

# Pad all sequences in a batch to the same length
tokenizer.enable_padding(length=32, pad_id=tokenizer.pad_token_id or 0)

# Batch encode — all sequences same length, ready for model input
texts = ["Hello world", "Short", "A much longer sentence for testing"]
batch = tokenizer.encode_batch(texts, add_special_tokens=True)
for enc in batch:
    print(len(enc), enc.ids[:5])  # All length 32

Pair Encoding (Cross-Encoders)

pair = tokenizer("How are you?", "I am fine.")  # or tokenizer.encode_pair(...)
print(pair.ids)                # [101, 2129, 2024, 2017, 1029, 102, 1045, 2572, 2986, 1012, 102]
print(pair.type_ids)           # [0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
print(pair.special_tokens_mask)  # [1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1]

Byte Offsets

enc = tokenizer.encode_with_offsets("Hello world")
for token_id, (start, end) in zip(enc.ids, enc.offsets):
    print(f"  token {token_id}: bytes [{start}:{end}]")

Save & Load (.tkz format)

tokie's binary .tkz format is ~10x smaller than tokenizer.json and loads in ~5ms:

tokenizer.save("model.tkz")
tokenizer = tokie.Tokenizer.from_file("model.tkz")

Supported Models

Works with any HuggingFace tokenizer — GPT-2, BERT, Llama 3/4, Mistral, Phi, Qwen, T5, XLM-RoBERTa, and more.

Benchmarks

Model Text Size tokie HF tokenizers Speedup
BERT 900 KB 1.69 ms 229 ms 136x
GPT-2 900 KB 1.70 ms 181 ms 107x
Llama 3 900 KB 2.04 ms 190 ms 93x
Qwen 3 45 KB 0.15 ms 8.18 ms 54x
Gemma 3 45 KB 1.01 ms 9.62 ms 10x

100% token-accurate across all models.

License

MIT OR Apache-2.0

No dependencies