A (nicer) tokenizer you want to use for model inference and training: with all known preventable gotchas normalized or auto-fixed.
Meta
- Author: ModelCloud
- Requires Python: >=3
Classifiers
- Programming Language :: Python :: 3
- Operating System :: OS Independent
Toke(n)icer
News
- 03/03/2026 0.0.7: Fix Qwen 3.5 MoE compat.
- 02/09/2026 0.0.6: Fix ChatGLM compat.
- 09/04/2025 0.0.5: Fix `pad_token_id` detection for `LongCat` model.
- 02/21/2025 0.0.4: ⚡ Now the `tokenicer` instance dynamically inherits the native `tokenizer.__class__` of the tokenizer passed in or loaded via our `tokenicer.load()` api. CI now tests tokenizer compat from 64 different models.
- 02/10/2025 0.0.2: 🤗 Initial release!
Features:
- Compatible with all HF `Transformers`-recognized tokenizers
- Auto-fix models not setting `padding_token`
- Auto-fix models released with the wrong `padding_token`: many models incorrectly use `eos_token` as `pad_token`, which leads to subtle and hidden errors in post-training and inference when batching is used (which is almost always)
- Zero external dependencies outside of `Transformers`
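To see why reusing `eos_token` as `pad_token` is harmful, consider label masking in training. Below is a minimal plain-Python sketch (hypothetical token ids, not part of the tokenicer API): pad positions in the labels are masked to -100 so the loss ignores them, and when the pad id equals the eos id, the genuine end-of-sequence token gets masked too, so the model never learns to stop.

```python
# Hypothetical ids for illustration: eos id 2, dedicated pad id 0.
EOS_ID = 2
PAD_ID = 0

def mask_labels(token_ids, pad_token_id):
    """Return training labels with pad positions masked to -100 (ignored by loss)."""
    return [-100 if t == pad_token_id else t for t in token_ids]

# Sequence padded with eos reused as pad: [tokens..., EOS, "pad", "pad"]
bad = mask_labels([5, 8, 13, EOS_ID, EOS_ID, EOS_ID], pad_token_id=EOS_ID)
# Same sequence padded with a dedicated pad token:
good = mask_labels([5, 8, 13, EOS_ID, PAD_ID, PAD_ID], pad_token_id=PAD_ID)

print(bad)   # the genuine EOS is masked out along with the padding
print(good)  # EOS survives in the labels; only padding is ignored
```

In the first case the real EOS is indistinguishable from padding; in the second, the loss still trains the model to emit EOS.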
Upcoming Features:
- Add automatic tokenizer validation to model training and subsequent inference, so that not only the tokenizer config but the actual `decode`/`encode` behavior is fully re-validated on model load. Inference and training engines often modify the traditional tokenizers, causing subtle and inaccurate output when inference is performed on a platform that is disjointed from the trainer.
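The validation described above amounts to an encode/decode round-trip check. A minimal sketch with a toy character-level tokenizer (`validate_roundtrip` and the toy `encode`/`decode` are hypothetical helpers, not the tokenicer API):

```python
# Toy character-level tokenizer used only to illustrate round-trip validation.
def encode(text, vocab):
    return [vocab[ch] for ch in text]

def decode(ids, inv_vocab):
    return "".join(inv_vocab[i] for i in ids)

def validate_roundtrip(text, vocab):
    """Re-validate actual encode/decode behavior, not just the config."""
    inv_vocab = {i: ch for ch, i in vocab.items()}
    return decode(encode(text, vocab), inv_vocab) == text

vocab = {ch: i for i, ch in enumerate("abcdefgh ")}
print(validate_roundtrip("bad cafe", vocab))  # True
```

A real implementation would run such probes against reference strings at model load and refuse to proceed on mismatch.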
Install
PIP/UV
pip install -v tokenicer
uv pip install -v tokenicer
Install from source
# clone repo
git clone https://github.com/ModelCloud/Tokenicer.git && cd Tokenicer
# compile
pip install -v .
Usage
- Replace all calls to `AutoTokenizer.from_pretrained()` with `Tokenicer.load()`: args are 100% compatible with `AutoTokenizer`
# Replace `AutoTokenizer.from_pretrained()`
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained('Qwen/Qwen2.5-0.5B-Instruct')
# With `Tokenicer.load()`
from tokenicer import Tokenicer
# Returns `Tokenicer` instance that inherits original `Qwen2TokenizerFast` type.
tokenizer = Tokenicer.load('Qwen/Qwen2.5-0.5B-Instruct')
# That's it! Toke(n)icer has auto-fixed Qwen2.5-0.5B-Instruct's incorrect `pad_token`.
# Now this model can be `trained` and `inferenced` correctly with `batch` and `masks`.
# Now use the new tokenizer like any normal HF PretrainedTokenizer(Fast)
print(f"pad_token: `{tokenizer.pad_token}`")
- If you already have a loaded or composite config, pass it directly so Tokenicer can normalize the resolved text config in-place:
tokenizer = Tokenicer.load(tokenizer, model_config=model.config)
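The corrected `pad_token` is what makes batching work: sequences of unequal length must be padded to a common width with an attention mask marking the padded positions. A plain-Python sketch (hypothetical token ids, not the tokenicer API) of what the tokenizer produces under the hood:

```python
# Hypothetical dedicated pad id for illustration.
PAD_ID = 0

def pad_batch(batch, pad_id=PAD_ID):
    """Right-pad a batch of token-id sequences and build attention masks."""
    width = max(len(seq) for seq in batch)
    ids = [seq + [pad_id] * (width - len(seq)) for seq in batch]
    mask = [[1] * len(seq) + [0] * (width - len(seq)) for seq in batch]
    return ids, mask

ids, mask = pad_batch([[5, 8, 13], [7, 9]])
print(ids)   # [[5, 8, 13], [7, 9, 0]]
print(mask)  # [[1, 1, 1], [1, 1, 0]]
```

With a pad id distinct from eos, the mask cleanly separates real tokens (including EOS) from padding.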
Citation
@misc{tokenicer,
author = {ModelCloud.ai and qubitium@modelcloud.ai},
title = {Toke(n)icer},
year = {2025},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/modelcloud/tokenicer}},
note = {Contact: qubitium@modelcloud.ai}
}
Release history
- 0.0.12 (Mar 26, 2026)
- 0.0.11 (Mar 15, 2026)
- 0.0.10 (Mar 15, 2026)
- 0.0.9 (Mar 15, 2026)
- 0.0.8 (Mar 06, 2026)
- 0.0.7 (Mar 03, 2026)
- 0.0.6 (Feb 09, 2026)
- 0.0.5 (Sep 04, 2025)
- 0.0.4 (Feb 21, 2025)
- 0.0.3 (Feb 21, 2025)
- 0.0.2 (Feb 10, 2025)
- 0.0.1 (Feb 10, 2025)
- 0.0.1.dev0 (Feb 08, 2025)