Fast and Reasonably Accurate Word Tokenizer for Thai
Meta
Author: Pattarawat Chormai et al.
Classifiers
Natural Language
- Thai
Programming Language
- Python :: 3
License
- OSI Approved :: MIT License
Operating System
- OS Independent
AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai
What does AttaCut look like?

TL;DR: 3-Layer Dilated CNN on syllable and character features. It’s 6x faster than DeepCut (SOTA) while its WL-f1 on BEST is 91%, only 2% lower.
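To make "dilated CNN" concrete, here is a minimal pure-Python sketch of a 1-D dilated convolution and of how the receptive field grows when such layers are stacked. This is an illustration only, not AttaCut's model code: the kernel sizes and dilation rates used below are hypothetical, and `dilated_conv1d` and `receptive_field` are names made up for this example.

```python
from typing import List, Sequence


def dilated_conv1d(x: Sequence[float], kernel: Sequence[float], dilation: int = 1) -> List[float]:
    """Apply a 1-D dilated convolution with zero padding ("same" output length)."""
    center = len(kernel) // 2
    out = []
    for i in range(len(x)):
        acc = 0.0
        for j, weight in enumerate(kernel):
            # Kernel taps are spaced `dilation` positions apart.
            idx = i + (j - center) * dilation
            if 0 <= idx < len(x):  # positions outside the input count as zero
                acc += weight * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_sizes: Sequence[int], dilations: Sequence[int]) -> int:
    """Receptive field of stacked stride-1 convolutions: 1 + sum((k - 1) * d)."""
    return 1 + sum((k - 1) * d for k, d in zip(kernel_sizes, dilations))


if __name__ == "__main__":
    # Hypothetical settings, chosen only for illustration.
    print(dilated_conv1d([1, 2, 3, 4, 5], [1, 1, 1], dilation=2))
    print(receptive_field([3, 3, 3], [1, 2, 4]))
```

Each output position sees inputs that are `dilation` steps apart, so a few stacked layers can cover a wide context with few parameters, which is one reason such models can be fast.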
Installation
$ pip install attacut
Remark: Windows users need to install PyTorch before running the command above. Please consult PyTorch.org for more details.
Usage
Command-Line Interface
$ attacut-cli -h
AttaCut: Fast and Reasonably Accurate Word Tokenizer for Thai
Usage:
  attacut-cli <src> [--dest=<dest>] [--model=<model>]
  attacut-cli [-v | --version]
  attacut-cli [-h | --help]

Arguments:
  <src>             Path to input text file to be tokenized

Options:
  -h --help         Show this screen.
  --model=<model>   Model to be used [default: attacut-sc].
  --dest=<dest>     If not specified, it'll be <src>-tokenized-by-<model>.txt
  -v --version      Show version
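If you want to drive the CLI from a Python script, a small helper that assembles the argument list can be handy. The sketch below uses only the options listed in the help text above; `build_attacut_command` is a hypothetical helper written for this example, not part of AttaCut, and the file names are placeholders.

```python
from typing import List, Optional


def build_attacut_command(
    src: str, dest: Optional[str] = None, model: Optional[str] = None
) -> List[str]:
    """Build an argv list for `attacut-cli` from the documented options."""
    cmd = ["attacut-cli", src]
    if dest is not None:
        cmd.append(f"--dest={dest}")
    if model is not None:
        cmd.append(f"--model={model}")
    return cmd


if __name__ == "__main__":
    # Placeholder file names; omitted options fall back to the CLI defaults.
    print(build_attacut_command("input.txt", dest="output.txt", model="attacut-c"))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)`, provided `attacut-cli` is installed and on your `PATH`.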
High-Level API
from attacut import tokenize, Tokenizer

# example input; replace with your own Thai text
txt = "ทดสอบการตัดคำภาษาไทย"

# tokenize `txt` using our best model `attacut-sc`
words = tokenize(txt)

# alternatively, an AttaCut tokenizer can be instantiated directly, allowing
# one to specify whether to use `attacut-sc` or `attacut-c`.
atta = Tokenizer(model="attacut-sc")
words = atta.tokenize(txt)
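For tokenizing many lines, it may be convenient to instantiate the tokenizer once and reuse it. The following is a minimal sketch: `tokenize_lines` is a hypothetical helper, and the `|` separator is an arbitrary choice for this example. It accepts any callable that maps a string to a list of words, so `atta.tokenize` from the snippet above should fit.

```python
from typing import Callable, Iterable, List


def tokenize_lines(
    lines: Iterable[str],
    tokenize_fn: Callable[[str], List[str]],
    sep: str = "|",
) -> List[str]:
    """Tokenize each line and join its words with `sep`; blank lines stay blank."""
    results = []
    for line in lines:
        text = line.strip()
        if not text:
            results.append("")
            continue
        results.append(sep.join(tokenize_fn(text)))
    return results


if __name__ == "__main__":
    # Whitespace splitting stands in for a real tokenizer here.
    # With AttaCut installed, one would pass `Tokenizer(model="attacut-sc").tokenize`.
    print(tokenize_lines(["hello world\n", "\n", "foo bar baz"], str.split))
```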
Benchmark Results
Below are brief summaries. More details can be found on our benchmarking page.
Tokenization Quality
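As a rough illustration of a word-level F1 score (WL-f1), the sketch below compares the character spans of predicted words against those of the reference words. This is one common span-based definition and is an assumption on our part here; the exact evaluation used for the reported numbers is described on the benchmarking page. `word_spans` and `word_level_f1` are names made up for this example.

```python
from typing import List, Set, Tuple


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end))
        start = end
    return spans


def word_level_f1(gold: List[str], pred: List[str]) -> float:
    """F1 over words whose start and end boundaries both match the reference."""
    if "".join(gold) != "".join(pred):
        raise ValueError("gold and pred must segment the same text")
    gold_spans = word_spans(gold)
    pred_spans = word_spans(pred)
    correct = len(gold_spans & pred_spans)
    if correct == 0:
        return 0.0
    precision = correct / len(pred_spans)
    recall = correct / len(gold_spans)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Toy Latin-script example; the same logic applies to Thai text.
    print(word_level_f1(["ab", "c", "de"], ["ab", "cde"]))
```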
Speed
Retraining on Custom Dataset
Please refer to our retraining page.
Related Resources
Acknowledgements
This repository was initially developed by Pattarawat Chormai while interning at Dr. Attapol Thamrongrattanarit's NLP Lab, Chulalongkorn University, Bangkok, Thailand. Many people have been involved in this project. A complete list of names can be found on the Acknowledgement page.
Mar 13, 2020: 1.1.0.dev0
Nov 21, 2019: 1.0.6
Nov 21, 2019: 1.0.6.dev0
Oct 18, 2019: 1.0.5
Oct 01, 2019: 1.0.4
Oct 01, 2019: 1.0.4.dev0
Oct 01, 2019: 1.0.3
Oct 01, 2019: 1.0.3.dev0
Sep 08, 2019: 1.0.2
Sep 08, 2019: 1.0.2.dev0
Sep 01, 2019: 1.0.1
Sep 01, 2019: 1.0.0
Aug 30, 2019: 0.0.6.dev0
Aug 30, 2019: 0.0.5.dev0
Aug 29, 2019: 0.0.4.dev0
Aug 25, 2019: 0.0.3.dev0
Aug 25, 2019: 0.0.2.dev0
Extras:
None
Dependencies:
- docopt (>=0.6.2)
- fire (>=0.1.3)
- nptyping (>=0.2.0)
- numpy (>=1.17.0)
- pyyaml (>=5.1.2)
- six (>=1.12.0)
- ssg (>=0.0.4)
- torch (>=1.2.0)