segments 2.3.0


Install with pip:

$ pip install segments

Released: Feb 20, 2025


Meta
Author: Steven Moran and Robert Forkel
Requires Python: >=3.8

Classifiers

Development Status
  • 5 - Production/Stable

Intended Audience
  • Developers
  • Science/Research

Natural Language
  • English

Operating System
  • OS Independent

Programming Language
  • Python :: 3
  • Python :: 3.9
  • Python :: 3.10
  • Python :: 3.11
  • Python :: 3.12
  • Python :: 3.13
  • Python :: Implementation :: CPython
  • Python :: Implementation :: PyPy

License
  • OSI Approved :: Apache Software License

segments

The segments package provides Unicode Standard tokenization routines and orthography segmentation. It implements the linear algorithm described in the orthography profile specification from The Unicode Cookbook (Moran and Cysouw 2018).

Command line usage

Create a text file:

$ echo "aäaaöaaüaa" > text.txt

Now generate an orthography profile from the text:

$ cat text.txt | segments profile
Grapheme        frequency       mapping
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
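
As a rough illustration of what such a profile contains, the Python sketch below counts characters and emits grapheme, frequency and mapping rows. It is not the package's implementation: the function name draft_profile is hypothetical, and it assumes that every grapheme is a single code point after NFC normalization, which does not hold for text with combining marks that have no precomposed form.

```python
import unicodedata
from collections import Counter


def draft_profile(text):
    """Return (grapheme, frequency, mapping) rows, most frequent first."""
    # Assumption: after NFC normalization one code point equals one grapheme.
    text = unicodedata.normalize("NFC", text)
    # Skip whitespace such as the trailing newline written by echo.
    counts = Counter(ch for ch in text if not ch.isspace())
    # By default each grapheme maps to itself.
    return [(g, n, g) for g, n in counts.most_common()]


rows = draft_profile("aäaaöaaüaa\n")
print("Grapheme\tfrequency\tmapping")
for grapheme, frequency, mapping in rows:
    print(f"{grapheme}\t{frequency}\t{mapping}")
```

The order of rows with equal frequency is arbitrary in this sketch and may differ from the order shown above.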

Write the profile to a file:

$ cat text.txt | segments profile > profile.prf

Edit the profile, here by adding a row that maps the digraph aa to x, and display the result:

$ more profile.prf
Grapheme        frequency       mapping
aa      0       x
a       7       a
ä       1       ä
ü       1       ü
ö       1       ö
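
A profile like this is a small table with a header row, so it can be read with standard tooling. The sketch below loads it into a dictionary from grapheme to mapping using Python's csv module. It assumes the columns are tab-separated; PROFILE_TSV and read_profile are hypothetical names used only for illustration, not part of the package.

```python
import csv
import io

# Hypothetical in-memory copy of profile.prf (assumed to be tab-separated).
PROFILE_TSV = (
    "Grapheme\tfrequency\tmapping\n"
    "aa\t0\tx\n"
    "a\t7\ta\n"
    "ä\t1\tä\n"
    "ü\t1\tü\n"
    "ö\t1\tö\n"
)


def read_profile(stream):
    """Return {grapheme: mapping} from a tab-separated profile with a header."""
    reader = csv.DictReader(stream, delimiter="\t")
    return {row["Grapheme"]: row["mapping"] for row in reader}


mapping = read_profile(io.StringIO(PROFILE_TSV))
print(mapping["aa"])  # x
```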

Now tokenize the text without a profile:

$ cat text.txt | segments tokenize
a ä a a ö a a ü a a

And with the profile:

$ cat text.txt | segments --profile=profile.prf tokenize
a ä aa ö aa ü aa

To output the values of the mapping column instead of the graphemes:

$ cat text.txt | segments --mapping=mapping --profile=profile.prf tokenize
a ä x ö x ü x
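
The profile-based output above is consistent with a greedy, left-to-right longest match against the graphemes in the profile. The following is a minimal sketch of that idea, not the package's own code: greedy_segment and the dictionary-based profile are hypothetical, and replacing unmatched characters with U+FFFD is an assumption made here.

```python
def greedy_segment(text, graphemes, unknown="\ufffd"):
    """Split text into the longest matching graphemes, scanning left to right."""
    max_len = max((len(g) for g in graphemes), default=1)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that 'aa' wins over 'a'.
        for n in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + n]
            if candidate in graphemes:
                tokens.append(candidate)
                i += n
                break
        else:
            # Assumption: characters not covered by the profile are replaced.
            tokens.append(unknown)
            i += 1
    return tokens


# Hypothetical profile: grapheme -> mapping, mirroring the edited profile.prf.
profile = {"aa": "x", "a": "a", "ä": "ä", "ü": "ü", "ö": "ö"}

tokens = greedy_segment("aäaaöaaüaa", profile)
print(" ".join(tokens))                           # a ä aa ö aa ü aa
print(" ".join(profile.get(t, t) for t in tokens))  # a ä x ö x ü x
```

The first a stays a single segment because the character after it is ä, so the digraph aa cannot match at that position.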

API

>>> from segments import Profile, Tokenizer
>>> t = Tokenizer()
>>> t('abcd')
'a b c d'
>>> prf = Profile({'Grapheme': 'ab', 'mapping': 'x'}, {'Grapheme': 'cd', 'mapping': 'y'})
>>> print(prf)
Grapheme	mapping
ab	x
cd	y
>>> t = Tokenizer(profile=prf)
>>> t('abcd')
'ab cd'
>>> t('abcd', column='mapping')
'x y'

Dependencies
  • regex
  • csvw (>=1.5.6)