Development Status
- 5 - Production/Stable
Operating System
- OS Independent
Programming Language
- Python :: 3
- C
- License:
MIT License
1. Introduction
polyleven is a Pythonic Levenshtein distance library that:
Is fast regardless of input types, and hence can be applied to both short (like English words) and long inputs (like DNA sequences).
Is stand-alone with no external dependencies.
Is distributed under the MIT License, hence can be used freely in private projects.
2. How to install
The official package is available on PyPI:
$ pip install polyleven
3. How to use
Polyleven provides a single interface function levenshtein(). You can use this function to measure the similarity of two strings.
>>> from polyleven import levenshtein
>>> levenshtein('aaa', 'ccc')
3
If you only care about distances under a certain threshold, you can pass the max threshold to the third argument.
>>> levenshtein('acc', 'ccc', 1)
1
>>> levenshtein('aaa', 'ccc', 1)
2
In general, you can gain a noticeable speed boost with threshold \(k < 3\).
4. Benchmark
4.1 English Words
To compare Polyleven with other Pythonic edit distance libraries, a million word pairs was generated from SCOWL.
Each library was measured how long it takes to evaluate all of these words. The following table summarises the result:
Function Name |
TIME[sec] |
SPEED[pairs/s] |
|---|---|---|
edlib |
4.763 |
208216 |
editdistance |
1.943 |
510450 |
jellyfish.levenshtein_distance |
0.722 |
1374081 |
distance.levenshtein |
0.623 |
1591396 |
Levenshtein.distance |
0.500 |
1982764 |
polyleven.levenshtein |
0.431 |
2303420 |
4.2. Longer Inputs
To evaluate the efficiency for longer inputs, I created 5000 pairs of random strings of size 16, 32, 64, 128, 256, 512 and 1024.
Each library was measured how fast it can process these entries. [1]
Library |
N=16 |
N=32 |
N=64 |
N=128 |
N=256 |
N=512 |
N=1024 |
|---|---|---|---|---|---|---|---|
edlib |
0.040 |
0.063 |
0.094 |
0.205 |
0.432 |
0.908 |
2.089 |
editdistance |
0.027 |
0.049 |
0.086 |
0.178 |
0.336 |
0.740 |
58.139 |
jellyfish |
0.009 |
0.032 |
0.118 |
0.470 |
1.874 |
8.877 |
42.848 |
distance |
0.007 |
0.029 |
0.109 |
0.431 |
1.726 |
6.950 |
27.998 |
Levenshtein |
0.006 |
0.022 |
0.085 |
0.336 |
1.328 |
5.286 |
21.097 |
polyleven |
0.003 |
0.005 |
0.010 |
0.043 |
0.149 |
0.550 |
2.109 |
4.3. List of Libraries
Library |
Version |
URL |
|---|---|---|
edlib |
v1.2.1 |
|
editdistance |
v0.4 |
|
jellyfish |
v0.5.6 |
|
distance |
v0.1.3 |
|
Levenshtein |
v0.12 |
|
polyleven |
v0.3 |