
Unicode Normalizer 0.1.2 Scripts 4.0 Community

Submitted by user Goutte; MIT; 2023-12-01

Toolkit for removing diacritics and substitutable characters from Unicode strings.

Provides a UnicodeNormalizer singleton that normalizes your Unicode strings by:

- removing diacritics (decomposing, then keeping only the first character)
- substituting fallback characters

It is also:

- blazingly fast (lookups use binary search)
- lightweight
- extensible
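To make "decomposing, then keeping only the first character" concrete, here is a minimal sketch of the idea in Python, outside the engine. It is not the addon's code: it relies on Python's standard `unicodedata` module instead of the addon's own replacement database, and `strip_diacritics` is a hypothetical name chosen for illustration.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Decompose each character (NFD) and keep only its first code point."""
    out = []
    for ch in text:
        # "è" decomposes to "e" followed by U+0300 (combining grave accent).
        decomposed = unicodedata.normalize("NFD", ch)
        # Keep the base character, drop the combining marks that follow it.
        out.append(decomposed[0])
    return "".join(out)


print(strip_diacritics("Dès Noël, où un zéphyr haï me vêt"))
# Des Noel, ou un zephyr hai me vet
```

Characters without a decomposition pass through unchanged, since their NFD form is the character itself.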

Its replacement database is built from the official unicode.org data and is only about 16 KiB.
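The binary search mentioned above can be pictured as a lookup in a table sorted by code point. The Python sketch below is only an illustration under assumptions: the four-entry table, the `lookup` and `normalize_with_table` names, and the `œ` to `oe` substitution are hypothetical and do not reflect the addon's actual data layout.

```python
from bisect import bisect_left
from typing import Optional

# Hypothetical excerpt of a replacement table, sorted by code point.
KEYS = [0x00E0, 0x00E8, 0x00E9, 0x0153]  # à, è, é, œ
VALUES = ["a", "e", "e", "oe"]


def lookup(code_point: int) -> Optional[str]:
    """Binary-search the sorted keys; return the replacement or None."""
    i = bisect_left(KEYS, code_point)
    if i < len(KEYS) and KEYS[i] == code_point:
        return VALUES[i]
    return None


def normalize_with_table(text: str) -> str:
    """Replace every character found in the table, keep the others as-is."""
    out = []
    for ch in text:
        replacement = lookup(ord(ch))
        out.append(ch if replacement is None else replacement)
    return "".join(out)


print(normalize_with_table("cœur été"))
# coeur ete
```

Two parallel sorted arrays keep the table compact, and each lookup costs O(log n) comparisons.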

Usage example:

You can use the `normalize` method on the autoload singleton `UnicodeNormalizer`:

UnicodeNormalizer.normalize("Dès Noël, où un zéphyr haï me vêt")
# "Des Noel, ou un zephyr hai me vet"

You can also exclude some characters from the normalization by removing them from the mapping:

# Characters listed here keep their diacritics.
var allowed_decomposables := "éàè"
for i in allowed_decomposables.length():
	UnicodeNormalizer.mapping.remove_decomposable(allowed_decomposables.unicode_at(i))

Finally, the UnicodeNormalizer is made to be extended, in order to adapt to specific needs.

