Unicode Normalizer 0.1.2 Scripts 4.0 Community
Submitted by user Goutte; MIT; 2023-12-01
Toolkit for removing diacritics and substitutable characters from Unicode strings.
Provides a `UnicodeNormalizer` singleton that normalizes your Unicode strings by:
- removing diacritics (decomposing, then keeping only the first character)
- substituting fallback characters
It is also:
- fast (lookups use binary search)
- lightweight
- extensible
Its replacement database is built from the official unicode.org data and is only about 16 KiB.
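To illustrate the binary-search idea, here is a minimal sketch of a lookup over a sorted table of code points. It is not the addon's actual code or data: the `_sources` and `_targets` arrays and the `strip_diacritics` function are hypothetical names with a tiny hand-written table, and the sketch assumes the input uses precomposed characters.

```gdscript
extends RefCounted

# Hypothetical table, NOT the addon's real database.
# _sources must stay sorted in ascending order for bsearch to work.
var _sources := PackedInt32Array([0x00E0, 0x00E8, 0x00E9, 0x00EF, 0x00F4])  # à è é ï ô
var _targets := PackedInt32Array([0x0061, 0x0065, 0x0065, 0x0069, 0x006F])  # a e e i o


# Replaces every character found in _sources with its counterpart in _targets.
func strip_diacritics(text: String) -> String:
	var result := ""
	for i in text.length():
		var code := text.unicode_at(i)
		# bsearch returns an insertion index, so check for an exact match.
		var index := _sources.bsearch(code)
		if index < _sources.size() and _sources[index] == code:
			result += String.chr(_targets[index])
		else:
			result += String.chr(code)
	return result
```

With this table, calling `strip_diacritics("été")` on a precomposed string would yield `"ete"`. Each character costs one logarithmic-time lookup, which is why a sorted table can stay both small and fast.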
Usage example:
You can use the `normalize` method on the autoload singleton `UnicodeNormalizer`:
UnicodeNormalizer.normalize("Dès Noël, où un zéphyr haï me vêt")
# "Des Noel, ou un zephyr hai me vet"
You can also exclude some characters from the normalization by removing them from the mapping:
var allowed_decomposables := "éàè"
for i in allowed_decomposables.length():
	UnicodeNormalizer.mapping.remove_decomposable(allowed_decomposables.unicode_at(i))
Finally, the UnicodeNormalizer is made to be extended, in order to adapt to specific needs.