PRECIS, the next step in Unicode validation

PRECIS (Preparation, Enforcement, and Comparison of Internationalized Strings) is a framework for consistent and secure management of Unicode strings in web applications.

My previous article, Input validation of free-form Unicode text in Python, contained the problem statement and a low-level solution based on Unicode character categories; it is worth reading first if you haven’t already. PRECIS goes one step further by proposing specific string classes that represent typical usage scenarios for processing Unicode strings.

PRECIS starts from just two use cases: strings used as identifiers, which subsequently end up in URIs and databases, and free-form text. For identifiers, one of the most challenging problems is reliable comparison. For example, are “ŻÓBR” and “ŻÓBR” the same username or group name? They look identical in most fonts and displays, and both could have been typed in good faith by the same user on different keyboards, yet they are composed of different code points.

First, using a non-combining keyboard:

> import unicodedata
> x='ŻÓBR'
> for c in x: print(f'{c}: {unicodedata.name(c)}')
Ż: LATIN CAPITAL LETTER Z WITH DOT ABOVE
Ó: LATIN CAPITAL LETTER O WITH ACUTE
B: LATIN CAPITAL LETTER B
R: LATIN CAPITAL LETTER R

Second, using letters followed by combining accents:

> x='Z\u0307O\u0301BR'
> x
'ŻÓBR'
> for c in x: print(f'{c}: {unicodedata.name(c)}')
Z: LATIN CAPITAL LETTER Z
: COMBINING DOT ABOVE
O: LATIN CAPITAL LETTER O
: COMBINING ACUTE ACCENT
B: LATIN CAPITAL LETTER B
R: LATIN CAPITAL LETTER R

The usual byte-by-byte (or code point by code point) comparison will report these two strings as different, and if you’re not careful your application will allow the creation of visually identical usernames that are assigned to distinct user objects. In my previous article (Input validation of free-form Unicode text in Python) I suggested using Unicode normalisation to always convert such canonically equivalent forms into a single, consistent one.
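As a minimal sketch of that normalisation step, the snippet below compares the two spellings of the example username before and after NFC normalisation. It uses only the standard `unicodedata` module; the helper name `same_identifier` is made up for this illustration and is not part of PRECIS or any library.

```python
import unicodedata

# The same visible username, built from different code points.
precomposed = '\u017b\u00d3BR'     # single code points for the accented letters
decomposed = 'Z\u0307O\u0301BR'    # base letters followed by combining accents

# A naive comparison sees two different strings.
print(precomposed == decomposed)


def same_identifier(a, b):
    """Hypothetical helper: compare two strings after NFC normalisation."""
    return unicodedata.normalize('NFC', a) == unicodedata.normalize('NFC', b)


# After normalisation both spellings collapse into the same sequence.
print(same_identifier(precomposed, decomposed))
```

Normalising once, at the point where the identifier enters the application, and storing only the normalised form keeps every later comparison a plain string comparison.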

PRECIS

The two string classes proposed by PRECIS are IdentifierClass and FreeformClass, and their names describe their purpose. Each is defined by a carefully selected combination of character categories that are allowed (such as letters, digits and spaces), others that are disallowed (e.g. characters that change text direction), additional contextual rules, and a policy for code points that are not yet assigned in the current version of Unicode.

As you can guess, the rules for IdentifierClass are much stricter, while those for FreeformClass are much more lax and permissive. Not surprisingly, Unicode normalisation (specifically, NFC) is an important part of these transformations. On top of these basic string classes you can build your own string profiles that reflect your application’s data objects more accurately.
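To make the idea of a strict and a lax string class more tangible, here is a deliberately simplified sketch built on Unicode general categories. It is not the PRECIS algorithm: the category sets, the function names (`toy_enforce`, `toy_identifier`, `toy_freeform`) and the error type are assumptions made for illustration, and the real framework adds contextual rules, handling of unassigned code points and more.

```python
import unicodedata

# Assumption: rough approximations of the character sets, not the real PRECIS tables.
LETTER_DIGITS = {'Lu', 'Ll', 'Lo', 'Lm', 'Mn', 'Mc', 'Nd'}
FREEFORM_EXTRA = {'Zs', 'Sm', 'Sc', 'Sk', 'So',
                  'Pc', 'Pd', 'Ps', 'Pe', 'Pi', 'Pf', 'Po'}


def toy_enforce(value, allowed_categories):
    """Normalise to NFC, then reject any character outside the allowed set."""
    normalized = unicodedata.normalize('NFC', value)
    for position, char in enumerate(normalized):
        if '\u0021' <= char <= '\u007e':
            continue  # printable ASCII is accepted by both toy classes
        category = unicodedata.category(char)
        if category not in allowed_categories:
            raise ValueError(
                f'U+{ord(char):04X} ({category}) at position {position} is not allowed')
    return normalized


def toy_identifier(value):
    """Strict toy class: letters, digits and combining marks only."""
    return toy_enforce(value, LETTER_DIGITS)


def toy_freeform(value):
    """Lax toy class: additionally allows spaces, symbols and punctuation."""
    return toy_enforce(value, LETTER_DIGITS | FREEFORM_EXTRA)


for candidate in ('Pawe\u0142', 'Pawe\u0142 Krawczyk', 'file.\u202etxt.exe'):
    for check in (toy_identifier, toy_freeform):
        try:
            print(check.__name__, repr(check(candidate)))
        except ValueError as error:
            print(check.__name__, 'rejected:', error)
```

In this toy version the space is what separates the two classes, while the direction-changing control character (category Cf) is rejected by both. For real applications, use an actual PRECIS implementation rather than a hand-rolled category list.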

For example, the Python library precis-i18n implements UsernameCasePreserved (strict) and NicknameCasePreserved (lax). Here’s what happens when you pass my name through both of them. First, the nickname profile, apparently intended for a name displayed on a profile but not used in identifiers:

> import precis_i18n
> precis_i18n.get_profile('NicknameCasePreserved').enforce('Paweł Krawczyk')
'Paweł Krawczyk'

However, let’s try to embed the infamous U+202E text direction changing control character in the nickname:

> precis_i18n.get_profile('NicknameCasePreserved').enforce('file.\u202etxt.exe')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
…
UnicodeEncodeError: 'NicknameCasePreserved' codec can't encode character '\u202e' in position 5: DISALLOWED/precis_ignorable_properties

The profile rejected the control character in a nickname. Let’s now see how my name is treated by the stricter UsernameCasePreserved profile:

> precis_i18n.get_profile('UsernameCasePreserved').enforce('Paweł Krawczyk')
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
…
UnicodeEncodeError: 'UsernameCasePreserved' codec can't encode character '\x20' in position 5: DISALLOWED/spaces

This is because the space U+0020 is disallowed in this particular profile, apparently designed for address-like identifiers such as email or XMPP addresses, neither of which allows spaces (PRECIS was developed by the smart folks from Jabber.org, the precursor of the XMPP messaging protocol).

However, PRECIS profiles are not only useful for usernames. It also makes sense to apply them to passwords, for example to prevent users from embedding trailing end-of-line characters, tabs, U+202E and other confusing characters that would prevent them from entering the password correctly again.

> precis_i18n.get_profile('OpaqueString').enforce('ucei=The4e-iy5am=3iemoo')
'ucei=The4e-iy5am=3iemoo'

A “proper” high-entropy password therefore passes through the PRECIS OpaqueString profile unchanged. The same profile will, however, reject a password candidate with U+202E embedded, as well as an empty password or a password containing control characters (such as tab).
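In application code you would typically catch the rejection instead of letting the traceback propagate. The sketch below assumes, as in the examples above, that a rejected string makes `enforce()` raise `UnicodeEncodeError`. The wrapper `enforce_or_none` is a hypothetical helper written for this article, not part of precis-i18n, and the import is guarded so that the snippet still loads where the library is not installed.

```python
def enforce_or_none(profile, value):
    """Hypothetical helper: return the enforced string, or None if rejected."""
    try:
        return profile.enforce(value)
    except UnicodeEncodeError:
        return None


try:
    from precis_i18n import get_profile
except ImportError:
    get_profile = None  # precis-i18n is not installed; skip the demonstration

if get_profile is not None:
    opaque = get_profile('OpaqueString')
    candidates = (
        'ucei=The4e-iy5am=3iemoo',  # ordinary high-entropy password
        'pass\u202eword',           # embedded text direction override
        '',                         # empty password
        '\t\t',                     # control characters only
    )
    for candidate in candidates:
        print(repr(candidate), '->', repr(enforce_or_none(opaque, candidate)))
```

Returning None keeps the calling code simple, at the cost of hiding the reason for the rejection; if you want to tell the user what was wrong, keep the exception message instead.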

References