Why you need to normalize Unicode strings

In the first “Zoë”, the ë character (e with umlaut) was represented by a single Unicode code point, while in the second case it was in its decomposed form: an e followed by a combining diaeresis. Likewise, the dog face emoji is a single code point (U+1F436), but when encoded it can be represented by multiple byte sequences: four bytes (F0 9F 90 B6) in UTF-8, or a pair of surrogate code units (D83D DC36) in UTF-16.
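
A minimal sketch of how both encodings can be inspected from JavaScript, assuming a runtime that provides TextEncoder (modern browsers and Node.js):

    const dog = '🐶';  // U+1F436 DOG FACE, a single code point

    // UTF-8: four bytes
    console.log(new TextEncoder().encode(dog));   // Uint8Array(4) [240, 159, 144, 182] = F0 9F 90 B6

    // UTF-16: two 16-bit code units, a surrogate pair
    console.log(dog.length);                      // 2
    console.log(dog.charCodeAt(0).toString(16));  // 'd83d'
    console.log(dog.charCodeAt(1).toString(16));  // 'dc36'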

In a JavaScript source file, the following three statements print the same result, filling your console with lots of puppies.
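
A sketch of those statements, assuming the dog face emoji is written as a pasted-in literal, as a code point escape, and as an explicit UTF-16 surrogate pair:

    console.log('🐶');            // the emoji pasted in as a literal character
    console.log('\u{1F436}');     // a Unicode code point escape
    console.log('\uD83D\uDC36');  // the equivalent UTF-16 surrogate pair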

Most JavaScript interpreters (including Node.js and modern browsers) use UTF-16 internally. For example, the letter ë could be represented using either the single precomposed code point U+00EB or the decomposed sequence U+0065 U+0308 (an e followed by a combining diaeresis).

The two representations look the same when rendered, but they do not compare as equal, and the resulting strings have different lengths.
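
A minimal sketch of the comparison, using hypothetical variable names; String.prototype.normalize(), which defaults to the NFC form, brings both strings to the same representation:

    const zoe1 = 'Zo\u00EB';   // precomposed: ë is a single code point
    const zoe2 = 'Zoe\u0308';  // decomposed: e followed by a combining diaeresis

    console.log(zoe1, zoe2);                // Zoë Zoë (they render identically)
    console.log(zoe1 === zoe2);             // false
    console.log(zoe1.length, zoe2.length);  // 3 4

    // Normalizing both strings before comparing fixes the mismatch
    console.log(zoe1.normalize() === zoe2.normalize());  // true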

Source: withblue.ink