Writing software correctly is hard; don’t let anyone fool you into thinking it’s not. I’ve been doing this for over 10 years and I still managed to mess up something as “basic” as comparing strings for equality in a side project of mine that deals extensively with written text (Markdown).

The Unicode standard allows certain (visually) identical characters to be represented in different ways. For example, the character ä may be represented as a single precomposed code point, “Latin Small Letter A with Diaeresis” (U+00E4), or by the combination of “Latin Small Letter A” (U+0061) followed by “Combining Diaeresis” (U+0308). [1] The semantic meaning and visual representation are exactly the same, but the underlying code points are different.
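To make the difference concrete, here is a minimal Python sketch of the two spellings of ä described above. The variable names are mine, chosen purely for illustration.

```python
# The two spellings of "ä": one precomposed code point versus
# a base letter followed by a combining mark.
precomposed = "\u00e4"   # LATIN SMALL LETTER A WITH DIAERESIS
decomposed = "a\u0308"   # LATIN SMALL LETTER A + COMBINING DIAERESIS

# They render the same, but a plain comparison looks at code points.
print(precomposed == decomposed)           # False
print(len(precomposed), len(decomposed))   # 1 2

# The UTF-8 encodings differ too (bytes.hex with a separator needs Python 3.8+).
print(precomposed.encode("utf-8").hex(" "))  # c3 a4
print(decomposed.encode("utf-8").hex(" "))   # 61 cc 88
```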

There’s a defined process for normalizing these representations to a canonical form (such as NFC or NFD), described in Unicode® Standard Annex #15, which can be used to turn these two representations into the exact same form so they compare as equal (with some further caveats).
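As a rough sketch of what that looks like in practice, Python's standard library exposes these normalization forms through `unicodedata.normalize`. The helper name `nfc_equal` is hypothetical. The ligature example at the end is my own pick of one possible caveat, not necessarily the one meant above.

```python
import unicodedata

precomposed = "\u00e4"   # ä as a single code point
decomposed = "a\u0308"   # a + combining diaeresis


def nfc_equal(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


print(nfc_equal(precomposed, decomposed))                       # True
print(unicodedata.normalize("NFD", precomposed) == decomposed)  # True

# Canonical forms (NFC/NFD) leave "compatibility" characters alone.
# The "fi" ligature (U+FB01) only becomes "fi" under NFKC/NFKD.
ligature = "\ufb01"
print(unicodedata.normalize("NFC", ligature) == "fi")    # False
print(unicodedata.normalize("NFKC", ligature) == "fi")   # True
```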

I’d read about it a long time ago, but had entirely forgotten this was a thing by the time I implemented the relevant string-comparison logic in obsidian-export (which got fixed by applying Unicode normalization before doing comparisons [2]).
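The actual fix in obsidian-export is written in Rust and is not reproduced here. As an illustration of the general idea only, this is how a normalize-then-compare lookup that also ignores case might look in Python. The names `caseless_key` and `caseless_equal` are hypothetical, and the normalize, case-fold, normalize sequence is one commonly suggested recipe rather than a guaranteed-correct one.

```python
import unicodedata


def caseless_key(text: str) -> str:
    """Build a comparison key: NFD-normalize, case-fold, then NFD-normalize again.

    Hypothetical helper. The second normalization is there because
    case-folding can produce output that is no longer normalized.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return unicodedata.normalize("NFD", decomposed.casefold())


def caseless_equal(a: str, b: str) -> bool:
    """Compare two strings while ignoring case and canonical-equivalent spellings."""
    return caseless_key(a) == caseless_key(b)


# Upper-case precomposed Ä versus lower-case decomposed ä.
print(caseless_equal("\u00c4", "a\u0308"))   # True
# Case-folding also maps the German sharp s to "ss".
print(caseless_equal("Stra\u00dfe", "STRASSE"))  # True
# A naive lower() comparison misses the first case.
print("\u00c4".lower() == "a\u0308")         # False
```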

Many more people will likely step into this pitfall after me, but sharing this will hopefully prevent a few of those cases. 🙂 While as an industry we’re getting better at dealing with non-ASCII text and many other sorts of localization issues, I do feel that we don’t pay enough attention to this problem in an approachable, beginner-friendly manner.

Language design

It’s also interesting to look at this problem from an API design perspective. This happened in Rust, a low-level systems programming language that, by design, expects you to know and care about these intricacies.

High-level programming languages could choose to trade off some performance and provide abstractions over these classes of problems. In Python, for example, strings are immutable sequences of Unicode code points. [3] Python could choose to apply Unicode normalization on string construction, requiring you to opt out by using special syntax or dedicated functions to create non-normalized strings when you don’t want that default behavior.
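Python does not work this way today, but the idea can be approximated in user code. The following is a thought-experiment sketch with a hypothetical `NormalizedStr` class. Both the class and the choice of NFC as the default are my assumptions, and plain `str` remains the opt-out.

```python
import unicodedata


class NormalizedStr(str):
    """Hypothetical str subclass that NFC-normalizes its value on construction."""

    def __new__(cls, value=""):
        # str is immutable, so normalization has to happen in __new__.
        return super().__new__(cls, unicodedata.normalize("NFC", str(value)))


a = NormalizedStr("\u00e4")    # precomposed ä
b = NormalizedStr("a\u0308")   # decomposed ä, normalized on the way in

print(a == b)                    # True
print("\u00e4" == "a\u0308")     # False: plain str is the "opt-out"

# Limitation of this sketch: most str operations return a plain str again,
# so the result of e.g. concatenation is not re-normalized.
print(type(a + b) is str)        # True
```

A built-in version would presumably have to normalize after every operation that produces a new string, which is where the performance trade-off mentioned above would come from.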

Would this be desirable? That’s certainly debatable, but it would bring the advantage of doing “the expected thing” in more cases where people are simply oblivious to the original problem.

(Another thing that might help is raising awareness of this problem in, for example, the documentation for functions and comparison operators. The same could be said for constant-time string comparison in security-sensitive contexts.)
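For the constant-time case, Python's standard library offers `hmac.compare_digest`. Below is a small sketch combining it with normalization. The helper name `secure_equal` is hypothetical, and whether secrets should be normalized at all depends on the application, so treat that part as an assumption.

```python
import hmac
import unicodedata


def secure_equal(a: str, b: str) -> bool:
    """Compare two strings without short-circuiting on the first differing byte.

    Hypothetical helper. The inputs are NFC-normalized and UTF-8 encoded first,
    because hmac.compare_digest only accepts bytes-like objects or ASCII-only str.
    """
    a_bytes = unicodedata.normalize("NFC", a).encode("utf-8")
    b_bytes = unicodedata.normalize("NFC", b).encode("utf-8")
    return hmac.compare_digest(a_bytes, b_bytes)


print(secure_equal("p\u00e4ssword", "pa\u0308ssword"))  # True
print(secure_equal("p\u00e4ssword", "password"))        # False
```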

  1. When encoded with UTF-8, these are represented as the two bytes 0xC3, 0xA4 and the three bytes 0x61, 0xCC, 0x88, respectively.

  2. Did I mention correctness is hard? I’m not even 100% confident I’m doing case-folding entirely correctly for all situations here.

  3. Which is itself a somewhat curious choice; various other languages went with representing strings internally as UTF-8 encoded data.