Unicode: A Primer

This one word causes developers to cringe, managers to demand it, and customers to ask why Chinese isn’t being rendered in the application.

When it comes time to support Unicode in an application, the typical developer will check Stack Overflow to see if there are any language-isms to aid their quest to enable that cool 👍 (thumbs up, if it doesn’t render in your browser) emoji.

The Problem

Here’s where things start to get a bit more confusing: do you want UTF-8? UTF-16? UTF-32? A quick Google search lands you with an answer of: “Uh, not sure. Depends.” Maybe you want to support multiple platforms. Maybe you need multiple applications to interact over the network. iOS vs. Android. What to choose?

The Lowdown on Different Unicode Encodings

  • UTF-8: Uses 8-bit code units; a code point takes 1, 2, 3, or 4 bytes, covering every character up to U+10FFFF. Supported by most browsers, operating systems, and programming languages worldwide; the de facto choice when asked to develop with Unicode in mind.
  • UTF-16: Uses 16-bit code units; a code point takes 2 or 4 bytes, covering every character up to U+10FFFF. Languages like Java and C# use UTF-16 internally, as do Windows and OS X.
  • UTF-32: Uses 32-bit code units; every code point takes exactly 4 bytes, covering every character up to U+10FFFF. Don’t use this. Just kidding; there just isn’t a lot of usage or support for this encoding. As a matter of fact, the W3C’s HTML5 spec says not to use UTF-32. (The snippet after this list compares the size of the same text in all three.)
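To make the size trade-off concrete, here is a quick sketch of my own (assuming java.nio.charset.Charset and StandardCharsets are imported, and that the JRE provides the optional "UTF-32" charset, which is not part of StandardCharsets):

String text = "A\u00a3\uD83D\uDC4D"; // 'A' (ASCII), '£' (Latin-1 Supplement), and 👍 (SMP)
System.out.println(text.getBytes(StandardCharsets.UTF_8).length);    // 7 bytes (1 + 2 + 4)
System.out.println(text.getBytes(StandardCharsets.UTF_16BE).length); // 8 bytes (2 + 2 + 4)
System.out.println(text.getBytes(Charset.forName("UTF-32")).length); // 12 bytes (4 + 4 + 4)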

The Case for UTF-8

When I started to tackle this problem, I didn’t know which encoding to choose, but one thing made UTF-8 a stronger choice than the other encodings: UTF-8 maintains compatibility with the ASCII character set. That meant a lot when dealing with “legacy code” that referred to ASCII character codes. Additionally, UTF-8 is a variable-width encoding, so the letter ‘A’ is still just 1 byte in UTF-8.
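A quick way to see that compatibility (a sketch, assuming java.util.Arrays and java.nio.charset.StandardCharsets are imported):

byte[] ascii = "Hello".getBytes(StandardCharsets.US_ASCII);
byte[] utf8 = "Hello".getBytes(StandardCharsets.UTF_8);
System.out.println(Arrays.equals(ascii, utf8)); // true: ASCII text is byte-for-byte valid UTF-8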

Unicode Facts

Back to the core of Unicode: Unicode is separated into 17 planes, which allows for 1,114,112(!) code points (i.e. characters or letters). That is a lot of characters, but the advantage is that pretty much any character, modern or ancient, can be represented with Unicode, with room to spare for those dank emojis. For the rest of this discussion, we will only be talking about the Basic Multilingual Plane (BMP) and the Supplementary Multilingual Plane (SMP).

The BMP covers code points U+0000-U+FFFF and handles most modern languages. The SMP covers code points U+10000-U+1FFFF, covering some ancient languages and emojis. The BMP also reserves a code point range for surrogates (we will come back to this concept), which lets pairs of 16-bit code units represent code points beyond the BMP (U+10000 and above).

Implementation (Java)

Ok, neat, a lot of details, but I just need to support Chinese, so what do I do next? Well, this part is specific to Java, but it’s similar in many languages. Java’s char data type is 2 bytes long; each char is a UTF-16 code unit, not a full code point. The Character class has some great static methods like isHighSurrogate/isLowSurrogate, and its nested UnicodeBlock/UnicodeScript classes can identify the block a character comes from.
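For example (a small sketch of my own, not part of the original examples):

int pound = "\u00a3".codePointAt(0);          // U+00A3, a BMP character
int thumbsUp = "\uD83D\uDC4D".codePointAt(0); // U+1F44D, an SMP character (a surrogate pair in the string)
System.out.println(Character.isBmpCodePoint(pound));              // true
System.out.println(Character.isSupplementaryCodePoint(thumbsUp)); // true
System.out.println(Character.UnicodeBlock.of(thumbsUp));          // MISCELLANEOUS_SYMBOLS_AND_PICTOGRAPHS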

String source = "\u00a3\u00a5"; 
// 2 Latin-1 chars == 4 bytes in UTF-8 (GBP [£] and Yen [¥] signs)
System.out.println("Source: " + source + " is " + source.getBytes(StandardCharsets.UTF_8).length + " bytes long."); 
// will output "Source: £¥ is 4 bytes long."
source = "\uD83D\uDC4D"; 
// 1 Miscellaneous Symbols and Pictographs character (a 2-char surrogate pair) == 4 bytes in UTF-8
System.out.println("Source: " + source + " is " + source.getBytes(StandardCharsets.UTF_8).length + " bytes long."); 
// will output "Source: 👍 is 4 bytes long."

One thing to keep in mind is that the length of the string (2) and the number of bytes in the string (4) will now differ, unlike with ASCII, where they always match.
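And for characters outside the BMP, the char count and the code point count diverge as well. A quick sketch (continuing the snippet above):

String emoji = "\uD83D\uDC4D";
System.out.println(emoji.length());                                // 2 chars (one surrogate pair)
System.out.println(emoji.codePointCount(0, emoji.length()));       // 1 code point
System.out.println(emoji.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8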

Writing to file:

// try-with-resources flushes and closes the writer; without it the bytes may never reach the file
try (BufferedOutputStream os = new BufferedOutputStream(new FileOutputStream("some_unicode_file.txt"));
     BufferedWriter writer = new BufferedWriter(new OutputStreamWriter(os, StandardCharsets.UTF_8))) {
    writer.write("\u00a3\u00a5");
}

When reading in files:

// decode the bytes back using the same charset; try-with-resources closes the streams when done
try (BufferedInputStream is = new BufferedInputStream(new FileInputStream("some_unicode_file.txt"));
     BufferedReader reader = new BufferedReader(new InputStreamReader(is, StandardCharsets.UTF_8))) {
    String line;
    while ((line = reader.readLine()) != null) {
        System.out.println(line);
    }
}

A lot of preface to something that is pretty simple, no? Honestly, moving from ASCII to UTF-8 becomes difficult when changing an existing code base. Code that does a bounds check for the ASCII range (0-127) is where the conversion may backfire. Keep in mind the inputs and outputs of your application; these are usually the problem areas when implementing Unicode support, since the programming language itself can usually handle whatever you give it.
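As an illustration of the kind of check that backfires (a hypothetical isPrintable helper, not from any real code base):

static boolean isPrintable(char c) {
    return c >= 32 && c <= 126; // fine for pure ASCII input, but rejects '£', '中', and both halves of any surrogate pair
}

Once the input is decoded Unicode text rather than raw ASCII, a check like this silently drops perfectly valid characters.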

Appendix: Surrogates

So what’s up with surrogates? Why even mention them? While tackling my ASCII to UTF-8 conversion on some Android projects, I realized I could add emoji support as well! As I started researching which code points emojis live in, I ran into a serious problem.

Emojis live in the Miscellaneous Symbols and Pictographs, Supplemental Symbols and Pictographs, Emoticons, and Transport and Map Symbols blocks of the SMP (U+1F300-U+1F9FF). This means that 2 Java chars are needed to represent a single emoji, due to Java’s internal char encoding of UTF-16. Surrogate pairs are made up of a high (upper) and a low (lower) 16-bit code unit that combine into a single Unicode code point. In the Java representation of a thumbs up, combining the surrogates U+D83D and U+DC4D yields U+1F44D (the code point for the thumbs up emoji).

The math: 0x10000 + (<High Surrogate> − 0xD800) × 0x400 + (<Low Surrogate> − 0xDC00)

0x10000 + (0xD83D − 0xD800) × 0x400 + (0xDC4D − 0xDC00) = 0x1F44D
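You don’t have to do that arithmetic by hand; the Character class does the same math for you (a minimal sketch):

char high = '\uD83D';
char low = '\uDC4D';
System.out.println(Character.isHighSurrogate(high)); // true
System.out.println(Character.isLowSurrogate(low));   // true
System.out.println(Integer.toHexString(Character.toCodePoint(high, low))); // "1f44d"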

I wanted a length filter like InputFilter.LengthFilter in Android for UTF-8 strings and found a gem in the AOSP source code for just that. I made some tweaks to allow surrogate pairs to be handled and voilà, I can limit the length of a UTF-8 string with emojis (or any other characters from the SMP).
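The tweaked filter itself is Android-specific, but the core idea is small enough to sketch here (my own simplified version, not the AOSP code): count code points instead of chars, and never cut a surrogate pair in half.

static String limitCodePoints(String s, int maxCodePoints) {
    if (s.codePointCount(0, s.length()) <= maxCodePoints) {
        return s; // already within the limit
    }
    int end = s.offsetByCodePoints(0, maxCodePoints); // always a code point boundary, never mid-pair
    return s.substring(0, end);
}

Calling limitCodePoints("\uD83D\uDC4D\uD83D\uDC4D", 1) returns one whole thumbs up instead of a stranded high surrogate.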

Hopefully this served as a good primer to the world of Unicode!
