Bytes, Characters and Python 2


Moving from Python 2 to 3? Here's what you need to know about strings and their role in your upgrade.

An old joke asks "What do you call someone who speaks three languages? Trilingual. Two languages? Bilingual. One language? American."

Now that I've successfully enraged all of my American readers, I can get to the point: because so many computer technologies were developed in English-speaking countries, and particularly in the United States, the needs of other languages often were left out of early computer technologies. The standard established in the 1960s for translating numbers into characters (and back), known as ASCII (the American Standard Code for Information Interchange), took into account all of the letters, numbers and symbols needed to work with English. And that's all that it could handle, given that it was a seven-bit (that is, 128-character) encoding.
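To make the number-to-character mapping concrete, here is a minimal Python 3 sketch. The helper name `fits_in_ascii` is my own invention for illustration, not something from the standard library; `ord`, `chr` and `str.encode` are standard built-ins.

```python
# Python 3 sketch: ASCII maps each character to a number from 0 to 127.
code = ord("A")          # character -> number
char = chr(code)         # number -> character
ascii_size = 2 ** 7      # seven bits give 128 possible values


def fits_in_ascii(text):
    """Hypothetical helper: True if every character in text is ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


print(code, char, ascii_size)
print(fits_in_ascii("hello"), fits_in_ascii("héllo"))
```

The accented "é" has no slot among those 128 values, which is why the second check fails.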

If you're willing to ignore accented letters, ASCII can sort of, kind of, work with other languages, as well—but the moment you want to work with another character set, such as Chinese or Hebrew, you're out of luck. Variations on ASCII, such as ISO-8859-x (with a number of values for "x"), solved the problem to a limited degree, but there were numerous issues with that system.
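One of those issues is easy to see for yourself: the same byte stands for a different character depending on which variant you assume. The following Python 3 sketch decodes a single byte three ways; the three variants are just an illustrative choice on my part.

```python
raw = bytes([0xE9])  # one byte, value 233, outside the ASCII range

# Illustrative choice of three ISO-8859 variants; each maps this byte
# to a different character (Latin, Cyrillic and Hebrew, respectively).
variants = ("iso-8859-1", "iso-8859-5", "iso-8859-8")
decoded = {name: raw.decode(name) for name in variants}

for name, ch in decoded.items():
    print(f"{name}: {ch!r} (U+{ord(ch):04X})")
```

Unless the sender and receiver agree on the variant ahead of time, there's no way to tell from the byte alone which character was meant.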

Unicode gives each character, in every language around the globe, a unique number (known as a code point). This allows you to represent (just about) every character in every language. The problem is how you can represent those numbers using bytes. After all, at the end of the day, bytes are still how data is stored to and read from filesystems, how data is represented in memory and how data is transmitted over a network. In many languages and operating systems, the encoding used is UTF-8. This ingenious system uses different numbers of bytes for different characters. Characters that appear in ASCII continue to use a single byte. Some other character sets (for example, Arabic, Greek, Hebrew and Russian) use two bytes per character. Others (such as Chinese) use three bytes per character, and still others (including most emojis) use four.
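You can check the byte counts yourself in Python 3, where `str.encode` turns characters into bytes. This is a small sketch; the sample characters are my own picks, one from each size class.

```python
# Python 3 sketch: how many bytes does UTF-8 need for each character?
samples = ["a", "é", "ש", "中", "😀"]  # illustrative picks, not exhaustive

utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"U+{ord(ch):04X} {ch!r}: {n} byte(s)")
```

The ASCII letter takes one byte, the accented Latin and Hebrew letters take two, the Chinese character takes three, and the emoji takes four.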

In a modern programming language, you shouldn't have to worry about this stuff too much. If you get input from the filesystem, the user or the network, it should just come as characters. How many bytes each character needs is an implementation detail that you can (or should be able to) ignore.

Why do I mention this? Because a growing number of my clients have begun to upgrade from Python 2 to Python 3. Yes, Python 3 has been around for a decade already, but a combination of some massive improvements in the most recent versions and the knowledge that only 18 months remain before Python 2 reaches its end of life is leading many companies to realize, "Gee, maybe we finally should upgrade."

The major sticking point for many of them? The bytes vs. characters issue.