Unicode (formally, the Unicode Standard) is a framework for encoding, representing, and managing text on digital systems. Maintained by the Unicode Consortium, Unicode currently defines nearly 150,000 characters across a wide variety of scripts and symbols. Unicode assigns each character a code point, a unique number, but it's essential to remember that Unicode itself is not an encoding; it's a map of code points that encodings like UTF-8, UTF-16, and UTF-32 then translate into bytes.
As of Unicode 15.0, the standard includes 161 scripts, some used by only one language and others shared by many. Unicode covers both modern and historic scripts, including those of extinct languages.
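A quick Python sketch makes the distinction concrete: the code point of a character never changes, but each encoding turns it into a different sequence of bytes.

ch = "🙏"                      # one character, one code point
print(hex(ord(ch)))            # -> 0x1f64f (the code point, independent of any encoding)
print(ch.encode("utf-8"))      # -> b'\xf0\x9f\x99\x8f' (4 bytes)
print(ch.encode("utf-16-le"))  # -> b'=\xd8O\xde' (4 bytes: a surrogate pair)
print(ch.encode("utf-32-le"))  # -> b'O\xf6\x01\x00' (4 bytes: the code point itself)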
It’s common to refer to characters as "glyphs," but technically a glyph is the visual representation of a character and can vary depending on the font.
In Unicode, a script is a set of characters and symbols used for writing one or more languages. For example:
Some scripts, like Armenian, represent only one language, while others, like Latin, are used by multiple languages, including English, French, and Vietnamese.
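As a rough illustration, Python's standard unicodedata module exposes each character's official name, which begins with the script it belongs to:

import unicodedata

print(unicodedata.name("a"))  # -> LATIN SMALL LETTER A
print(unicodedata.name("ա"))  # -> ARMENIAN SMALL LETTER AYB
print(unicodedata.name("Ω"))  # -> GREEK CAPITAL LETTER OMEGA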
Before Unicode, text encoding was fragmented into character encodings specific to particular languages or regions (e.g., ASCII, ISO 8859, and Windows-1252). These encodings often conflicted, so text transferred between systems that used different schemes frequently came out garbled.
For example, the single byte 0xE9 means 'é' in Windows-1252 but 'щ' in ISO 8859-5, so the same file could appear garbled on a system that assumed a different encoding.
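Here is a quick Python sketch of that mismatch, using codecs that ship with the standard library:

raw = "café".encode("cp1252")   # b'caf\xe9'
print(raw.decode("cp1252"))     # -> café (the intended text)
print(raw.decode("iso8859-5"))  # -> cafщ (same bytes read as Cyrillic)
print(raw.decode("iso8859-7"))  # -> cafι (same bytes read as Greek)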
Unicode was created to unify these encoding standards into a single, comprehensive system to represent most of the world's languages.
When dealing with text as bytes, knowing the encoding used to generate those bytes is crucial. Text without encoding context is meaningless to a computer, as bytes are just binary data until interpreted by an encoding scheme.
This is especially true in I/O operations. When a text file is sent from computer A in Japan to computer B in the USA, we need the correct encoding to interpret the byte stream properly.
Encoding is often specified by context: an HTTP Content-Type header (e.g., charset=utf-8), an HTML <meta charset="utf-8"> tag, an XML declaration, or an explicit parameter such as Python's open(path, encoding="utf-8").
When encoding isn’t specified, heuristics can help, but they are unreliable. Always specify encoding to avoid misinterpretation.
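In Python, for example, it's safer to pass the encoding explicitly when reading or writing text files instead of relying on the platform default (the file name here is just illustrative):

with open("greeting.txt", "w", encoding="utf-8") as f:  # write the text as UTF-8 bytes
    f.write("Hello, Unicode! 🌍")

with open("greeting.txt", encoding="utf-8") as f:       # decode the same bytes back
    print(f.read())                                     # -> Hello, Unicode! 🌍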
UTF-8 is a variable-length encoding that represents Unicode code points in 1 to 4 bytes. It’s backward-compatible with ASCII, meaning ASCII text is also valid UTF-8 text.
Example: 'A' (U+0041) fits in one byte, 'é' (U+00E9) takes two, '€' (U+20AC) takes three, and '🙏' (U+1F64F) takes four.
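You can see the variable width directly in Python by encoding each character and counting the bytes:

for ch in "Aé€🙏":
    print(ch, hex(ord(ch)), len(ch.encode("utf-8")), "UTF-8 byte(s)")
# A  0x41     1 UTF-8 byte(s)
# é  0xe9     2 UTF-8 byte(s)
# €  0x20ac   3 UTF-8 byte(s)
# 🙏 0x1f64f  4 UTF-8 byte(s)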
UTF-8 is widely used on the web and is often the default encoding in systems, programming languages, and applications.
In most applications, UTF-8 is the best choice due to its compatibility and efficiency.
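A rough sketch of that efficiency: for mostly-ASCII text, UTF-8 uses far fewer bytes than the wider fixed-width encodings.

text = "Hello, Unicode!"              # 15 characters, all ASCII
print(len(text.encode("utf-8")))      # -> 15 bytes (1 byte per ASCII character)
print(len(text.encode("utf-16-le")))  # -> 30 bytes
print(len(text.encode("utf-32-le")))  # -> 60 bytes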
Why not just stick with ASCII? Well, ASCII only covers 128 characters—barely enough for basic English. Unicode, however, aims to cover all text systems in a single, universal standard, eliminating the confusion and compatibility issues of past encoding schemes.
In the early 90s, the internet accelerated Unicode’s adoption as more text needed to be exchanged globally. Unicode incorporates characters from legacy encodings (e.g., Windows-1252) to ensure backward compatibility.
"Unicode is 16-bit and has 65,536 characters." Incorrect. Unicode is not limited to 16 bits. While early Unicode proposals suggested a 16-bit system, Unicode today spans over 1 million code points (0 to 0x10FFFF).
"Unicode is an encoding." Also incorrect. Unicode is a standard, not an encoding. UTF-8, UTF-16, and UTF-32 are encoding formats that define how to represent Unicode code points in bytes.
Handling Unicode effectively requires knowing which encoding produced the bytes you read, which encoding you use when you write, and the difference between code points, the bytes that encode them, and the glyphs a font renders.
Pro-tip: Modern systems default to UTF-8, but misinterpretations can still occur, often with ISO-8859-1 (latin1) or cp1252 on legacy systems.
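For instance, the tell-tale 'Ã©'-style mojibake appears when UTF-8 bytes are decoded with a latin1 or cp1252 assumption:

data = "café".encode("utf-8")  # b'caf\xc3\xa9'
print(data.decode("utf-8"))    # -> café
print(data.decode("cp1252"))   # -> cafÃ© (the classic mojibake)
print(data.decode("latin1"))   # -> cafÃ©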
Here's a Python example to illustrate basic Unicode encoding and decoding.
![](/graphics/blog/10-minutes-of-unicode/string_to_x.png)
print(chr(0x1F64F)) # -> 🙏 (the character at code point U+1F64F)
# chr() creates characters from code points.
# We could do chr(0xFF) or chr(33)
print(hex(ord("🙏"))) # -> 0x1f64f
# ord() is the reverse operation, giving us the base-10 code point for a specific character
[ord(i) for i in "hello world"] # -> [104, 101, 108, 108, 111, 32, 119, 111, 114, 108, 100]
# We can write the same character in several equivalent ways
r""" Escape Sequence Meaning How To Express "a"
"\ooo" Character with octal value ooo "\141"
"\xhh" Character with hex value hh "\x61"
"\N{name}" Character named name in the Unicode database "\N{LATIN SMALL LETTER A}"
"\uxxxx" Character with 16-bit (2-byte) hex value xxxx "\u0061"
"\Uxxxxxxxx" Character with 32-bit (4-byte) hex value xxxxxxxx "\U00000061"
"""
print(
"a" ==
"\x61" ==
"\N{LATIN SMALL LETTER A}" ==
"\u0061" ==
"\U00000061"
) # Prints `True` as they are all identical
# bin(), hex() and oct() only change how a number is displayed, not its underlying value.
>>> import unicodedata
>>> unicodedata.name("€")
'EURO SIGN'
>>> unicodedata.lookup("EURO SIGN")
'€'
str(b"\xc2\xbc cup of flour", "utf-8") # -> '¼ cup of flour'
str(0xc0ffee) # -> '12648430'
text = "Hello, Unicode! 🌍"
utf8_encoded = text.encode('utf-8')
print(f"UTF-8 Encoded: {utf8_encoded}")
utf8_decoded = utf8_encoded.decode('utf-8')
print(f"Decoded Text: {utf8_decoded}")
emoji = "🤨" # U+1F928
print(f"Emoji bytes in UTF-8: {emoji.encode('utf-8')}")
Encoding text with UTF-8 converts characters into byte representations, which is crucial for I/O operations.
Working with Unicode can seem complex, but the main takeaway is this: Unicode standardizes the representation of text across different languages and platforms. Encodings (like UTF-8) implement Unicode by translating its code points into bytes. Each character we see is not just a letter or symbol but a defined code point, encoded into bytes, and then rendered as a glyph.