DocForge
Programmer's Wiki

Unicode

From DocForge

Unicode is an industry standard designed to allow text and symbols from all of the writing systems of the world to be consistently represented and manipulated by computers. Developed in tandem with the Universal Character Set standard and published in book form as The Unicode Standard, Unicode consists of a character repertoire, an encoding methodology and set of standard character encodings, a set of code charts for visual reference, an enumeration of character properties such as upper and lower case, a set of reference data computer files, and rules for normalization, decomposition, collation and rendering.

Unicode has the explicit aim of transcending the limitations of traditional character encodings, such as those defined by the ISO 8859 standard, which find wide usage in various countries of the world but remain largely incompatible with each other. Many traditional character encodings share a common problem: they allow bilingual computer processing (usually using Roman characters and the local language) but not multilingual computer processing (computer processing of arbitrary languages mixed with each other).

Unicode's success at unifying character sets has led to its widespread and predominant use in the internationalization and localization of computer software. The standard has been implemented in many technologies, including XML, Java, and modern operating systems.

Implementation

Unicode encodes the underlying characters, graphemes and grapheme-like units, rather than the variant glyphs (renderings) for such characters. In the case of Chinese characters, this sometimes leads to controversies over distinguishing the underlying character from its variant glyphs (see Han unification).

In text processing, Unicode takes the role of providing a unique number, not a glyph, for each character. Therefore Unicode represents a character in an abstract way and leaves the visual rendering (size, shape, font or style) to other software, such as a web browser or word processor. This simple aim becomes complicated, however, by concessions made by Unicode's designers in the hope of encouraging a more rapid adoption of Unicode.
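
The idea of a character as a number rather than a glyph can be illustrated with a short Python sketch. The helper name `describe` is made up for this example; `ord`, `chr` and `unicodedata.name` are standard Python.

```python
import unicodedata

def describe(ch: str) -> str:
    """Return the code point and official name of a single character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"

# The number identifies the character; how it is drawn is left to fonts.
print(describe("A"))        # code point 0x41
print(describe("\u00E9"))   # e with acute, code point 0xE9
print(chr(0x4E2D))          # the character assigned to code point U+4E2D
```

Nothing in the output depends on size, shape or font, which is the separation between character and rendering described above.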

The first 256 code points were made identical to the content of ISO 8859-1 so as to make it trivial to convert existing Western text. Many essentially identical characters were encoded multiple times at different code points to preserve distinctions used by legacy encodings and therefore allow conversion from those encodings to Unicode (and back) without losing any information. For example, the "fullwidth forms" section of code points encompasses a full Latin alphabet that is separate from the main Latin alphabet section. In Chinese, Japanese and Korean (CJK) fonts, these characters are rendered at the same width as CJK ideographs rather than at half the width. For other examples, see Duplicate characters in Unicode.
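
Both concessions can be checked from Python. This is only a sketch, and the variable names are illustrative. It decodes every ISO 8859-1 byte and compares the resulting code points with the byte values, then shows that a fullwidth letter is a separate code point that compatibility normalization (NFKC) folds back to the ordinary letter.

```python
import unicodedata

# Every ISO 8859-1 byte value decodes to the code point with the same number.
latin1_text = bytes(range(256)).decode("latin-1")
latin1_matches = all(ord(ch) == i for i, ch in enumerate(latin1_text))
print(latin1_matches)

# U+FF21 is a fullwidth "A", encoded separately from U+0041.
fullwidth_a = "\uFF21"
print(unicodedata.name(fullwidth_a))

# NFKC replaces compatibility characters with their ordinary equivalents.
folded = unicodedata.normalize("NFKC", fullwidth_a)
print(folded == "A")
```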

Also, while Unicode allows for combining characters, it also contains precomposed versions of most letter/diacritic combinations in normal use. These make conversion to and from legacy encodings simpler and allow applications to use Unicode as an internal text format without having to implement combining characters. The letter é, for example, can be represented in Unicode as U+0065 (Latin small letter e) followed by U+0301 (combining acute), but it can also be represented as the precomposed character U+00E9 (Latin small letter e with acute).
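
The two spellings of this letter compare as different strings until they are normalized. The sketch below uses Python's standard `unicodedata.normalize`; the variable names are chosen for the example.

```python
import unicodedata

decomposed = "e\u0301"    # U+0065 followed by U+0301 (combining acute)
precomposed = "\u00E9"    # U+00E9 (precomposed e with acute)

# Different code point sequences, so a plain comparison says "not equal".
print(decomposed == precomposed, len(decomposed), len(precomposed))

# NFC composes where a precomposed form exists; NFD decomposes.
composed = unicodedata.normalize("NFC", decomposed)
split = unicodedata.normalize("NFD", precomposed)
print(composed == precomposed, split == decomposed)
```

Normalizing both sides to the same form before comparing or hashing is one common way to treat the two spellings as the same text.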

The Unicode standard also includes a number of related items, such as character properties, text normalisation forms and bidirectional display order (for the correct display of text containing both right-to-left scripts, such as Arabic or Hebrew, and left-to-right scripts).
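
Some of these properties are exposed by Python's standard `unicodedata` module. The helper `properties` below is a hypothetical name used only for this sketch.

```python
import unicodedata

def properties(ch: str) -> dict:
    """Collect a few Unicode character properties for one character."""
    return {
        "name": unicodedata.name(ch),
        "category": unicodedata.category(ch),   # e.g. Lu = uppercase letter
        "bidi": unicodedata.bidirectional(ch),  # e.g. L = left-to-right, R = right-to-left
    }

print(properties("A"))        # Latin capital letter
print(properties("a"))        # Latin small letter
print(properties("\u05D0"))   # Hebrew letter alef, a right-to-left character
```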

Processing Unicode

The internal logic of much 8-bit legacy software typically permits only 8 bits for each character, making it impossible to use more than 256 code points without special processing. Sixteen-bit software can support only some tens of thousands of characters. Unicode, on the other hand, has already defined more than 100,000 encoded characters. Systems designers have therefore suggested several mechanisms for implementing Unicode; which one implementers choose depends on available storage space, source code compatibility, and interoperability with other systems.

Unicode defines two mapping methods:

  • UTF (Unicode Transformation Format) encodings
  • UCS (Universal Character Set) encodings

The encodings include:

  • UTF-7 — a relatively unpopular 7-bit encoding, often considered obsolete
  • UTF-8 — an 8-bit, variable-width encoding, which maximizes compatibility with ASCII.
  • UTF-EBCDIC — an 8-bit variable-width encoding, which maximizes compatibility with EBCDIC.
  • UCS-2 — a 16-bit, fixed-width encoding that only supports the BMP, considered obsolete
  • UTF-16 — a 16-bit, variable-width encoding
  • UCS-4 and UTF-32 — functionally identical 32-bit fixed-width encodings

The numbers in the names of the encodings indicate the number of bits in one code value (for UTF encodings) or the number of bytes per code value (for UCS encodings). UTF-8 and UTF-16 are probably the most commonly used encodings.

UTF-8 uses one to four bytes per code point and, being compact for Latin scripts and ASCII-compatible, provides the de facto standard encoding for interchange of Unicode text. It is also used by most recent Linux distributions as a direct replacement for legacy encodings in general text handling.
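
The variable width and the ASCII compatibility can both be seen in a few lines of Python. The sample characters are arbitrary choices for this illustration.

```python
samples = {
    "A": "\u0041",          # ASCII letter
    "e-acute": "\u00E9",    # Latin-1 range
    "euro": "\u20AC",       # elsewhere in the Basic Multilingual Plane
    "emoji": "\U0001F600",  # supplementary plane
}

# Number of bytes each character occupies in UTF-8.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
print(utf8_lengths)

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8.
ascii_text = "plain ASCII"
same_bytes = ascii_text.encode("utf-8") == ascii_text.encode("ascii")
print(same_bytes)
```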

UCS-2 is an obsolete, 16-bit fixed-width encoding covering the Basic Multilingual Plane only. For characters in the Basic Multilingual Plane (16-bit range), UCS-2 and UTF-16 are identical, so they can be considered different implementation levels of the same encoding.

The UCS-2 and UTF-16 encodings specify the Unicode Byte Order Mark (BOM) for use at the beginning of text files, where it may be used to detect byte order (endianness). Some software developers have adopted it for other encodings, including UTF-8, which does not need an indication of byte order. In that case it serves only to mark the file as containing Unicode text.

The BOM, code point U+FEFF, has the important property of being unambiguous under byte reordering, regardless of the Unicode encoding used: U+FFFE (the result of byte-swapping U+FEFF) is not a legal character, and U+FEFF anywhere other than the beginning of text conveys the zero-width no-break space (a character with no appearance and no effect other than preventing a line break at its position). Also, the bytes FE and FF never appear in UTF-8. The same character converted to UTF-8 becomes the byte sequence EF BB BF.
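
A minimal sketch of BOM-based detection follows, using the constants in Python's standard `codecs` module. The function name `sniff_bom` is hypothetical, and the UTF-32 checks are an addition for completeness that the text above does not discuss. A file without a BOM simply yields `None`, and real detection code would need a fallback for that case.

```python
import codecs
from typing import Optional

# UTF-32 is checked first because its little-endian mark (FF FE 00 00)
# begins with the UTF-16 little-endian mark (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

def sniff_bom(data: bytes) -> Optional[str]:
    """Return the encoding suggested by a leading BOM, or None if there is none."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name
    return None

# U+FEFF encoded three ways: EF BB BF, FE FF, and FF FE.
print("\uFEFF".encode("utf-8"), "\uFEFF".encode("utf-16-be"), "\uFEFF".encode("utf-16-le"))
print(sniff_bom(b"\xff\xfeh\x00i\x00"))
```

Python's `utf-8-sig` codec strips a leading EF BB BF on decoding, which is convenient when reading files written by tools that add the mark to UTF-8.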

UTF-16 is similar to UCS-2 but can include one or two 16-bit words in order to cover the supplementary characters (introduced from Unicode 3.1 onwards). UTF-16 is used by many APIs, often for upward compatibility with APIs that were developed when Unicode was UCS-2 based, or for compatibility with other APIs that use UTF-16. UTF-16 is the standard format for the Windows API (though surrogate support is not enabled by default) and for the Java (J2SE 1.5 or higher) and .NET bytecode environments.
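
The "one or two 16-bit words" rule can be sketched as follows. The function name `to_surrogates` is made up for this example; the arithmetic is the standard surrogate-pair calculation, and the result is checked against Python's own UTF-16 encoder.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only code points above the BMP need a surrogate pair")
    offset = code_point - 0x10000          # 20 significant bits
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low

high, low = to_surrogates(0x1F600)
print(hex(high), hex(low))

# A BMP character takes one 16-bit word, a supplementary character takes two.
bmp_units = len("\u00E9".encode("utf-16-be")) // 2
supplementary_units = len("\U0001F600".encode("utf-16-be")) // 2
print(bmp_units, supplementary_units)
```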

In UTF-32 and UCS-4, one 32-bit code value serves as a fairly direct representation of any character's code point (although the endianness, which varies across platforms, affects how the code value actually appears as an octet sequence). In the other encodings, each code point may be represented by a variable number of code values. UTF-32 is widely used as an internal representation of text in programs (as opposed to stored or transmitted text), since Unix systems that build software with the gcc compilers use it as the standard "wide character" encoding. Recent versions of the Python programming language (beginning with 2.2) may also be configured to use UTF-32 as the representation for Unicode strings, which spreads the encoding into software written in high-level languages.
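
Because every code value is exactly four bytes, decoding UTF-32 amounts to reading fixed-size integers in the right byte order. The helper `utf32_code_points` is a hypothetical name for this sketch, which uses only the standard `struct` module.

```python
import struct

def utf32_code_points(data: bytes, byteorder: str = "big") -> list:
    """Read BOM-less UTF-32 data as a list of integer code points."""
    if len(data) % 4:
        raise ValueError("UTF-32 data must be a multiple of 4 bytes")
    prefix = ">" if byteorder == "big" else "<"
    return list(struct.unpack(prefix + "I" * (len(data) // 4), data))

text = "A\U0001F600"
big = text.encode("utf-32-be")
little = text.encode("utf-32-le")

# Same code points, different octet order on the wire.
print(big.hex(), little.hex())
print([hex(cp) for cp in utf32_code_points(big, "big")])
```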

Punycode, another encoding form, enables the encoding of Unicode strings into the limited character set supported by the ASCII-based Domain Name System. The encoding is used as part of IDNA, which is a system enabling the use of Internationalized Domain Names in all languages that are supported by Unicode.
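
Python ships `punycode` and `idna` codecs that can illustrate the idea. Note that the built-in `idna` codec follows the older IDNA 2003 rules, so this is a sketch of the mechanism rather than a complete IDNA implementation. The domain `example` label is a placeholder.

```python
label = "b\u00FCcher"   # "bücher", a label containing one non-ASCII letter

# Raw Punycode: the ASCII letters, a hyphen, then the encoded non-ASCII part.
raw = label.encode("punycode")
print(raw)

# IDNA adds the "xn--" prefix to each encoded label of a domain name.
domain = (label + ".example").encode("idna")
print(domain)

# The mapping is reversible.
restored = domain.decode("idna")
print(restored == label + ".example")
```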

GB18030 is another encoding form for Unicode, from the Standardization Administration of China. It is the official character set of the People's Republic of China (PRC).
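
Python includes a `gb18030` codec, so the encoding's relationship to Unicode can be tried directly. This is a small illustration with arbitrarily chosen sample characters.

```python
# ASCII stays one byte, a common Chinese character takes two bytes,
# and a supplementary-plane character takes four.
ascii_bytes = "A".encode("gb18030")
hanzi_bytes = "\u4E2D".encode("gb18030")        # the character U+4E2D
emoji_bytes = "\U0001F600".encode("gb18030")
print(len(ascii_bytes), len(hanzi_bytes), len(emoji_bytes))

# The conversion back to Unicode is lossless for these samples.
sample = "A\u4E2D\U0001F600"
round_trip = sample.encode("gb18030").decode("gb18030")
print(round_trip == sample)
```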


Additional copyright notice: Some content of this page is a derivative work of a Wikipedia article under the GNU FDL. The original article and author information can be found at http://en.wikipedia.org/wiki/Unicode.