Comparison of Unicode Encodings
Compatibility Issues
A UTF-8 file that contains only ASCII characters is identical to an ASCII file. UTF-16 and UTF-32 are incompatible with ASCII files: Unicode-aware programs are required to display, print and manipulate them.
This means that UTF-16 systems such as Windows and Java represent text objects such as program code as 8-bit ASCII, not UTF-16. Indeed, it is very rare to find a UTF-16 encoded text file on any system unless it is part of some more complex structure.
XML is normally encoded as UTF-8, rarely if ever in UTF-16.
Further, UTF-16 files contain many null bytes, which are incompatible with normal C string handling, so programs need to be specially written to handle UTF-16 files. Legacy programs, on the other hand, can generally handle UTF-8 encoded files even if they contain non-ASCII characters.
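The compatibility points above can be checked with a short Python sketch. The sample string and the helper `c_string_view` are illustrative assumptions, not part of any standard; the helper only mimics a routine that stops at the first NUL byte.

```python
# Compare how the same ASCII-only text is laid out in each encoding.
text = "Hello"

ascii_bytes = text.encode("ascii")
utf8_bytes = text.encode("utf-8")
utf16_bytes = text.encode("utf-16-le")   # explicit byte order, no BOM
utf32_bytes = text.encode("utf-32-le")

# For ASCII-only text, UTF-8 output is byte-for-byte identical to ASCII.
same_as_ascii = utf8_bytes == ascii_bytes

# UTF-16 and UTF-32 pad every ASCII character with zero bytes.
utf16_nulls = utf16_bytes.count(0)
utf32_nulls = utf32_bytes.count(0)


def c_string_view(data: bytes) -> bytes:
    """Hypothetical helper: the bytes a NUL-terminated string reader would see."""
    return data.split(b"\x00", 1)[0]


print(same_as_ascii, utf16_nulls, utf32_nulls)
print(c_string_view(utf8_bytes), c_string_view(utf16_bytes))
```

A routine that treats NUL as a terminator sees the whole UTF-8 string but only the first byte of the UTF-16 one.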
Size issues
UTF-32 requires four bytes to encode any character. Since characters outside the basic multilingual plane are rare, a document encoded in UTF-32 will usually be nearly twice as large as its UTF-16 encoded equivalent, because UTF-16 uses only two bytes for characters inside the basic multilingual plane and four bytes for the other, rare characters.
UTF-8 uses between one and four bytes to encode a character. It requires one byte for ASCII characters, making it half the size of UTF-16 for texts consisting mostly of unaccented Latin letters. For most other alphabets (such as Greek, Cyrillic, Hebrew and Arabic) it requires two bytes, the same as UTF-16. Code points from U+0800 to U+FFFF, which include the Indic scripts and the CJK ideographs, require three bytes in UTF-8 but only two in UTF-16. Characters outside the basic multilingual plane need four bytes in both UTF-8 and UTF-16.
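As a rough illustration of these size differences, the following Python sketch measures a few sample strings. The samples and the function name `encoded_sizes` are assumptions chosen for the example, and byte order marks are deliberately excluded by naming an explicit byte order.

```python
def encoded_sizes(text: str) -> dict:
    """Return the encoded length in bytes of text in UTF-8, UTF-16 and UTF-32."""
    codecs_by_name = (
        ("UTF-8", "utf-8"),
        ("UTF-16", "utf-16-le"),   # explicit byte order, so no BOM is added
        ("UTF-32", "utf-32-le"),
    )
    return {name: len(text.encode(codec)) for name, codec in codecs_by_name}


# Illustrative samples, written with escapes so the code points are unambiguous.
samples = {
    "Latin": "hello",
    "Greek": "\u03b3\u03b5\u03b9\u03ac",   # four Greek letters
    "Chinese": "\u4f60\u597d",             # two CJK ideographs
    "Supplementary": "\U0001F600",         # one code point outside the BMP
}

for label, sample in samples.items():
    print(label, encoded_sizes(sample))
```

Real documents usually mix markup, digits and spaces with the main script, so the measured ratios for whole files tend to be less extreme than for these isolated samples.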
All printable characters in UTF-EBCDIC use at least as many bytes as in UTF-8 and most use more, due to a decision made to allow encoding the C1 control codes as single bytes.
For seven-bit environments, UTF-7 is clearly more space-efficient than the combination of other Unicode encodings with quoted-printable or base64.
Processing Issues
For processing, a format should be easy to search, truncate, and generally process safely. All normal Unicode encodings use some form of fixed-size code unit. Depending on the format and the code point to be encoded, one or more of these code units will represent a Unicode code point. To allow easy searching and truncation, a sequence must not occur within a longer sequence or across the boundary of two other sequences. UTF-8, UTF-16, UTF-32 and UTF-EBCDIC have these important properties, but UTF-7 and GB18030 do not.
Fixed-size characters can be helpful, but it should be remembered that even if there is a fixed byte count per code point (as in UTF-32), there is not a fixed byte count per displayed character, due to combining characters.
If you are working heavily with a particular API that has standardised on a particular Unicode encoding, it is generally a good idea to use that encoding, to avoid the need to convert before every call to the API. Similarly, if you are writing server-side software, it may simplify matters to use the same format for processing that you are communicating in.
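The remark about combining characters can be illustrated with a minimal Python sketch. It uses the standard `unicodedata` module; the variable names are illustrative only.

```python
import unicodedata

composed = "\u00e9"      # "é" as a single code point
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT

# Both strings display as one character, but their code point counts differ.
composed_code_points = len(composed)
decomposed_code_points = len(decomposed)

# Even in UTF-32, the displayed character takes 4 or 8 bytes depending on form.
composed_utf32_bytes = len(composed.encode("utf-32-le"))
decomposed_utf32_bytes = len(decomposed.encode("utf-32-le"))

# Normalization (here NFC) maps the decomposed form to the composed one.
normalized = unicodedata.normalize("NFC", decomposed)

print(composed_code_points, decomposed_code_points)
print(composed_utf32_bytes, decomposed_utf32_bytes)
print(normalized == composed)
```

Truncating the decomposed string after its first code point would silently drop the accent, which is the kind of error a fixed-width encoding does not prevent.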
UTF-16 is popular because many APIs date to the time when Unicode was 16-bit fixed width. Unfortunately using UTF-16 makes characters outside the Basic Multilingual Plane a special case which increases the risk of oversights related to their handling. That said, programs that mishandle surrogate pairs probably also have problems with combining sequences, so using UTF-32 is unlikely to solve the more general problem of poor handling of multi-code-unit characters.
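To make the special case concrete, the sketch below splits a code point outside the Basic Multilingual Plane into its UTF-16 surrogate pair and compares the result with Python's own encoder. The function names are hypothetical helpers written for this example.

```python
def to_surrogates(code_point: int) -> tuple:
    """Split a supplementary code point (U+10000..U+10FFFF) into a surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)     # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)    # bottom 10 bits
    return high, low


def utf16_code_units(text: str) -> list:
    """Return the 16-bit code units of text as Python's UTF-16 encoder produces them."""
    data = text.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]


bmp_units = utf16_code_units("A")                   # one code unit
supplementary_units = utf16_code_units("\U0001F600")  # two code units

print([hex(u) for u in bmp_units])
print([hex(u) for u in supplementary_units])
print([hex(u) for u in to_surrogates(0x1F600)])
```

Code that indexes or truncates by code unit needs to treat the two halves of such a pair as inseparable.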
For communication and storage
Some protocols and file formats may be limited to a specific set of encodings, but even when they are not, some encodings may offer better compatibility than others with existing implementations. The cost of converting between your processing format and your communication format should also be considered, both in terms of program size (e.g. GB18030 requires a huge mapping table) and run-time requirements.
UTF-16 and UTF-32 are not byte oriented, so a byte order must be selected when transmitting them over a byte-oriented network or storing them in a byte-oriented file. This may be achieved by standardising on a single byte order, by specifying the endianness as part of external metadata (for example, the MIME charset registry has distinct UTF-16BE and UTF-16LE registrations as well as the plain UTF-16 one) or by using a Byte Order Mark at the start of the text. UTF-8 does not have these problems.
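The byte order options can be seen in a small Python sketch that uses the standard `codecs` constants. Note that Python's plain `"utf-16"` codec writes a Byte Order Mark in the machine's native order, so the example avoids assuming which order that is.

```python
import codecs

text = "A"

# Explicit byte orders: no Byte Order Mark is written.
big_endian = text.encode("utf-16-be")      # b"\x00A"
little_endian = text.encode("utf-16-le")   # b"A\x00"

# Plain "utf-16": Python prepends a BOM in the platform's native byte order.
with_bom = text.encode("utf-16")
bom_present = with_bom[:2] in (codecs.BOM_UTF16_BE, codecs.BOM_UTF16_LE)

# A decoder can use the BOM to pick the right byte order either way.
decoded_be = (codecs.BOM_UTF16_BE + big_endian).decode("utf-16")
decoded_le = (codecs.BOM_UTF16_LE + little_endian).decode("utf-16")

# UTF-8 is a byte sequence already, so there is no byte order to choose.
utf8 = text.encode("utf-8")

print(big_endian, little_endian, bom_present, decoded_be, decoded_le, utf8)
```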
If the byte stream is subject to corruption, some encodings recover better than others. UTF-8 and UTF-EBCDIC are best in this regard, as they can always resynchronise at the start of the next good character. UTF-16 and UTF-32 will handle corrupt bytes well (again recovering on the next good character), but a lost byte will garble all following text. GB18030 may be thrown out of sync by a corrupt or missing byte and has no designed-in recovery.
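The difference between a corrupted byte and a lost byte can be sketched in Python as follows. The sample text, the byte positions and the helpers `drop_byte` and `corrupt_byte` are assumptions made for this illustration; decoding uses `errors="replace"` so that damage shows up as U+FFFD instead of raising an exception.

```python
text = "abc\u00e9def"   # "abcédef"


def drop_byte(data: bytes, index: int) -> bytes:
    """Simulate a lost byte by removing the byte at index."""
    return data[:index] + data[index + 1:]


def corrupt_byte(data: bytes, index: int) -> bytes:
    """Simulate a corrupted byte by overwriting the byte at index with 0xFF."""
    return data[:index] + b"\xff" + data[index + 1:]


utf8 = text.encode("utf-8")        # the accented letter starts at byte 3
utf16 = text.encode("utf-16-le")   # the accented letter starts at byte 6

# UTF-8 resynchronises after either kind of damage.
utf8_corrupt = corrupt_byte(utf8, 3).decode("utf-8", errors="replace")
utf8_dropped = drop_byte(utf8, 3).decode("utf-8", errors="replace")

# UTF-16 survives a corrupted byte, but a lost byte shifts every later code unit.
utf16_corrupt = corrupt_byte(utf16, 6).decode("utf-16-le", errors="replace")
utf16_dropped = drop_byte(utf16, 6).decode("utf-16-le", errors="replace")

for label, result in (
    ("UTF-8, corrupt byte", utf8_corrupt),
    ("UTF-8, lost byte", utf8_dropped),
    ("UTF-16, corrupt byte", utf16_corrupt),
    ("UTF-16, lost byte", utf16_dropped),
):
    print(label, "-> tail intact:", result.endswith("def"))
```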
In detail
The tables below list the number of bytes per code point for different Unicode ranges. Any additional comments needed are included in the table. The figures assume that overheads at the start and end of the block of text are negligible.
N.B. The tables below list numbers of bytes per code point, not per user-visible "character" (or "grapheme cluster"). It can take multiple code points to describe a single grapheme cluster, so even in UTF-32, care must be taken when splitting or concatenating strings.
Eight-bit environments
| Code range (hexadecimal) | UTF-8 | UTF-16 | UTF-32 | UTF-EBCDIC | GB18030 |
|---|---|---|---|---|---|
| 000000 – 00007F | 1 | 2 | 4 | 1 | 1 |
| 000080 – 00009F | 2 | 2 | 4 | 1 | 2 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 4 for everything else |
| 0000A0 – 0003FF | 2 | 2 | 4 | 2 | 2 or 4 (as above) |
| 000400 – 0007FF | 2 | 2 | 4 | 3 | 2 or 4 (as above) |
| 000800 – 003FFF | 3 | 2 | 4 | 3 | 2 or 4 (as above) |
| 004000 – 00FFFF | 3 | 2 | 4 | 4 | 2 or 4 (as above) |
| 010000 – 03FFFF | 4 | 4 | 4 | 4 | 4 |
| 040000 – 10FFFF | 4 | 4 | 4 | 5 | 4 |
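The UTF-8, UTF-16 and UTF-32 columns follow directly from the code point value. The sketch below expresses them as small Python functions and cross-checks them against Python's own encoders at the range boundaries. The function names are hypothetical. UTF-EBCDIC is left out because the Python standard library has no codec for it, and GB18030 is left out because its length depends on a mapping table, not on simple ranges.

```python
def utf8_length(code_point: int) -> int:
    """Bytes needed to encode one code point in UTF-8."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3
    return 4


def utf16_length(code_point: int) -> int:
    """Bytes needed in UTF-16: one code unit in the BMP, a surrogate pair above it."""
    return 2 if code_point < 0x10000 else 4


def utf32_length(code_point: int) -> int:
    """Bytes needed in UTF-32: always one 32-bit code unit."""
    return 4


# Boundary code points of the ranges (surrogates U+D800..U+DFFF are not encodable).
boundaries = [0x00, 0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000, 0x10FFFF]

for cp in boundaries:
    ch = chr(cp)
    measured = (
        len(ch.encode("utf-8")),
        len(ch.encode("utf-16-le")),
        len(ch.encode("utf-32-le")),
    )
    predicted = (utf8_length(cp), utf16_length(cp), utf32_length(cp))
    print(f"U+{cp:06X}", predicted, predicted == measured)
```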
Seven-bit environments
This table may not cover every special case and so should be used for estimation and comparison only. To accurately determine the size of text in an encoding, see the actual specifications.
| Code range (hexadecimal) | UTF-7 | UTF-8 quoted-printable | UTF-8 base64 | UTF-16 quoted-printable | UTF-16 base64 | UTF-32 quoted-printable | UTF-32 base64 | GB18030 quoted-printable | GB18030 base64 |
|---|---|---|---|---|---|---|---|---|---|
| 000000 – 000032 | same as 000080 – 00FFFF | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 000033 – 00003C | 1 for "direct characters" and possibly "optional direct characters" (depending on the encoder setting), 2 for +, otherwise same as 000080 – 00FFFF | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
| 00003D (equals sign) | as for 000033 – 00003C | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 00003E – 00007E | as for 000033 – 00003C | 1 | 1⅓ | 4 | 2⅔ | 10 | 5⅓ | 1 | 1⅓ |
| 00007F | 5 for an isolated case inside a run of single-byte characters; for runs, 2⅔ per character plus padding to make it a whole number of bytes, plus 2 to start and finish the run | 3 | 1⅓ | 6 | 2⅔ | 12 | 5⅓ | 3 | 1⅓ |
| 000080 – 0007FF | as for 00007F | 6 | 2⅔ | 2–6 depending on whether the byte values need to be escaped | 2⅔ | 8–12 depending on whether the final two byte values need to be escaped | 5⅓ | 4–6 for characters inherited from GB2312/GBK (e.g. most Chinese characters), 8 for everything else | 2⅔ for characters inherited from GB2312/GBK (e.g. most Chinese characters), 5⅓ for everything else |
| 000800 – 00FFFF | as for 00007F | 9 | 4 | 2–6 (as above) | 2⅔ | 8–12 (as above) | 5⅓ | 4–6 or 8 (as above) | 2⅔ or 5⅓ (as above) |
| 010000 – 10FFFF | 8 for an isolated case; for a run, 5⅓ per character plus padding to an integer, plus 2 | 12 | 5⅓ | 8–12 depending on whether the low bytes of the surrogates need to be escaped | 5⅓ | 8–12 (as above) | 5⅓ | 8 | 5⅓ |
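Some of these figures can be reproduced with Python's standard `quopri` and `base64` modules and its built-in `utf-7` codec, as in the sketch below. The function name `seven_bit_sizes` is an assumption for this example. Unlike the table, `base64.b64encode` pads isolated characters to a multiple of four output bytes, so the base64 results only approach the per-character figures for longer runs.

```python
import base64
import quopri


def seven_bit_sizes(text: str) -> dict:
    """Return encoded byte counts of text in several seven-bit-safe representations."""
    sizes = {"UTF-7": len(text.encode("utf-7"))}
    codecs_by_name = (
        ("UTF-8", "utf-8"),
        ("UTF-16", "utf-16-be"),   # explicit byte order, so no BOM is added
        ("UTF-32", "utf-32-be"),
    )
    for name, codec in codecs_by_name:
        raw = text.encode(codec)
        sizes[name + " quoted-printable"] = len(quopri.encodestring(raw))
        sizes[name + " base64"] = len(base64.b64encode(raw))   # includes padding
    return sizes


for sample in ("A", "\u00e9", "\U0001F600"):
    print(repr(sample), seven_bit_sizes(sample))
```

For a single "A", the quoted-printable sizes are 1, 4 and 10 bytes for UTF-8, UTF-16 and UTF-32, and an isolated "é" takes 5 bytes in UTF-7, in line with the table.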
Not yet developed: UTF-6 and UTF-5
Some proposals have been made for UTF-6 and UTF-5 encodings for radio telegraphy environments[1]; however, no such standard had been formalized as of 2006. These proposals are not related to Punycode.
Not being seriously pursued: UTF-9 and UTF-18
RFC 4042 specifies "UTF-9 and UTF-18 Efficient Transformation Formats of Unicode", but is not being actively pursued. It was released on April 1, 2005 and is of marginal use in computers having other than 36-bit word lengths.
References
1. Seng, James, UTF-5, a transformation format of Unicode and ISO 10646, 28 Jan 2000, retrieved 23 Aug 2007
This article is copied from an article on Wikipedia.org - the free encyclopedia created and edited by online user community. The text was not checked or edited by anyone on our staff. Although the vast majority of the wikipedia encyclopedia articles provide accurate and timely information please do not assume the accuracy of any particular article. This article is distributed under the terms of GNU Free Documentation License.