Network Working Group                                         P. Hoffman
Request for Comments: 2781                      Internet Mail Consortium
Category: Informational                                       F. Yergeau
                                                       Alis Technologies
                                                           February 2000

                    UTF-16, an encoding of ISO 10646

Status of this Memo

This memo provides information for the Internet community. It does not specify an Internet standard of any kind. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The Internet Society (2000). All Rights Reserved.

This document describes the UTF-16 encoding of Unicode/ISO-10646, addresses the issues of serializing UTF-16 as an octet stream for transmission over the Internet, discusses MIME charset naming as described in [CHARSET-REG], and contains the registration for three MIME charset parameter values: UTF-16BE (big-endian), UTF-16LE (little-endian), and UTF-16.

The Unicode Standard [UNICODE] and ISO/IEC 10646 [ISO-10646] jointly define a coded character set (CCS), hereafter referred to as Unicode, which encompasses most of the world's writing systems [WORKSHOP]. UTF-16, the object of this specification, is one of the standard ways of encoding Unicode character data; it has the characteristics of encoding all currently defined characters (in plane 0, the BMP) in exactly two octets and of being able to encode all other characters likely to be defined (the next 16 planes) in exactly four octets.

The Unicode Standard further defines additional character properties and other application details of great interest to implementors. Up to the present time, changes in Unicode and amendments to ISO/IEC 10646 have tracked each other, so that the character repertoires and code point assignments have remained in sync. The relevant standardization committees have committed to maintain this very useful synchronism, as well as not to assign characters outside of the 17 planes accessible to UTF-16.
The IETF policy on character sets and languages [CHARPOLICY] says that IETF protocols MUST be able to use the UTF-8 character encoding scheme [UTF-8]. Some products and network standards already specify UTF-16, making it an important encoding for the Internet. This document is not an update to the [CHARPOLICY] document, only a description of the UTF-16 encoding.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [MUSTSHOULD].

Throughout this document, character values are shown in hexadecimal notation. For example, "0x013C" is the character assigned the integer value 316 (decimal) in the CCS.

UTF-16 is described in the Unicode Standard, version 3.0 [UNICODE]. The definitive reference is Annex Q of ISO/IEC 10646-1 [ISO-10646]. The rest of this section summarizes the definition in simple terms.

In ISO 10646, each character is assigned a number, which Unicode calls the Unicode scalar value. This number is the same as the UCS-4 value of the character, and this document will refer to it as the "character value" for brevity. In the UTF-16 encoding, characters are represented using either one or two unsigned 16-bit integers, depending on the character value. Serialization of these integers for transmission as a byte stream is discussed in Section 3.

The rules for how characters are encoded in UTF-16 are:

-  Characters with values less than 0x10000 are represented as a single 16-bit integer with a value equal to that of the character number.

-  Characters with values between 0x10000 and 0x10FFFF are represented by a 16-bit integer with a value between 0xD800 and 0xDBFF (within the so-called high-half zone or high surrogate area) followed by a 16-bit integer with a value between 0xDC00 and 0xDFFF (within the so-called low-half zone or low surrogate area).

-  Characters with values greater than 0x10FFFF cannot be encoded in UTF-16.

Note: Values between 0xD800 and 0xDFFF are specifically reserved for use with UTF-16, and don't have any characters assigned to them.
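The three encoding rules above can be illustrated with a minimal Python sketch. The function name encode_utf16 is hypothetical, and the sketch makes two assumptions that go beyond the rules as stated: it splits the 20-bit offset (the character value minus 0x10000) into its high and low 10 bits to build the surrogate pair, and it rejects values in the reserved range 0xD800 to 0xDFFF because no characters are assigned there.

```python
from typing import List


def encode_utf16(u: int) -> List[int]:
    """Return the one or two 16-bit integers representing character value u.

    Hypothetical helper for illustration only.
    """
    if u < 0 or u > 0x10FFFF:
        # Values greater than 0x10FFFF cannot be encoded in UTF-16.
        raise ValueError("character value out of range for UTF-16")
    if 0xD800 <= u <= 0xDFFF:
        # Assumption of this sketch: reserved values are rejected,
        # since no characters are assigned to them.
        raise ValueError("0xD800..0xDFFF is reserved for surrogates")
    if u < 0x10000:
        # A single 16-bit integer equal to the character value.
        return [u]
    u_prime = u - 0x10000            # 20-bit offset into planes 1..16
    w1 = 0xD800 | (u_prime >> 10)    # high surrogate: top 10 bits
    w2 = 0xDC00 | (u_prime & 0x3FF)  # low surrogate: bottom 10 bits
    return [w1, w2]
```

For instance, encode_utf16(0x013C) yields the single integer 0x013C, while encode_utf16(0x12345) yields the pair 0xD808, 0xDF45.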
5) Add 0x10000 to U' to obtain the character value U. Terminate.

Note that steps 2 and 3 indicate errors. Error recovery is not specified by this document. When terminating with an error in steps 2 and 3, it may be wise to set U to the value of W1 to help the caller diagnose the error and not lose information. Also note that a string decoding algorithm, as opposed to the single-character decoding described above, need not terminate upon detection of an error, if proper error reporting and/or recovery is provided.

Appendix A of this specification contains registrations for three MIME charsets: "UTF-16BE", "UTF-16LE", and "UTF-16". MIME charsets represent the combination of a CCS (a coded character set) and a CES (a character encoding scheme). Here the CCS is Unicode/ISO 10646 and the CES is the same in all three cases, except for the serialization order of the octets in each character, and the external determination of which serialization is used.

This section describes which of the three labels to apply to a stream of text. Section 4 describes how to interpret the labels on a stream of text.
The term "network byte order" has been used in many RFCs to indicate big-endian serialization, although that term has yet to be formally defined in a standards-track document. Although ISO 10646 prefers big-endian serialization (section 6.3 of [ISO-10646]), little-endian order is also sometimes used on the Internet.
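The difference between the two serialization orders can be shown with a small Python sketch. The function name serialize_utf16 and its big_endian parameter are hypothetical; the sketch only assumes that each 16-bit integer is written as two octets, most significant octet first for big-endian and least significant octet first for little-endian.

```python
from typing import Iterable


def serialize_utf16(words: Iterable[int], big_endian: bool = True) -> bytes:
    """Serialize 16-bit integers as octets in the chosen byte order.

    Hypothetical helper for illustration only.
    """
    order = "big" if big_endian else "little"
    # int.to_bytes raises OverflowError if a value does not fit in 2 octets.
    return b"".join(w.to_bytes(2, order) for w in words)


words = [0xD808, 0xDF45, 0x003D, 0x0052, 0x0061]
big = serialize_utf16(words, big_endian=True)
little = serialize_utf16(words, big_endian=False)
```

With these inputs, big holds the octets D8 08 DF 45 00 3D 00 52 00 61 and little holds 08 D8 45 DF 3D 00 52 00 61 00.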
NON-BREAKING SPACE" at the connection point. Also, some specifications mandate an initial 0xFEFF character in objects labelled as UTF-16 and specify that this signature is not part of the object.
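As a rough illustration of how a receiver might use an initial 0xFEFF signature, the Python sketch below inspects the first two octets, picks the byte order, and then recombines surrogate pairs into character values. The function name decode_labelled_utf16 is hypothetical. Two choices are assumptions of this sketch rather than requirements stated here: the signature is dropped from the result, and text with no signature is read as big-endian. Errors raise an exception, although a string decoder may instead report the error and continue.

```python
from typing import List


def decode_labelled_utf16(octets: bytes) -> List[int]:
    """Decode octets labelled "UTF-16" into a list of character values.

    Hypothetical helper for illustration only.
    """
    if len(octets) % 2:
        raise ValueError("odd number of octets")
    order, start = "big", 0            # assumption: no signature => big-endian
    if octets[:2] == b"\xFE\xFF":      # 0xFEFF serialized big-endian
        start = 2                      # assumption: signature is not content
    elif octets[:2] == b"\xFF\xFE":    # 0xFEFF serialized little-endian
        order, start = "little", 2
    words = [int.from_bytes(octets[i:i + 2], order)
             for i in range(start, len(octets), 2)]

    chars: List[int] = []
    i = 0
    while i < len(words):
        w1 = words[i]
        if w1 < 0xD800 or w1 > 0xDFFF:
            chars.append(w1)           # not a surrogate: value is W1 itself
            i += 1
            continue
        if w1 > 0xDBFF:
            raise ValueError("low surrogate without a preceding high surrogate")
        if i + 1 >= len(words) or not 0xDC00 <= words[i + 1] <= 0xDFFF:
            raise ValueError("high surrogate not followed by a low surrogate")
        # Low 10 bits of W1 become the high 10 bits of the 20-bit value U'.
        u_prime = ((w1 & 0x3FF) << 10) | (words[i + 1] & 0x3FF)
        chars.append(u_prime + 0x10000)
        i += 2
    return chars
```

Feeding it the octets FE FF D8 08 DF 45 00 3D 00 52 00 61 or FF FE 08 D8 45 DF 3D 00 52 00 61 00 produces the same four character values: 0x12345, 0x3D, 0x52, 0x61.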
Big-endian text labelled with UTF-16, with a BOM:

   FE FF D8 08 DF 45 00 3D 00 52 00 61

Little-endian text labelled with UTF-16, with a BOM:

   FF FE 08 D8 45 DF 3D 00 52 00 61 00

Appendix A. Appendixes A.1, A.2, and A.3 according to RFC 2278, using registration templates found in those appendixes. Section 6 and Appendix A of this document.

Processors must be able to handle characters that are not defined at the time that the processor was created in such a way as to not allow an attacker to harm a recipient by including unknown characters.

Processors that handle any type of text, including text encoded as UTF-16, must be vigilant in checking for control characters that might reprogram a display terminal or keyboard. Similarly, processors that interpret text entities (such as looking for embedded programming code) must be careful not to execute the code without first alerting the recipient.

Text in UTF-16 may contain special characters, such as the OBJECT REPLACEMENT CHARACTER (0xFFFC), that might cause external processing, depending on the interpretation of the processing program and the availability of an external data stream that would be executed. This external processing may have side-effects that allow the sender of a message to attack the receiving system.

Implementors of UTF-16 need to consider the security aspects of how they handle illegal UTF-16 sequences (that is, sequences involving surrogate pairs that have illegal values or unpaired surrogates). It is conceivable that in some circumstances an attacker would be able to exploit an incautious UTF-16 parser by sending it an octet sequence that is not permitted by the UTF-16 syntax, causing it to behave in some anomalous fashion.

[CHARPOLICY] Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998.

[CHARSET-REG] Freed, N. and J. Postel, "IANA Charset Registration Procedures", BCP 19, RFC 2278, January 1998.

[HTTP-1.1] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P. and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.

[ISO-10646] ISO/IEC 10646-1:1993. International Standard -- Information technology -- Universal Multiple-Octet Coded Character Set (UCS) -- Part 1: Architecture and Basic Multilingual Plane. 22 amendments and two technical corrigenda have been published up to now. UTF-16 is described in Annex Q, published as Amendment 1. Many other amendments are currently at various stages of standardization. A second edition is in preparation, probably to be published in 2000; in this new edition, UTF-16 will probably be described in Annex C.

[MUSTSHOULD] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[UNICODE] The Unicode Consortium, "The Unicode Standard -- Version 3.0", ISBN 0-201-61633-5. Described at <http://www.unicode.org/unicode/standard/versions/Unicode3.0.html>.

[UTF-8] Yergeau, F., "UTF-8, a transformation format of ISO 10646", RFC 2279, January 1998.

[WORKSHOP] Weider, C., Preston, C., Simonsen, K., Alvestrand, H., Atkinson, R., Crispin, M. and P. Svanberg, "Report of the IAB Character Set Workshop", RFC 2130, April 1997.

Some of the text in this specification was copied from [UTF-8], and that document was worked on by many people. Please see the acknowledgments section in that document for more people who may have contributed indirectly to this document.

This memo is meant to serve as the basis for registration of three MIME charsets [CHARSET-REG]. The proposed charsets are "UTF-16BE", "UTF-16LE", and "UTF-16". These strings label objects containing text consisting of characters from the repertoire of ISO/IEC 10646 including all amendments at least up to amendment 5 (Korean block), encoded to a sequence of octets using the encoding and serialization schemes outlined above.

Note that "UTF-16BE", "UTF-16LE", and "UTF-16" are NOT suitable for use in media types under the "text" top-level type, because they do not encode line endings in the way required for MIME "text" media types. An exception to this is HTTP, which uses a MIME-like mechanism, but is exempt from the restrictions on the text top-level type (see section 19.4.2 of HTTP 1.1 [HTTP-1.1]).

It is noteworthy that the labels described here do not contain a version identification, referring generically to ISO/IEC 10646. This is intentional, the rationale being as follows:

A MIME charset is designed to give just the information needed to interpret a sequence of bytes received on the wire into a sequence of characters, nothing more (see RFC 2045, section 2.2, in [MIME]). As long as a character set standard does not change incompatibly, version numbers serve no purpose, because one gains nothing by learning from the tag that newly assigned characters may be received that one doesn't know about. The tag itself doesn't teach anything about the new characters, which are going to be received anyway.

Hence, as long as the standards evolve compatibly, the apparent advantage of having labels that identify the versions is only that, apparent. But there is a disadvantage to such version-dependent labels: when an older application receives data accompanied by a newer, unknown label, it may fail to recognize the label and be completely unable to deal with the data, whereas a generic, known label would have triggered mostly correct processing of the data, which may well not contain any new characters.

The "Korean mess" (ISO/IEC 10646 amendment 5) is an incompatible change, in principle contradicting the appropriateness of a version independent MIME charset as described above. But the compatibility problem can only appear with data containing Korean Hangul characters encoded according to Unicode 1.1 (or equivalently ISO/IEC 10646 before amendment 5), and there is arguably no such data to worry about, this being the very reason the incompatible change was deemed acceptable.
In practice, then, a version-independent label is warranted, provided the label is understood to refer to all versions after Amendment 5, and provided no incompatible change actually occurs. Should incompatible changes occur in a later version of ISO/IEC 10646, the MIME charsets defined here will stay aligned with the previous version until and unless the IETF specifically decides otherwise.
Suitable for use in MIME content types under the "text" top-level type: No

Person & email address to contact for further information:

   Paul Hoffman <firstname.lastname@example.org>
   Francois Yergeau <email@example.com>
Full Copyright Statement

Copyright (C) The Internet Society (2000). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.