Tech-invite3GPPspaceIETFspace
959493929190898887868584838281807978777675747372717069686766656463626160595857565554535251504948474645444342414039383736353433323130292827262524232221201918171615141312111009080706050403020100
in Index   Prev   Next

RFC 3987

Internationalized Resource Identifiers (IRIs)

Pages: 46
Proposed Standard
Errata
Part 1 of 3 – Pages 1 to 9
None   None   Next

Top   ToC   RFC3987 - Page 1
Network Working Group                                          M. Duerst
Request for Comments: 3987                                           W3C
Category: Standards Track                                    M. Suignard
                                                   Microsoft Corporation
                                                            January 2005


             Internationalized Resource Identifiers (IRIs)

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

This document defines a new protocol element, the Internationalized Resource Identifier (IRI), as a complement to the Uniform Resource Identifier (URI). An IRI is a sequence of characters from the Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to URIs is defined, which means that IRIs can be used instead of URIs, where appropriate, to identify resources. The approach of defining a new protocol element was chosen instead of extending or changing the definition of URIs. This was done in order to allow a clear distinction and to avoid incompatibilities with existing software. Guidelines are provided for the use and deployment of IRIs in various protocols, formats, and software components that currently deal with URIs.

Table of Contents

1. Introduction . . . . . . . . . . . . . . . . . . . . . . . . . 3 1.1. Overview and Motivation . . . . . . . . . . . . . . . . 3 1.2. Applicability . . . . . . . . . . . . . . . . . . . . . 3 1.3. Definitions . . . . . . . . . . . . . . . . . . . . . . 4 1.4. Notation . . . . . . . . . . . . . . . . . . . . . . . . 5 2. IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . . 6 2.1. Summary of IRI Syntax . . . . . . . . . . . . . . . . . 6 2.2. ABNF for IRI References and IRIs . . . . . . . . . . . . 7
Top   ToC   RFC3987 - Page 2
   3.  Relationship between IRIs and URIs . . . . . . . . . . . . . . 10
       3.1.  Mapping of IRIs to URIs  . . . . . . . . . . . . . . . . 10
       3.2.  Converting URIs to IRIs  . . . . . . . . . . . . . . . . 14
             3.2.1.  Examples . . . . . . . . . . . . . . . . . . . . 15
   4.  Bidirectional IRIs for Right-to-Left Languages.  . . . . . . . 16
       4.1.  Logical Storage and Visual Presentation  . . . . . . . . 17
       4.2.  Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 18
       4.3.  Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 19
       4.4.  Examples . . . . . . . . . . . . . . . . . . . . . . . . 19
   5.  Normalization and Comparison . . . . . . . . . . . . . . . . . 21
       5.1.  Equivalence  . . . . . . . . . . . . . . . . . . . . . . 22
       5.2.  Preparation for Comparison . . . . . . . . . . . . . . . 22
       5.3.  Comparison Ladder  . . . . . . . . . . . . . . . . . . . 23
             5.3.1.  Simple String Comparison . . . . . . . . . . . . 23
             5.3.2.  Syntax-Based Normalization . . . . . . . . . . . 24
             5.3.3.  Scheme-Based Normalization . . . . . . . . . . . 27
             5.3.4.  Protocol-Based Normalization . . . . . . . . . . 28
   6.  Use of IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . 29
       6.1.  Limitations on UCS Characters Allowed in IRIs  . . . . . 29
       6.2.  Software Interfaces and Protocols  . . . . . . . . . . . 29
       6.3.  Format of URIs and IRIs in Documents and Protocols . . . 30
       6.4.  Use of UTF-8 for Encoding Original Characters .. . . . . 30
       6.5.  Relative IRI References  . . . . . . . . . . . . . . . . 32
   7.  URI/IRI Processing Guidelines (informative)  . . . . . . . . . 32
       7.1.  URI/IRI Software Interfaces  . . . . . . . . . . . . . . 32
       7.2.  URI/IRI Entry  . . . . . . . . . . . . . . . . . . . . . 33
       7.3.  URI/IRI Transfer between Applications  . . . . . . . . . 33
       7.4.  URI/IRI Generation . . . . . . . . . . . . . . . . . . . 34
       7.5.  URI/IRI Selection  . . . . . . . . . . . . . . . . . . . 34
       7.6.  Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 35
       7.7.  Interpretation of URIs and IRIs  . . . . . . . . . . . . 36
       7.8.  Upgrading Strategy . . . . . . . . . . . . . . . . . . . 36
   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 37
   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 39
   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40
       10.1. Normative References . . . . . . . . . . . . . . . . . . 40
       10.2. Informative References . . . . . . . . . . . . . . . . . 41
   A.  Design Alternatives  . . . . . . . . . . . . . . . . . . . . . 44
       A.1.  New Scheme(s)  . . . . . . . . . . . . . . . . . . . . . 44
       A.2.  Character Encodings Other Than UTF-8 . . . . . . . . . . 44
       A.3.  New Encoding Convention  . . . . . . . . . . . . . . . . 44
       A.4.  Indicating Character Encodings in the URI/IRI  . . . . . 45
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45
   Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 46
Top   ToC   RFC3987 - Page 3

1. Introduction

1.1. Overview and Motivation

A Uniform Resource Identifier (URI) is defined in [RFC3986] as a sequence of characters chosen from a limited subset of the repertoire of US-ASCII [ASCII] characters. The characters in URIs are frequently used for representing words of natural languages. This usage has many advantages: Such URIs are easier to memorize, easier to interpret, easier to transcribe, easier to create, and easier to guess. For most languages other than English, however, the natural script uses characters other than A - Z. For many people, handling Latin characters is as difficult as handling the characters of other scripts is for those who use only the Latin alphabet. Many languages with non-Latin scripts are transcribed with Latin letters. These transcriptions are now often used in URIs, but they introduce additional ambiguities. The infrastructure for the appropriate handling of characters from local scripts is now widely deployed in local versions of operating system and application software. Software that can handle a wide variety of scripts and languages at the same time is increasingly common. Also, increasing numbers of protocols and formats can carry a wide range of characters. This document defines a new protocol element called Internationalized Resource Identifier (IRI) by extending the syntax of URIs to a much wider repertoire of characters. It also defines "internationalized" versions corresponding to other constructs from [RFC3986], such as URI references. The syntax of IRIs is defined in section 2, and the relationship between IRIs and URIs in section 3. Using characters outside of A - Z in IRIs brings some difficulties. Section 4 discusses the special case of bidirectional IRIs, section 5 various forms of equivalence between IRIs, and section 6 the use of IRIs in different situations. Section 7 gives additional informative guidelines, and section 8 security considerations.

1.2. Applicability

IRIs are designed to be compatible with recommendations for new URI schemes [RFC2718]. The compatibility is provided by specifying a well-defined and deterministic mapping from the IRI character sequence to the functionally equivalent URI character sequence. Practical use of IRIs (or IRI references) in place of URIs (or URI references) depends on the following conditions being met:
Top   ToC   RFC3987 - Page 4
   a.  A protocol or format element should be explicitly designated to
       be able to carry IRIs.  The intent is not to introduce IRIs into
       contexts that are not defined to accept them.  For example, XML
       schema [XMLSchema] has an explicit type "anyURI" that includes
       IRIs and IRI references. Therefore, IRIs and IRI references can
       be in attributes and elements of type "anyURI".  On the other
       hand, in the HTTP protocol [RFC2616], the Request URI is defined
       as a URI, which means that direct use of IRIs is not allowed in
       HTTP requests.

   b.  The protocol or format carrying the IRIs should have a mechanism
       to represent the wide range of characters used in IRIs, either
       natively or by some protocol- or format-specific escaping
       mechanism (for example, numeric character references in [XML1]).

   c.  The URI corresponding to the IRI in question has to encode
       original characters into octets using UTF-8.  For new URI
       schemes, this is recommended in [RFC2718].  It can apply to a
       whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
       or the URN syntax [RFC2141]).  It can apply to a specific part of
       a URI, such as the fragment identifier (e.g., [XPointer]).  It
       can apply to a specific URI or part(s) thereof.  For details,
       please see section 6.4.

1.3. Definitions

The following definitions are used in this document; they follow the terms in [RFC2130], [RFC2277], and [ISO10646]. character: A member of a set of elements used for the organization, control, or representation of data. For example, "LATIN CAPITAL LETTER A" names a character. octet: An ordered sequence of eight bits considered as a unit. character repertoire: A set of characters (in the mathematical sense). sequence of characters: A sequence of characters (one after another). sequence of octets: A sequence of octets (one after another). character encoding: A method of representing a sequence of characters as a sequence of octets (maybe with variants). Also, a method of (unambiguously) converting a sequence of octets into a sequence of characters.
Top   ToC   RFC3987 - Page 5
   charset: The name of a parameter or attribute used to identify a
      character encoding.

   UCS: Universal Character Set. The coded character set defined by
      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].

   IRI reference: Denotes the common usage of an Internationalized
      Resource Identifier.  An IRI reference may be absolute or
      relative.  However, the "IRI" that results from such a reference
      only includes absolute IRIs; any relative IRI references are
      resolved to their absolute form.  Note that in [RFC2396] URIs did
      not include fragment identifiers, but in [RFC3986] fragment
      identifiers are part of URIs.

   running text: Human text (paragraphs, sentences, phrases) with syntax
      according to orthographic conventions of a natural language, as
      opposed to syntax defined for ease of processing by machines
      (e.g., markup, programming languages).

   protocol element: Any portion of a message that affects processing of
      that message by the protocol in question.

   presentation element: A presentation form corresponding to a protocol
      element; for example, using a wider range of characters.

   create (a URI or IRI): With respect to URIs and IRIs, the term is
      used for the initial creation.  This may be the initial creation
      of a resource with a certain identifier, or the initial exposition
      of a resource under a particular identifier.

   generate (a URI or IRI): With respect to URIs and IRIs, the term is
      used when the IRI is generated by derivation from other
      information.

1.4. Notation

RFCs and Internet Drafts currently do not allow any characters outside the US-ASCII repertoire. Therefore, this document uses various special notations to denote such characters in examples. In text, characters outside US-ASCII are sometimes referenced by using a prefix of 'U+', followed by four to six hexadecimal digits. To represent characters outside US-ASCII in examples, this document uses two notations: 'XML Notation' and 'Bidi Notation'.
Top   ToC   RFC3987 - Page 6
   XML Notation uses a leading '&#x', a trailing ';', and the
   hexadecimal number of the character in the UCS in between.  For
   example, я stands for CYRILLIC CAPITAL LETTER YA.  In this
   notation, an actual '&' is denoted by '&'.

   Bidi Notation is used for bidirectional examples: Lowercase letters
   stand for Latin letters or other letters that are written left to
   right, whereas uppercase letters represent Arabic or Hebrew letters
   that are written right to left.

   To denote actual octets in examples (as opposed to percent-encoded
   octets), the two hex digits denoting the octet are enclosed in "<"
   and ">".  For example, the octet often denoted as 0xc9 is denoted
   here as <c9>.

   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED",  "MAY",
   and "OPTIONAL" are to be interpreted as described in [RFC2119].

2. IRI Syntax

This section defines the syntax of Internationalized Resource Identifiers (IRIs). As with URIs, an IRI is defined as a sequence of characters, not as a sequence of octets. This definition accommodates the fact that IRIs may be written on paper or read over the radio as well as stored or transmitted digitally. The same IRI may be represented as different sequences of octets in different protocols or documents if these protocols or documents use different character encodings (and/or transfer encodings). Using the same character encoding as the containing protocol or document ensures that the characters in the IRI can be handled (e.g., searched, converted, displayed) in the same way as the rest of the protocol or document.

2.1. Summary of IRI Syntax

IRIs are defined similarly to URIs in [RFC3986], but the class of unreserved characters is extended by adding the characters of the UCS (Universal Character Set, [ISO10646]) beyond U+007F, subject to the limitations given in the syntax rules below and in section 6.1. Otherwise, the syntax and use of components and reserved characters is the same as that in [RFC3986]. All the operations defined in [RFC3986], such as the resolution of relative references, can be applied to IRIs by IRI-processing software in exactly the same way as they are for URIs by URI-processing software.
Top   ToC   RFC3987 - Page 7
   Characters outside the US-ASCII repertoire are not reserved and
   therefore MUST NOT be used for syntactical purposes, such as to
   delimit components in newly defined schemes.  For example, U+00A2,
   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
   the 'iunreserved' category. This is similar to the fact that it is
   not possible to use '-' as a delimiter in URIs, because it is in the
   'unreserved' category.

2.2. ABNF for IRI References and IRIs

Although it might be possible to define IRI references and IRIs merely by their transformation to URI references and URIs, they can also be accepted and processed directly. Therefore, an ABNF definition for IRI references (which are the most general concept and the start of the grammar) and IRIs is given here. The syntax of this ABNF is described in [RFC2234]. Character numbers are taken from the UCS, without implying any actual binary encoding. Terminals in the ABNF are characters, not bytes. The following grammar closely follows the URI grammar in [RFC3986], except that the range of unreserved characters is expanded to include UCS characters, with the restriction that private UCS characters can occur only in query parts. The grammar is split into two parts: Rules that differ from [RFC3986] because of the above-mentioned expansion, and rules that are the same as those in [RFC3986]. For rules that are different than those in [RFC3986], the names of the non-terminals have been changed as follows. If the non-terminal contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i' has been prefixed. The following rules are different from those in [RFC3986]: IRI = scheme ":" ihier-part [ "?" iquery ] [ "#" ifragment ] ihier-part = "//" iauthority ipath-abempty / ipath-absolute / ipath-rootless / ipath-empty IRI-reference = IRI / irelative-ref absolute-IRI = scheme ":" ihier-part [ "?" iquery ] irelative-ref = irelative-part [ "?" iquery ] [ "#" ifragment ] irelative-part = "//" iauthority ipath-abempty / ipath-absolute
Top   ToC   RFC3987 - Page 8
                  / ipath-noscheme
                  / ipath-empty

   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
   iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
   ihost          = IP-literal / IPv4address / ireg-name

   ireg-name      = *( iunreserved / pct-encoded / sub-delims )

   ipath          = ipath-abempty   ; begins with "/" or is empty
                  / ipath-absolute  ; begins with "/" but not "//"
                  / ipath-noscheme  ; begins with a non-colon segment
                  / ipath-rootless  ; begins with a segment
                  / ipath-empty     ; zero characters

   ipath-abempty  = *( "/" isegment )
   ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
   ipath-noscheme = isegment-nz-nc *( "/" isegment )
   ipath-rootless = isegment-nz *( "/" isegment )
   ipath-empty    = 0<ipchar>

   isegment       = *ipchar
   isegment-nz    = 1*ipchar
   isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
                        / "@" )
                  ; non-zero-length segment without any colon ":"

   ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"

   iquery         = *( ipchar / iprivate / "/" / "?" )

   ifragment      = *( ipchar / "/" / "?" )

   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD

   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
   "greedy") algorithm applies.  For details, see [RFC3986].
Top   ToC   RFC3987 - Page 9
   The following rules are the same as those in [RFC3986]:

   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   port           = *DIGIT

   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

   IPv6address    =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

   h16            = 1*4HEXDIG
   ls32           = ( h16 ":" h16 ) / IPv4address

   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

   dec-octet      = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255

   pct-encoded    = "%" HEXDIG HEXDIG

   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   This syntax does not support IPv6 scoped addressing zone identifiers.