RFC 3987

Internationalized Resource Identifiers (IRIs)

Pages: 46
Proposed Standard

RFC3987 - Page 1

Network Working Group                                          M. Duerst
Request for Comments: 3987                                           W3C
Category: Standards Track                                    M. Suignard
                                                   Microsoft Corporation
                                                            January 2005


             Internationalized Resource Identifiers (IRIs)

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2005).

Abstract

   This document defines a new protocol element, the Internationalized
   Resource Identifier (IRI), as a complement to the Uniform Resource
   Identifier (URI).  An IRI is a sequence of characters from the
   Universal Character Set (Unicode/ISO 10646).  A mapping from IRIs to
   URIs is defined, which means that IRIs can be used instead of URIs,
   where appropriate, to identify resources.

   The approach of defining a new protocol element was chosen instead of
   extending or changing the definition of URIs.  This was done in order
   to allow a clear distinction and to avoid incompatibilities with
   existing software.  Guidelines are provided for the use and
   deployment of IRIs in various protocols, formats, and software
   components that currently deal with URIs.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
       1.1.  Overview and Motivation  . . . . . . . . . . . . . . . .  3
       1.2.  Applicability  . . . . . . . . . . . . . . . . . . . . .  3
       1.3.  Definitions  . . . . . . . . . . . . . . . . . . . . . .  4
       1.4.  Notation . . . . . . . . . . . . . . . . . . . . . . . .  5
   2.  IRI Syntax . . . . . . . . . . . . . . . . . . . . . . . . . .  6
       2.1.  Summary of IRI Syntax  . . . . . . . . . . . . . . . . .  6
       2.2.  ABNF for IRI References and IRIs . . . . . . . . . . . .  7
   3.  Relationship between IRIs and URIs . . . . . . . . . . . . . . 10
       3.1.  Mapping of IRIs to URIs  . . . . . . . . . . . . . . . . 10
       3.2.  Converting URIs to IRIs  . . . . . . . . . . . . . . . . 14
             3.2.1.  Examples . . . . . . . . . . . . . . . . . . . . 15
   4.  Bidirectional IRIs for Right-to-Left Languages . . . . . . . . 16
       4.1.  Logical Storage and Visual Presentation  . . . . . . . . 17
       4.2.  Bidi IRI Structure . . . . . . . . . . . . . . . . . . . 18
       4.3.  Input of Bidi IRIs . . . . . . . . . . . . . . . . . . . 19
       4.4.  Examples . . . . . . . . . . . . . . . . . . . . . . . . 19
   5.  Normalization and Comparison . . . . . . . . . . . . . . . . . 21
       5.1.  Equivalence  . . . . . . . . . . . . . . . . . . . . . . 22
       5.2.  Preparation for Comparison . . . . . . . . . . . . . . . 22
       5.3.  Comparison Ladder  . . . . . . . . . . . . . . . . . . . 23
             5.3.1.  Simple String Comparison . . . . . . . . . . . . 23
             5.3.2.  Syntax-Based Normalization . . . . . . . . . . . 24
             5.3.3.  Scheme-Based Normalization . . . . . . . . . . . 27
             5.3.4.  Protocol-Based Normalization . . . . . . . . . . 28
   6.  Use of IRIs  . . . . . . . . . . . . . . . . . . . . . . . . . 29
       6.1.  Limitations on UCS Characters Allowed in IRIs  . . . . . 29
       6.2.  Software Interfaces and Protocols  . . . . . . . . . . . 29
       6.3.  Format of URIs and IRIs in Documents and Protocols . . . 30
       6.4.  Use of UTF-8 for Encoding Original Characters  . . . . . 30
       6.5.  Relative IRI References  . . . . . . . . . . . . . . . . 32
   7.  URI/IRI Processing Guidelines (informative)  . . . . . . . . . 32
       7.1.  URI/IRI Software Interfaces  . . . . . . . . . . . . . . 32
       7.2.  URI/IRI Entry  . . . . . . . . . . . . . . . . . . . . . 33
       7.3.  URI/IRI Transfer between Applications  . . . . . . . . . 33
       7.4.  URI/IRI Generation . . . . . . . . . . . . . . . . . . . 34
       7.5.  URI/IRI Selection  . . . . . . . . . . . . . . . . . . . 34
       7.6.  Display of URIs/IRIs . . . . . . . . . . . . . . . . . . 35
       7.7.  Interpretation of URIs and IRIs  . . . . . . . . . . . . 36
       7.8.  Upgrading Strategy . . . . . . . . . . . . . . . . . . . 36
   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 37
   9.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 39
   10. References . . . . . . . . . . . . . . . . . . . . . . . . . . 40
       10.1. Normative References . . . . . . . . . . . . . . . . . . 40
       10.2. Informative References . . . . . . . . . . . . . . . . . 41
   A.  Design Alternatives  . . . . . . . . . . . . . . . . . . . . . 44
       A.1.  New Scheme(s)  . . . . . . . . . . . . . . . . . . . . . 44
       A.2.  Character Encodings Other Than UTF-8 . . . . . . . . . . 44
       A.3.  New Encoding Convention  . . . . . . . . . . . . . . . . 44
       A.4.  Indicating Character Encodings in the URI/IRI  . . . . . 45
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 45
   Full Copyright Statement . . . . . . . . . . . . . . . . . . . . . 46

1.  Introduction

1.1.  Overview and Motivation

   A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
   sequence of characters chosen from a limited subset of the repertoire
   of US-ASCII [ASCII] characters.

   The characters in URIs are frequently used for representing words of
   natural languages.  This usage has many advantages: Such URIs are
   easier to memorize, easier to interpret, easier to transcribe, easier
   to create, and easier to guess.  For most languages other than
   English, however, the natural script uses characters other than A -
   Z. For many people, handling Latin characters is as difficult as
   handling the characters of other scripts is for those who use only
   the Latin alphabet.  Many languages with non-Latin scripts are
   transcribed with Latin letters.  These transcriptions are now often
   used in URIs, but they introduce additional ambiguities.

   The infrastructure for the appropriate handling of characters from
   local scripts is now widely deployed in local versions of operating
   system and application software.  Software that can handle a wide
   variety of scripts and languages at the same time is increasingly
   common.  Also, increasing numbers of protocols and formats can carry
   a wide range of characters.

   This document defines a new protocol element called Internationalized
   Resource Identifier (IRI) by extending the syntax of URIs to a much
   wider repertoire of characters.  It also defines "internationalized"
   versions corresponding to other constructs from [RFC3986], such as
   URI references.  The syntax of IRIs is defined in section 2, and the
   relationship between IRIs and URIs in section 3.

   Using characters outside of A - Z in IRIs brings some difficulties.
   Section 4 discusses the special case of bidirectional IRIs, section 5
   various forms of equivalence between IRIs, and section 6 the use of
   IRIs in different situations.  Section 7 gives additional informative
   guidelines, and section 8 security considerations.

1.2.  Applicability

   IRIs are designed to be compatible with recommendations for new URI
   schemes [RFC2718].  The compatibility is provided by specifying a
   well-defined and deterministic mapping from the IRI character
   sequence to the functionally equivalent URI character sequence.
   Practical use of IRIs (or IRI references) in place of URIs (or URI
   references) depends on the following conditions being met:

   a.  A protocol or format element should be explicitly designated to
       be able to carry IRIs.  The intent is not to introduce IRIs into
       contexts that are not defined to accept them.  For example, XML
       schema [XMLSchema] has an explicit type "anyURI" that includes
       IRIs and IRI references. Therefore, IRIs and IRI references can
       be in attributes and elements of type "anyURI".  On the other
       hand, in the HTTP protocol [RFC2616], the Request URI is defined
       as a URI, which means that direct use of IRIs is not allowed in
       HTTP requests.

   b.  The protocol or format carrying the IRIs should have a mechanism
       to represent the wide range of characters used in IRIs, either
       natively or by some protocol- or format-specific escaping
       mechanism (for example, numeric character references in [XML1]).

   c.  The URI corresponding to the IRI in question has to encode
       original characters into octets using UTF-8.  For new URI
       schemes, this is recommended in [RFC2718].  It can apply to a
       whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
       or the URN syntax [RFC2141]).  It can apply to a specific part of
       a URI, such as the fragment identifier (e.g., [XPointer]).  It
       can apply to a specific URI or part(s) thereof.  For details,
       please see section 6.4.
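
   Condition (c) can be illustrated with a minimal Python sketch.  It
   only shows why the choice of UTF-8 matters when original characters
   are turned into percent-encoded octets; it is not the full mapping
   of section 3.1.  The helper names and the sample segment are
   assumptions made for this illustration.

```python
from urllib.parse import quote, unquote

def percent_encode(text: str, encoding: str) -> str:
    """Percent-encode every reserved or non-ASCII character of `text`,
    turning characters into octets with the given character encoding."""
    return quote(text, safe="", encoding=encoding)

def decodes_as_utf8(percent_encoded: str) -> bool:
    """True if the percent-encoded octets form valid UTF-8."""
    try:
        unquote(percent_encoded, encoding="utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

# Hypothetical path segment containing U+00E9 twice.
segment = "r\u00e9sum\u00e9"

utf8_form = percent_encode(segment, "utf-8")          # 'r%C3%A9sum%C3%A9'
legacy_form = percent_encode(segment, "iso-8859-1")   # 'r%E9sum%E9'
```

   Only the UTF-8 form can be read back as the original characters by
   a consumer that assumes UTF-8; decodes_as_utf8(legacy_form) is
   False, so the legacy form has no corresponding IRI with the
   original characters.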

1.3.  Definitions

   The following definitions are used in this document; they follow the
   terms in [RFC2130], [RFC2277], and [ISO10646].

   character: A member of a set of elements used for the organization,
      control, or representation of data.  For example, "LATIN CAPITAL
      LETTER A" names a character.

   octet: An ordered sequence of eight bits considered as a unit.

   character repertoire: A set of characters (in the mathematical
      sense).

   sequence of characters: A sequence of characters (one after another).

   sequence of octets: A sequence of octets (one after another).

   character encoding: A method of representing a sequence of characters
      as a sequence of octets (possibly with variants).  Also, a method
      of (unambiguously) converting a sequence of octets into a sequence
      of characters.

   charset: The name of a parameter or attribute used to identify a
      character encoding.

   UCS: Universal Character Set. The coded character set defined by
      ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].

   IRI reference: Denotes the common usage of an Internationalized
      Resource Identifier.  An IRI reference may be absolute or
      relative.  However, the "IRI" that results from such a reference
      only includes absolute IRIs; any relative IRI references are
      resolved to their absolute form.  Note that in [RFC2396] URIs did
      not include fragment identifiers, but in [RFC3986] fragment
      identifiers are part of URIs.

   running text: Human text (paragraphs, sentences, phrases) with syntax
      according to orthographic conventions of a natural language, as
      opposed to syntax defined for ease of processing by machines
      (e.g., markup, programming languages).

   protocol element: Any portion of a message that affects processing of
      that message by the protocol in question.

   presentation element: A presentation form corresponding to a protocol
      element; for example, using a wider range of characters.

   create (a URI or IRI): With respect to URIs and IRIs, the term is
      used for the initial creation.  This may be the initial creation
      of a resource with a certain identifier, or the initial exposition
      of a resource under a particular identifier.

   generate (a URI or IRI): With respect to URIs and IRIs, the term is
      used when the URI or IRI is generated by derivation from other
      information.

1.4.  Notation

   RFCs and Internet Drafts currently do not allow any characters
   outside the US-ASCII repertoire.  Therefore, this document uses
   various special notations to denote such characters in examples.

   In text, characters outside US-ASCII are sometimes referenced by
   using a prefix of 'U+', followed by four to six hexadecimal digits.

   To represent characters outside US-ASCII in examples, this document
   uses two notations: 'XML Notation' and 'Bidi Notation'.

   XML Notation uses a leading '&#x', a trailing ';', and the
   hexadecimal number of the character in the UCS in between.  For
   example, &#x44F; stands for CYRILLIC SMALL LETTER YA.  In this
   notation, an actual '&' is denoted by '&amp;'.
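
   Readers who want to turn examples written in XML Notation into
   actual characters (or back) could use something like the following
   Python sketch.  The function names are hypothetical, and the code
   handles only the two constructs described above, not XML in
   general.

```python
import re

# Matches either a hexadecimal character reference or the escaped ampersand.
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")

def from_xml_notation(text: str) -> str:
    """Replace '&#xHHHH;' with the UCS character and '&amp;' with '&'."""
    def repl(match: re.Match) -> str:
        hex_digits = match.group(1)
        return "&" if hex_digits is None else chr(int(hex_digits, 16))
    # A single pass, so the '&' produced from '&amp;' is never re-read.
    return _XML_NOTATION.sub(repl, text)

def to_xml_notation(text: str) -> str:
    """Write characters outside US-ASCII as '&#xHHHH;' and '&' as '&amp;'."""
    parts = []
    for ch in text:
        if ch == "&":
            parts.append("&amp;")
        elif ord(ch) > 0x7F:
            parts.append(f"&#x{ord(ch):X};")
        else:
            parts.append(ch)
    return "".join(parts)
```

   For instance, from_xml_notation("&#x44F;") yields the single
   character U+044F, and to_xml_notation() of that character gives
   back "&#x44F;".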

   Bidi Notation is used for bidirectional examples: Lowercase letters
   stand for Latin letters or other letters that are written left to
   right, whereas uppercase letters represent Arabic or Hebrew letters
   that are written right to left.

   To denote actual octets in examples (as opposed to percent-encoded
   octets), the two hex digits denoting the octet are enclosed in "<"
   and ">".  For example, the octet often denoted as 0xc9 is denoted
   here as <c9>.
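
   The octet notation can be reproduced mechanically.  The following
   Python sketch (the function names are assumptions for this
   illustration) renders a byte string in the "<xx>" form used here
   and, for contrast, in percent-encoded form.

```python
def octet_notation(data: bytes) -> str:
    """Render each octet as two lowercase hex digits enclosed in '<' and '>'."""
    return "".join(f"<{octet:02x}>" for octet in data)

def percent_encoded(data: bytes) -> str:
    """Render each octet as '%' followed by two uppercase hex digits."""
    return "".join(f"%{octet:02X}" for octet in data)

# U+00C9 is one octet in ISO-8859-1 but two octets in UTF-8.
latin1_octets = octet_notation("\u00c9".encode("iso-8859-1"))  # '<c9>'
utf8_octets = octet_notation("\u00c9".encode("utf-8"))         # '<c3><89>'
```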

   In this document, the key words "MUST", "MUST NOT", "REQUIRED",
   "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
   and "OPTIONAL" are to be interpreted as described in [RFC2119].

2.  IRI Syntax

   This section defines the syntax of Internationalized Resource
   Identifiers (IRIs).

   As with URIs, an IRI is defined as a sequence of characters, not as a
   sequence of octets.  This definition accommodates the fact that IRIs
   may be written on paper or read over the radio as well as stored or
   transmitted digitally.  The same IRI may be represented as different
   sequences of octets in different protocols or documents if these
   protocols or documents use different character encodings (and/or
   transfer encodings).  Using the same character encoding as the
   containing protocol or document ensures that the characters in the
   IRI can be handled (e.g., searched, converted, displayed) in the same
   way as the rest of the protocol or document.
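
   As a rough illustration of the distinction between characters and
   octets, the Python sketch below encodes one IRI under three
   character encodings.  The IRI and the helper name are hypothetical;
   the point is only that the octet sequences differ while the
   character sequence is the same.

```python
def octet_forms(iri: str, encodings: list[str]) -> dict[str, bytes]:
    """Encode one sequence of characters under several character encodings."""
    return {name: iri.encode(name) for name in encodings}

# Hypothetical example IRI; U+00E9 is LATIN SMALL LETTER E WITH ACUTE.
iri = "http://example.org/r\u00e9sum\u00e9"
forms = octet_forms(iri, ["utf-8", "iso-8859-1", "utf-16-be"])

# The three encodings produce three different octet sequences...
distinct = len(set(forms.values()))

# ...but each one decodes back to the same sequence of characters.
same_characters = all(
    octets.decode(name) == iri for name, octets in forms.items()
)
```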

2.1.  Summary of IRI Syntax

   IRIs are defined similarly to URIs in [RFC3986], but the class of
   unreserved characters is extended by adding the characters of the UCS
   (Universal Character Set, [ISO10646]) beyond U+007F, subject to the
   limitations given in the syntax rules below and in section 6.1.

   Otherwise, the syntax and use of components and reserved characters
   is the same as that in [RFC3986].  All the operations defined in
   [RFC3986], such as the resolution of relative references, can be
   applied to IRIs by IRI-processing software in exactly the same way as
   they are for URIs by URI-processing software.

   Characters outside the US-ASCII repertoire are not reserved and
   therefore MUST NOT be used for syntactical purposes, such as to
   delimit components in newly defined schemes.  For example, U+00A2,
   CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
   the 'iunreserved' category. This is similar to the fact that it is
   not possible to use '-' as a delimiter in URIs, because it is in the
   'unreserved' category.

2.2.  ABNF for IRI References and IRIs

   Although it might be possible to define IRI references and IRIs
   merely by their transformation to URI references and URIs, they can
   also be accepted and processed directly.  Therefore, an ABNF
   definition for IRI references (which are the most general concept and
   the start of the grammar) and IRIs is given here.  The syntax of this
   ABNF is described in [RFC2234].  Character numbers are taken from the
   UCS, without implying any actual binary encoding.  Terminals in the
   ABNF are characters, not bytes.

   The following grammar closely follows the URI grammar in [RFC3986],
   except that the range of unreserved characters is expanded to include
   UCS characters, with the restriction that private UCS characters can
   occur only in query parts.  The grammar is split into two parts:
   Rules that differ from [RFC3986] because of the above-mentioned
   expansion, and rules that are the same as those in [RFC3986].  For
   rules that are different from those in [RFC3986], the names of the
   non-terminals have been changed as follows.  If the non-terminal
   contains 'URI', this has been changed to 'IRI'.  Otherwise, an 'i'
   has been prefixed.
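
   The renaming convention is mechanical and can be sketched in a few
   lines of Python.  The function name is hypothetical, and the
   function is meant only for non-terminals whose rules differ from
   [RFC3986]; rules that are unchanged keep their names.

```python
def iri_rule_name(uri_rule: str) -> str:
    """Derive the IRI non-terminal name from a changed URI non-terminal.

    If the name contains 'URI', that part becomes 'IRI';
    otherwise an 'i' is prefixed.
    """
    if "URI" in uri_rule:
        return uri_rule.replace("URI", "IRI")
    return "i" + uri_rule

renamed = {name: iri_rule_name(name)
           for name in ["URI", "URI-reference", "absolute-URI",
                        "hier-part", "reg-name", "pchar"]}
# {'URI': 'IRI', 'URI-reference': 'IRI-reference',
#  'absolute-URI': 'absolute-IRI', 'hier-part': 'ihier-part',
#  'reg-name': 'ireg-name', 'pchar': 'ipchar'}
```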

   The following rules are different from those in [RFC3986]:

   IRI            = scheme ":" ihier-part [ "?" iquery ]
                         [ "#" ifragment ]

   ihier-part     = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-rootless
                  / ipath-empty

   IRI-reference  = IRI / irelative-ref

   absolute-IRI   = scheme ":" ihier-part [ "?" iquery ]

   irelative-ref  = irelative-part [ "?" iquery ] [ "#" ifragment ]

   irelative-part = "//" iauthority ipath-abempty
                  / ipath-absolute
                  / ipath-noscheme
                  / ipath-empty

   iauthority     = [ iuserinfo "@" ] ihost [ ":" port ]
   iuserinfo      = *( iunreserved / pct-encoded / sub-delims / ":" )
   ihost          = IP-literal / IPv4address / ireg-name

   ireg-name      = *( iunreserved / pct-encoded / sub-delims )

   ipath          = ipath-abempty   ; begins with "/" or is empty
                  / ipath-absolute  ; begins with "/" but not "//"
                  / ipath-noscheme  ; begins with a non-colon segment
                  / ipath-rootless  ; begins with a segment
                  / ipath-empty     ; zero characters

   ipath-abempty  = *( "/" isegment )
   ipath-absolute = "/" [ isegment-nz *( "/" isegment ) ]
   ipath-noscheme = isegment-nz-nc *( "/" isegment )
   ipath-rootless = isegment-nz *( "/" isegment )
   ipath-empty    = 0<ipchar>

   isegment       = *ipchar
   isegment-nz    = 1*ipchar
   isegment-nz-nc = 1*( iunreserved / pct-encoded / sub-delims
                        / "@" )
                  ; non-zero-length segment without any colon ":"

   ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"

   iquery         = *( ipchar / iprivate / "/" / "?" )

   ifragment      = *( ipchar / "/" / "?" )

   iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar

   ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

   iprivate       = %xE000-F8FF / %xF0000-FFFFD / %x100000-10FFFD
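
   The 'ucschar', 'iprivate', and 'iunreserved' rules above translate
   directly into range checks on code points.  The following Python
   sketch is one way to write them; the function and constant names
   are assumptions for this illustration, and the limitations of
   section 6.1 are not applied.

```python
import string

# Inclusive code point ranges copied from the 'ucschar' rule.
UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    # Planes 1 through 13: %x10000-1FFFD up to %xD0000-DFFFD.
    + [(plane << 16, (plane << 16) | 0xFFFD) for plane in range(0x1, 0xE)]
    # Plane 14 starts at E1000, not E0000.
    + [(0xE1000, 0xEFFFD)]
)

# Inclusive code point ranges copied from the 'iprivate' rule.
IPRIVATE_RANGES = [(0xE000, 0xF8FF), (0xF0000, 0xFFFFD), (0x100000, 0x10FFFD)]

# ALPHA and DIGIT in ABNF are US-ASCII only.
_ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

def _in_ranges(ch: str, ranges: list[tuple[int, int]]) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in ranges)

def is_ucschar(ch: str) -> bool:
    return _in_ranges(ch, UCSCHAR_RANGES)

def is_iprivate(ch: str) -> bool:
    return _in_ranges(ch, IPRIVATE_RANGES)

def is_iunreserved(ch: str) -> bool:
    return ch in _ASCII_UNRESERVED or is_ucschar(ch)
```

   With these definitions, is_iunreserved("\u00a2") is True, which
   matches the CENT SIGN example in section 2.1, whereas U+E000 is
   accepted only by is_iprivate() and is therefore usable only in
   query parts.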

   Some productions are ambiguous.  The "first-match-wins" (a.k.a.
   "greedy") algorithm applies.  For details, see [RFC3986].

   The following rules are the same as those in [RFC3986]:

   scheme         = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   port           = *DIGIT

   IP-literal     = "[" ( IPv6address / IPvFuture  ) "]"

   IPvFuture      = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

   IPv6address    =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

   h16            = 1*4HEXDIG
   ls32           = ( h16 ":" h16 ) / IPv4address

   IPv4address    = dec-octet "." dec-octet "." dec-octet "." dec-octet

   dec-octet      = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255

   pct-encoded    = "%" HEXDIG HEXDIG

   unreserved     = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved       = gen-delims / sub-delims
   gen-delims     = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims     = "!" / "$" / "&" / "'" / "(" / ")"
                  / "*" / "+" / "," / ";" / "="

   This syntax does not support IPv6 scoped addressing zone identifiers.
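
   As an informal aid, the 'dec-octet' and 'IPv4address' rules can be
   expressed as a regular expression.  The Python sketch below is one
   possible transcription; the names are assumptions, and ASCII digit
   classes are used because DIGIT in the ABNF covers only "0" to "9".

```python
import re

# dec-octet alternatives, longest first: 250-255, 200-249, 100-199, 10-99, 0-9.
DEC_OCTET = r"(?:25[0-5]|2[0-4][0-9]|1[0-9]{2}|[1-9][0-9]|[0-9])"

# IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet
IPV4ADDRESS = re.compile(rf"{DEC_OCTET}(?:\.{DEC_OCTET}){{3}}")

def is_ipv4address(text: str) -> bool:
    """True if the whole string matches the IPv4address rule."""
    return IPV4ADDRESS.fullmatch(text) is not None
```

   Under this sketch, "192.0.2.1" matches, while "256.0.0.1" and
   "01.2.3.4" do not.  A host that fails this check is not
   necessarily invalid, since it may still match 'ireg-name'.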