Network Working Group M. Duerst
Request for Comments: 3987 W3C
Category: Standards Track M. Suignard
January 2005
Internationalized Resource Identifiers (IRIs)
Status of This Memo
This document specifies an Internet standards track protocol for the
Internet community, and requests discussion and suggestions for
improvements. Please refer to the current edition of the "Internet
Official Protocol Standards" (STD 1) for the standardization state
and status of this protocol. Distribution of this memo is unlimited.
Copyright (C) The Internet Society (2005).
Abstract
This document defines a new protocol element, the Internationalized
Resource Identifier (IRI), as a complement to the Uniform Resource
Identifier (URI). An IRI is a sequence of characters from the
Universal Character Set (Unicode/ISO 10646). A mapping from IRIs to
URIs is defined, which means that IRIs can be used instead of URIs,
where appropriate, to identify resources.
The approach of defining a new protocol element was chosen instead of
extending or changing the definition of URIs. This was done in order
to allow a clear distinction and to avoid incompatibilities with
existing software. Guidelines are provided for the use and
deployment of IRIs in various protocols, formats, and software
components that currently deal with URIs.
Table of Contents
1. Introduction
1.1. Overview and Motivation
1.2. Applicability
1.3. Definitions
1.4. Notation
2. IRI Syntax
2.1. Summary of IRI Syntax
2.2. ABNF for IRI References and IRIs
1. Introduction
1.1. Overview and Motivation
A Uniform Resource Identifier (URI) is defined in [RFC3986] as a
sequence of characters chosen from a limited subset of the repertoire
of US-ASCII [ASCII] characters.
The characters in URIs are frequently used for representing words of
natural languages. This usage has many advantages: Such URIs are
easier to memorize, easier to interpret, easier to transcribe, easier
to create, and easier to guess. For most languages other than
English, however, the natural script uses characters other than A -
Z. For many people, handling Latin characters is as difficult as
handling the characters of other scripts is for those who use only
the Latin alphabet. Many languages with non-Latin scripts are
transcribed with Latin letters. These transcriptions are now often
used in URIs, but they introduce additional ambiguities.
The infrastructure for the appropriate handling of characters from
local scripts is now widely deployed in local versions of operating
system and application software. Software that can handle a wide
variety of scripts and languages at the same time is increasingly
common. Also, increasing numbers of protocols and formats can carry
a wide range of characters.
This document defines a new protocol element called Internationalized
Resource Identifier (IRI) by extending the syntax of URIs to a much
wider repertoire of characters. It also defines "internationalized"
versions corresponding to other constructs from [RFC3986], such as
URI references. The syntax of IRIs is defined in section 2, and the
relationship between IRIs and URIs in section 3.
Using characters outside of A - Z in IRIs brings some difficulties.
Section 4 discusses the special case of bidirectional IRIs, section 5
various forms of equivalence between IRIs, and section 6 the use of
IRIs in different situations. Section 7 gives additional informative
guidelines, and section 8 security considerations.
1.2. Applicability
IRIs are designed to be compatible with recommendations for new URI
schemes [RFC2718]. The compatibility is provided by specifying a
well-defined and deterministic mapping from the IRI character
sequence to the functionally equivalent URI character sequence.
Practical use of IRIs (or IRI references) in place of URIs (or URI
references) depends on the following conditions being met:
a. A protocol or format element should be explicitly designated to
be able to carry IRIs. The intent is not to introduce IRIs into
contexts that are not defined to accept them. For example, XML
schema [XMLSchema] has an explicit type "anyURI" that includes
IRIs and IRI references. Therefore, IRIs and IRI references can
be in attributes and elements of type "anyURI". On the other
hand, in the HTTP protocol [RFC2616], the Request URI is defined
as a URI, which means that direct use of IRIs is not allowed in
HTTP requests.
b. The protocol or format carrying the IRIs should have a mechanism
to represent the wide range of characters used in IRIs, either
natively or by some protocol- or format-specific escaping
mechanism (for example, numeric character references in [XML1]).
c. The URI corresponding to the IRI in question has to encode
original characters into octets using UTF-8. For new URI
schemes, this is recommended in [RFC2718]. It can apply to a
whole scheme (e.g., IMAP URLs [RFC2192] and POP URLs [RFC2384],
or the URN syntax [RFC2141]). It can apply to a specific part of
a URI, such as the fragment identifier (e.g., [XPointer]). It
can apply to a specific URI or part(s) thereof. For details,
please see section 6.4.
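As a rough illustration of condition (c), the following Python sketch
percent-encodes every character beyond US-ASCII as its UTF-8 octets.
The function name iri_to_uri_sketch is hypothetical, and the sketch
makes no attempt at scheme-specific or host-specific processing, so it
approximates the idea of the mapping rather than reproducing the
normative procedure.

```python
def iri_to_uri_sketch(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets.

    Hypothetical helper for illustration only; US-ASCII characters
    (including existing percent-encodings) are passed through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)  # US-ASCII: keep as is
        else:
            # One "%HH" triplet per UTF-8 octet of the character
            out.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri_sketch("http://example.org/r\u00e9sum\u00e9"))
```

A protocol element that is defined to carry only URIs could then be
given the output of such a conversion instead of the original IRI.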
1.3. Definitions
The following definitions are used in this document; they follow the
terms in [RFC2130], [RFC2277], and [ISO10646].
character: A member of a set of elements used for the organization,
control, or representation of data. For example, "LATIN CAPITAL
LETTER A" names a character.
octet: An ordered sequence of eight bits considered as a unit.
character repertoire: A set of characters (in the mathematical
sense).
sequence of characters: A sequence of characters (one after another).
sequence of octets: A sequence of octets (one after another).
character encoding: A method of representing a sequence of characters
as a sequence of octets (maybe with variants). Also, a method of
(unambiguously) converting a sequence of octets into a sequence of
characters.
charset: The name of a parameter or attribute used to identify a
character encoding.
UCS: Universal Character Set. The coded character set defined by
ISO/IEC 10646 [ISO10646] and the Unicode Standard [UNIV4].
IRI reference: Denotes the common usage of an Internationalized
Resource Identifier. An IRI reference may be absolute or
relative. However, the "IRI" that results from such a reference
only includes absolute IRIs; any relative IRI references are
resolved to their absolute form. Note that in [RFC2396] URIs did
not include fragment identifiers, but in [RFC3986] fragment
identifiers are part of URIs.
running text: Human text (paragraphs, sentences, phrases) with syntax
according to orthographic conventions of a natural language, as
opposed to syntax defined for ease of processing by machines
(e.g., markup, programming languages).
protocol element: Any portion of a message that affects processing of
that message by the protocol in question.
presentation element: A presentation form corresponding to a protocol
element; for example, using a wider range of characters.
create (a URI or IRI): With respect to URIs and IRIs, the term is
used for the initial creation. This may be the initial creation
of a resource with a certain identifier, or the initial exposition
of a resource under a particular identifier.
generate (a URI or IRI): With respect to URIs and IRIs, the term is
used when the IRI is generated by derivation from other
information.
1.4. Notation
RFCs and Internet Drafts currently do not allow any characters
outside the US-ASCII repertoire. Therefore, this document uses
various special notations to denote such characters in examples.
In text, characters outside US-ASCII are sometimes referenced by
using a prefix of 'U+', followed by four to six hexadecimal digits.
To represent characters outside US-ASCII in examples, this document
uses two notations: 'XML Notation' and 'Bidi Notation'.
XML Notation uses a leading '&#x', a trailing ';', and the
hexadecimal number of the character in the UCS in between. For
example, &#x44F; stands for CYRILLIC SMALL LETTER YA. In this
notation, an actual '&' is denoted by '&amp;'.
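The XML Notation and the 'U+' prefix can be converted mechanically.
The following Python sketch turns a string written in XML Notation
('&#x' + hexadecimal digits + ';', with '&amp;' standing for an actual
'&') into the characters it denotes, and formats a character in 'U+'
notation. The names decode_xml_notation and u_plus are hypothetical
and chosen only for this illustration.

```python
import re

# Matches either a hexadecimal character reference or the escaped '&'
_XML_NOTATION = re.compile(r"&#x([0-9A-Fa-f]+);|&amp;")


def decode_xml_notation(text: str) -> str:
    """Replace '&#xHHHH;' by the UCS character and '&amp;' by '&'."""
    def _sub(match):
        hex_digits = match.group(1)
        if hex_digits is None:
            return "&"  # the match was the literal '&amp;'
        return chr(int(hex_digits, 16))
    # A single pass avoids decoding the output of '&amp;' a second time
    return _XML_NOTATION.sub(_sub, text)


def u_plus(ch: str) -> str:
    """Format one character as 'U+' followed by four to six hex digits."""
    return f"U+{ord(ch):04X}"


decoded = decode_xml_notation("http://example.org/&#x44F;?a=1&amp;b=2")
print(decoded)
print(u_plus(decoded[19]))
```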
Bidi Notation is used for bidirectional examples: Lowercase letters
stand for Latin letters or other letters that are written left to
right, whereas uppercase letters represent Arabic or Hebrew letters
that are written right to left.
To denote actual octets in examples (as opposed to percent-encoded
octets), the two hex digits denoting the octet are enclosed in "<"
and ">". For example, the octet often denoted as 0xc9 is denoted
here as <c9>.
In this document, the key words "MUST", "MUST NOT", "REQUIRED",
"SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY",
and "OPTIONAL" are to be interpreted as described in [RFC2119].
2. IRI Syntax
This section defines the syntax of Internationalized Resource
Identifiers (IRIs).
As with URIs, an IRI is defined as a sequence of characters, not as a
sequence of octets. This definition accommodates the fact that IRIs
may be written on paper or read over the radio as well as stored or
transmitted digitally. The same IRI may be represented as different
sequences of octets in different protocols or documents if these
protocols or documents use different character encodings (and/or
transfer encodings). Using the same character encoding as the
containing protocol or document ensures that the characters in the
IRI can be handled (e.g., searched, converted, displayed) in the same
way as the rest of the protocol or document.
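To make the distinction between characters and octets concrete, the
following Python sketch encodes one IRI in two character encodings and
prints the resulting octets, showing non-ASCII octets in the "<xx>"
notation of section 1.4. The helper name octet_notation and the
example IRI are assumptions made for this illustration.

```python
def octet_notation(data: bytes) -> str:
    """Render printable US-ASCII octets literally and all others as <xx>.

    Illustrative only: a literal '<' in the input is not escaped.
    """
    return "".join(
        chr(b) if 0x20 <= b < 0x7F else f"<{b:02x}>" for b in data
    )


iri = "http://example.org/caf\u00e9"  # one IRI, a sequence of characters

# The same characters yield different octet sequences per encoding
for encoding in ("utf-8", "iso-8859-1"):
    print(encoding, octet_notation(iri.encode(encoding)))
```

Decoding either octet sequence with the matching character encoding
yields the same sequence of characters, that is, the same IRI.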
2.1. Summary of IRI Syntax
IRIs are defined similarly to URIs in [RFC3986], but the class of
unreserved characters is extended by adding the characters of the UCS
(Universal Character Set, [ISO10646]) beyond U+007F, subject to the
limitations given in the syntax rules below and in section 6.1.
Otherwise, the syntax and use of components and reserved characters
is the same as that in [RFC3986]. All the operations defined in
[RFC3986], such as the resolution of relative references, can be
applied to IRIs by IRI-processing software in exactly the same way as
they are for URIs by URI-processing software.
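As an informal illustration, relative reference resolution can be
tried on IRIs with an existing URI library. The sketch below uses
urljoin from Python's standard urllib.parse module, which operates on
Unicode strings; that it leaves non-ASCII characters untouched is an
observation about this library, not something this document
guarantees. The wrapper name resolve_iri_reference and the example
IRIs are hypothetical.

```python
from urllib.parse import urljoin


def resolve_iri_reference(base: str, reference: str) -> str:
    """Resolve a relative IRI reference against a base IRI.

    Thin wrapper around urljoin; non-ASCII characters are treated
    like any other path or query characters.
    """
    return urljoin(base, reference)


base = "http://example.org/r\u00e9sum\u00e9/index.html"
print(resolve_iri_reference(base, "../caf\u00e9"))
print(resolve_iri_reference(base, "sub/\u65e5\u672c"))
```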
Characters outside the US-ASCII repertoire are not reserved and
therefore MUST NOT be used for syntactical purposes, such as to
delimit components in newly defined schemes. For example, U+00A2,
CENT SIGN, is not allowed as a delimiter in IRIs, because it is in
the 'iunreserved' category. This is similar to the fact that it is
not possible to use '-' as a delimiter in URIs, because it is in the
'unreserved' category.
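The membership test implied here can be sketched in Python. The code
point ranges below are transcribed from the 'ucschar' rule of the full
grammar in section 2.2, and the US-ASCII part follows the 'unreserved'
rule of [RFC3986]; the function name is_iunreserved is hypothetical.
Treat this as a sketch under those assumptions and check it against
the ABNF before relying on it.

```python
import string

# US-ASCII unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~"
_ASCII_UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

# 'ucschar' ranges: three BMP ranges, planes 1 to 13 without their last
# two code points, and the upper part of plane 14
_UCSCHAR_RANGES = (
    [(0xA0, 0xD7FF), (0xF900, 0xFDCF), (0xFDF0, 0xFFEF)]
    + [(plane << 16, (plane << 16) + 0xFFFD) for plane in range(1, 14)]
    + [(0xE1000, 0xEFFFD)]
)


def is_iunreserved(ch: str) -> bool:
    """Return True if the single character ch is in 'iunreserved'."""
    if ch in _ASCII_UNRESERVED:
        return True
    cp = ord(ch)
    return any(low <= cp <= high for low, high in _UCSCHAR_RANGES)


print(is_iunreserved("\u00a2"))  # CENT SIGN
print(is_iunreserved("-"))
print(is_iunreserved("/"))       # a reserved delimiter
```

Because both U+00A2 and '-' fall into this category, neither is
available as a component delimiter.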
2.2. ABNF for IRI References and IRIs
Although it might be possible to define IRI references and IRIs
merely by their transformation to URI references and URIs, they can
also be accepted and processed directly. Therefore, an ABNF
definition for IRI references (which are the most general concept and
the start of the grammar) and IRIs is given here. The syntax of this
ABNF is described in [RFC2234]. Character numbers are taken from the
UCS, without implying any actual binary encoding. Terminals in the
ABNF are characters, not bytes.
The following grammar closely follows the URI grammar in [RFC3986],
except that the range of unreserved characters is expanded to include
UCS characters, with the restriction that private UCS characters can
occur only in query parts. The grammar is split into two parts:
Rules that differ from [RFC3986] because of the above-mentioned
expansion, and rules that are the same as those in [RFC3986]. For
rules that differ from those in [RFC3986], the names of the
non-terminals have been changed as follows. If the non-terminal
contains 'URI', this has been changed to 'IRI'. Otherwise, an 'i'
has been prefixed.
The following rules are different from those in [RFC3986]:
IRI = scheme ":" ihier-part [ "?" iquery ]
      [ "#" ifragment ]
ihier-part = "//" iauthority ipath-abempty
           / ipath-absolute
           / ipath-rootless
           / ipath-empty
IRI-reference = IRI / irelative-ref
absolute-IRI = scheme ":" ihier-part [ "?" iquery ]
irelative-ref = irelative-part [ "?" iquery ] [ "#" ifragment ]
irelative-part = "//" iauthority ipath-abempty