6. Normalization and Comparison
One of the most common operations on URIs is simple comparison:
determining whether two URIs are equivalent without using the URIs to
access their respective resource(s). A comparison is performed every
time a response cache is accessed, a browser checks its history to
color a link, or an XML parser processes tags within a namespace.
Extensive normalization prior to comparison of URIs is often used by
spiders and indexing engines to prune a search space or to reduce
duplication of request actions and response storage.
URI comparison is performed for some particular purpose. Protocols
or implementations that compare URIs for different purposes will
often be subject to differing design trade-offs in regard to how
much effort should be spent in reducing aliased identifiers. This
section describes various methods that may be used to compare URIs,
the trade-offs between them, and the types of applications that might
use them.
Because URIs exist to identify resources, presumably they should be
considered equivalent when they identify the same resource. However,
this definition of equivalence is not of much practical use, as there
is no way for an implementation to compare two resources unless it
has full knowledge or control of them. For this reason,
determination of equivalence or difference of URIs is based on string
comparison, perhaps augmented by reference to additional rules
provided by URI scheme definitions. We use the terms "different" and
"equivalent" to describe the possible outcomes of such comparisons,
but there are many application-dependent versions of equivalence.
Even though it is possible to determine that two URIs are equivalent,
URI comparison is not sufficient to determine whether two URIs
identify different resources. For example, an owner of two different
domain names could decide to serve the same resource from both,
resulting in two different URIs. Therefore, comparison methods are
designed to minimize false negatives while strictly avoiding false
positives.
In testing for equivalence, applications should not directly compare
relative references; the references should be converted to their
respective target URIs before comparison. When URIs are compared to
select (or avoid) a network action, such as retrieval of a
representation, fragment components (if any) should be excluded from
the comparison.
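As a non-normative illustration, the following Python sketch derives
a comparison key by resolving a reference against a base URI and then
dropping any fragment. It assumes that urllib.parse.urljoin is an
adequate stand-in for reference resolution (Section 5.2); the
function name comparison_key and the example URIs are hypothetical.

```python
from urllib.parse import urldefrag, urljoin


def comparison_key(reference: str, base: str) -> str:
    """Resolve a reference to its target URI and drop the fragment."""
    target = urljoin(base, reference)   # relative reference -> target URI
    return urldefrag(target).url        # exclude the fragment, if any


# A relative and an absolute reference that share one target URI.
key_a = comparison_key("../c#intro", "http://example.com/a/b/")
key_b = comparison_key("http://example.com/a/c", "http://example.org/")
```

Both keys can then be passed to whichever comparison method from the
ladder below suits the application.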
6.2. Comparison Ladder
A variety of methods are used in practice to test URI equivalence.
These methods fall into a range, distinguished by the amount of
processing required and the degree to which the probability of false
negatives is reduced. As noted above, false negatives cannot be
eliminated. In practice, their probability can be reduced, but this
reduction requires more processing and is not cost-effective for all
applications.
If this range of comparison practices is considered as a ladder, the
following discussion will climb the ladder, starting with practices
that are cheap but have a relatively higher chance of producing false
negatives, and proceeding to those that have higher computational
cost and lower risk of false negatives.
6.2.1. Simple String Comparison
If two URIs, when considered as character strings, are identical,
then it is safe to conclude that they are equivalent. This type of
equivalence test has very low computational cost and is in wide use
in a variety of applications, particularly in the domain of parsing.
Testing strings for equivalence requires some basic precautions.
This procedure is often referred to as "bit-for-bit" or
"byte-for-byte" comparison, which is potentially misleading. Testing
strings for equality is normally based on pair comparison of the
characters that make up the strings, starting from the first and
proceeding until both strings are exhausted and all characters are
found to be equal, until a pair of characters compares unequal, or
until one of the strings is exhausted before the other.
This character comparison requires that each pair of characters be
put in comparable form. For example, should one URI be stored in a
byte array in EBCDIC encoding and the second in a Java String object
(UTF-16), bit-for-bit comparisons applied naively will produce
errors. It is better to speak of equality on a character-for-
character basis rather than on a byte-for-byte or bit-for-bit basis.
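The difference can be sketched in Python as follows. The codec name
"cp037" is an assumption, chosen as one common EBCDIC variant, and
"utf-16-be" only approximates how a Java String stores its data; the
helper characters_equal is hypothetical.

```python
def characters_equal(raw: bytes, encoding: str, text: str) -> bool:
    """Decode the byte form first, then compare character by character."""
    return raw.decode(encoding) == text


uri = "http://example.com/"
ebcdic = uri.encode("cp037")      # EBCDIC byte form (assumed codec)
utf16 = uri.encode("utf-16-be")   # UTF-16 byte form

naive_equal = ebcdic == utf16     # raw bytes compared directly
decoded_equal = characters_equal(ebcdic, "cp037", uri)
```

The naive byte comparison reports a difference even though both
buffers hold the same characters; the decoded comparison does not.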
In practical terms, character-by-character comparisons should be done
codepoint-by-codepoint after conversion to a common character
encoding.
False negatives are caused by the production and use of URI aliases.
Unnecessary aliases can be reduced, regardless of the comparison
method, by consistently providing URI references in an already-
normalized form (i.e., a form identical to what would be produced
after normalization is applied, as described below).
Protocols and data formats often limit some URI comparisons to simple
string comparison, based on the theory that people and
implementations will, in their own best interest, be consistent in
providing URI references, or at least consistent enough to negate any
efficiency that might be obtained from further normalization.
6.2.2. Syntax-Based Normalization
Implementations may use logic based on the definitions provided by
this specification to reduce the probability of false negatives.
This processing is moderately higher in cost than character-for-
character string comparison. For example, an application using this
approach could reasonably consider the following two URIs equivalent:
Web user agents, such as browsers, typically apply this type of URI
normalization when determining whether a cached response is
available. Syntax-based normalization includes such techniques as
case normalization, percent-encoding normalization, and removal of
dot-segments.
6.2.2.1. Case Normalization
For all URIs, the hexadecimal digits within a percent-encoding
triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
should be normalized to use uppercase letters for the digits A-F.
When a URI uses components of the generic syntax, the component
syntax equivalence rules always apply; namely, that the scheme and
host are case-insensitive and therefore should be normalized to
lowercase. For example, the URI <HTTP://www.EXAMPLE.com/> is
equivalent to <http://www.example.com/>. The other generic syntax
components are assumed to be case-sensitive unless specifically
defined otherwise by the scheme (see Section 6.2.3).
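A minimal Python sketch of these two rules follows. It is not a
definitive implementation: the component-splitting pattern is the
commonly used regular expression for URI references and is an
assumption here, as is the function name normalize_case. Userinfo,
path, query, and fragment are left untouched apart from the
hexadecimal digits of percent-encoding triplets.

```python
import re

# Assumed pattern: scheme=$2, authority=$4, path=$5, "?query"=$6, "#frag"=$8.
_URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
_PCT = re.compile(r"%[0-9A-Fa-f]{2}")


def normalize_case(uri: str) -> str:
    m = _URI.match(uri)
    parts = []
    if m.group(1) is not None:
        parts.append(m.group(2).lower() + ":")            # scheme
    if m.group(3) is not None:
        userinfo, at, hostport = m.group(4).rpartition("@")
        parts.append("//" + userinfo + at + hostport.lower())  # host (and port)
    parts.append(m.group(5))                              # path, case kept
    parts.append(m.group(6) or "")                        # "?" + query
    parts.append(m.group(8) or "")                        # "#" + fragment
    # Uppercase the hex digits of every percent-encoding triplet.
    return _PCT.sub(lambda t: t.group(0).upper(), "".join(parts))
```

For instance, normalize_case("HTTP://www.EXAMPLE.com/") yields
"http://www.example.com/", while a path such as "/Path" keeps its
case.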
6.2.2.2. Percent-Encoding Normalization
The percent-encoding mechanism (Section 2.1) is a frequent source of
variance among otherwise identical URIs. In addition to the case
normalization issue noted above, some URI producers percent-encode
octets that do not require percent-encoding, resulting in URIs that
are equivalent to their non-encoded counterparts. These URIs should
be normalized by decoding any percent-encoded octet that corresponds
to an unreserved character, as described in Section 2.3.
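The following Python sketch illustrates the idea. It assumes the
unreserved set is ALPHA, DIGIT, "-", ".", "_", and "~" (the set is
defined in Section 2.3, not in this section), and the function name
decode_unreserved is hypothetical. Triplets that stay encoded are
uppercased, per the case normalization rule.

```python
import re
import string

# Assumed unreserved set: ALPHA / DIGIT / "-" / "." / "_" / "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def decode_unreserved(uri: str) -> str:
    def replace(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                  # safe to decode
        return match.group(0).upper()    # keep encoded, uppercase hex
    # A single pass, so "%2541" stays "%2541" rather than becoming "A".
    return _PCT.sub(replace, uri)
```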
6.2.2.3. Path Segment Normalization
The complete path segments "." and ".." are intended only for use
within relative references (Section 4.1) and are removed as part of
the reference resolution process (Section 5.2). However, some
deployed implementations incorrectly assume that reference resolution
is not necessary when the reference is already a URI and thus fail to
remove dot-segments when they occur in non-relative paths. URI
normalizers should remove dot-segments by applying the
remove_dot_segments algorithm to the path, as described in
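The algorithm itself is defined elsewhere in this specification. As
a hedged sketch only, one way to write it in Python is shown below,
assuming the usual formulation with an input buffer that is consumed
from the left and an output buffer of completed segments.

```python
def remove_dot_segments(path: str) -> str:
    output = []                      # completed segments, each with its "/"
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]          # "/./" becomes "/"
        elif path == "/.":
            path = "/"
        elif path.startswith("/../"):
            path = path[3:]          # "/../" becomes "/"
            if output:
                output.pop()         # drop the preceding segment
        elif path == "/..":
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):
            path = ""
        else:
            # Move the first segment (with any leading "/") to the output.
            end = path.find("/", 1)
            if end == -1:
                end = len(path)
            output.append(path[:end])
            path = path[end:]
    return "".join(output)
```

For example, "/a/b/c/./../../g" reduces to "/a/g".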
6.2.3. Scheme-Based Normalization
The syntax and semantics of URIs vary from scheme to scheme, as
described by the defining specification for each scheme.
Implementations may use scheme-specific rules, at further processing
cost, to reduce the probability of false negatives. For example,
because the "http" scheme makes use of an authority component, has a
default port of "80", and defines an empty path to be equivalent to
"/", the following four URIs are equivalent:
In general, a URI that uses the generic syntax for authority with an
empty path should be normalized to a path of "/". Likewise, an
explicit ":port", for which the port is empty or the default for the
scheme, is equivalent to one where the port and its ":" delimiter are
elided and thus should be removed by scheme-based normalization. For
example, the second URI above is the normal form for the "http"
scheme.
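These two "http" rules can be sketched in Python as follows. The
splitting pattern, the DEFAULT_PORTS table, and the function name
normalize_http_like are assumptions for illustration; only the
"http" entry comes from the text above, and URIs of other schemes are
returned unchanged.

```python
import re

# Assumed pattern: scheme=$2, authority=$4, path=$5, "?query"=$6, "#frag"=$8.
_URI = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")
DEFAULT_PORTS = {"http": "80"}       # illustrative table


def normalize_http_like(uri: str) -> str:
    m = _URI.match(uri)
    scheme, authority, path = m.group(2), m.group(4), m.group(5)
    if scheme is None or authority is None:
        return uri
    default = DEFAULT_PORTS.get(scheme.lower())
    if default is None:
        return uri                   # no scheme-specific rules known
    host, colon, port = authority.rpartition(":")
    if colon and port in ("", default):
        authority = host             # drop ":" plus empty or default port
    if path == "":
        path = "/"                   # empty path is equivalent to "/"
    # The "?" and "#" delimiters are kept even when their component is empty.
    return f"{scheme}://{authority}{path}{m.group(6) or ''}{m.group(8) or ''}"
```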
Another case where normalization varies by scheme is in the handling
of an empty authority component or empty host subcomponent. For many
scheme specifications, an empty authority or host is considered an
error; for others, it is considered equivalent to "localhost" or the
end-user's host. When a scheme defines a default for authority and a
URI reference to that default is desired, the reference should be
normalized to an empty authority for the sake of uniformity, brevity,
and internationalization. If, however, either the userinfo or port
subcomponents are non-empty, then the host should be given explicitly
even if it matches the default.
Normalization should not remove delimiters when their associated
component is empty unless licensed to do so by the scheme
specification. For example, the URI "http://example.com/?" cannot be
assumed to be equivalent to any of the examples above. Likewise, the
presence or absence of delimiters within a userinfo subcomponent is
usually significant to its interpretation. The fragment component is
not subject to any scheme-based normalization; thus, two URIs that
differ only by the suffix "#" are considered different regardless of
the scheme.
Some schemes define additional subcomponents that consist of case-
insensitive data, giving an implicit license to normalizers to
convert this data to a common case (e.g., all lowercase). For
example, URI schemes that define a subcomponent of path to contain an
Internet hostname, such as the "mailto" URI scheme, cause that
subcomponent to be case-insensitive and thus subject to case
normalization (e.g., "mailto:Joe@Example.COM" is equivalent to
"mailto:Joe@example.com", even though the generic syntax considers
the path component to be case-sensitive).
Other scheme-specific normalizations are possible.
6.2.4. Protocol-Based Normalization
Substantial effort to reduce the incidence of false negatives is
often cost-effective for web spiders. Therefore, they implement even
more aggressive techniques in URI comparison. For example, if they
observe that a URI such as
redirects to a URI differing only in the trailing slash
they will likely regard the two as equivalent in the future. This
kind of technique is only appropriate when equivalence is clearly
indicated by both the result of accessing the resources and the
common conventions of their scheme's dereference algorithm (in this
case, use of redirection by HTTP origin servers to avoid problems
with relative references).
7. Security Considerations
A URI does not in itself pose a security threat. However, as URIs
are often used to provide a compact set of instructions for access to
network resources, care must be taken to properly interpret the data
within a URI, to prevent that data from causing unintended access,
and to avoid including data that should not be revealed in plain
text.
7.1. Reliability and Consistency
There is no guarantee that once a URI has been used to retrieve
information, the same information will be retrievable by that URI in
the future. Nor is there any guarantee that the information
retrievable via that URI in the future will be observably similar to
that retrieved in the past. The URI syntax does not constrain how a
given scheme or authority apportions its namespace or maintains it
over time. Such guarantees can only be obtained from the person(s)
controlling that namespace and the resource in question. A specific
URI scheme may define additional semantics, such as name persistence,
if those semantics are required of all naming authorities for that
scheme.
7.2. Malicious Construction
It is sometimes possible to construct a URI so that an attempt to
perform a seemingly harmless, idempotent operation, such as the
retrieval of a representation, will in fact cause a possibly damaging
remote operation. The unsafe URI is typically constructed by
specifying a port number other than that reserved for the network
protocol in question. The client unwittingly contacts a site running
a different protocol service, and data within the URI contains
instructions that, when interpreted according to this other protocol,
cause an unexpected operation. A frequent example of such abuse has
been the use of a protocol-based scheme with a port component of
"25", thereby fooling user agent software into sending an unintended
or impersonating message via an SMTP server.
Applications should prevent dereference of a URI that specifies a TCP
port number within the "well-known port" range (0 - 1023) unless the
protocol being used to dereference that URI is compatible with the
protocol expected on that well-known port. Although IANA maintains a
registry of well-known ports, applications should make such
restrictions user-configurable to avoid preventing the deployment of
new services.
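A deliberately simplified Python sketch of such a check follows. It
treats "compatible" as "the port is the scheme's own registered
default", which is narrower than the text requires; the
EXPECTED_PORT table, the user_allowed override set, and the function
name may_dereference are all hypothetical.

```python
# Illustrative defaults; a real application would consult a fuller table.
EXPECTED_PORT = {"http": 80, "https": 443, "ftp": 21}


def may_dereference(scheme, port, user_allowed=frozenset()):
    """Return True if dereferencing scheme://host:port looks acceptable."""
    scheme = scheme.lower()
    if port is None or port > 1023:
        return True                          # outside the well-known range
    if EXPECTED_PORT.get(scheme) == port:
        return True                          # port matches the protocol
    return (scheme, port) in user_allowed    # user-configured exception
```

With these assumptions, an "http" URI naming port 25 is refused
unless the user has explicitly allowed that combination.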
When a URI contains percent-encoded octets that match the delimiters
for a given resolution or dereference protocol (for example, CR and
LF characters for the TELNET protocol), these percent-encodings must
not be decoded before transmission across that protocol. Transfer of
the percent-encoding, which might violate the protocol, is less
harmful than allowing decoded octets to be interpreted as additional
operations or parameters, perhaps triggering an unexpected and
possibly harmful remote operation.
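As an illustration, the sketch below decodes percent-encoded octets
while leaving encoded any octet that the target protocol treats as a
delimiter. The function name decode_except and the choice of CR and
LF as the protected set are assumptions for the example.

```python
import re

_PCT = re.compile(rb"%([0-9A-Fa-f]{2})")


def decode_except(data: bytes, protected: bytes) -> bytes:
    """Decode percent-encodings, except octets listed in `protected`."""
    def replace(match):
        octet = bytes([int(match.group(1), 16)])
        if octet in protected:
            return match.group(0)    # leave the triplet as it is
        return octet
    return _PCT.sub(replace, data)


# CR and LF stay encoded; the space is decoded.
wire = decode_except(b"a%20b%0D%0Ac", protected=b"\r\n")
```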
7.3. Back-End Transcoding
When a URI is dereferenced, the data within it is often parsed by
both the user agent and one or more servers. In HTTP, for example, a
typical user agent will parse a URI into its five major components,
access the authority's server, and send it the data within the
authority, path, and query components. A typical server will take
that information, parse the path into segments and the query into
key/value pairs, and then invoke implementation-specific handlers to
respond to the request. As a result, a common security concern for
server implementations that handle a URI, either as a whole or split
into separate components, is proper interpretation of the octet data
represented by the characters and percent-encodings within that URI.
Percent-encoded octets must be decoded at some point during the
dereference process. Applications must split the URI into its
components and subcomponents prior to decoding the octets, as
otherwise the decoded octets might be mistaken for delimiters.
Security checks of the data within a URI should be applied after
decoding the octets. Note, however, that the "%00" percent-encoding
(NUL) may require special handling and should be rejected if the
application is not expecting to receive raw data within a component.
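The ordering requirement can be illustrated with a query component.
The sketch below assumes the common (but scheme- and
application-specific) convention of "&"-separated "key=value" fields;
the function name parse_query is hypothetical, and it rejects "%00"
outright on the assumption that no raw data is expected.

```python
from urllib.parse import unquote


def parse_query(query: str):
    """Split on delimiters first, then decode each piece."""
    pairs = []
    for field in query.split("&"):
        if "%00" in field:
            raise ValueError("NUL octet not expected in query data")
        key, _, value = field.partition("=")
        pairs.append((unquote(key), unquote(value)))
    return pairs


query = "a=1%26b%3D2&c=3"
correct = parse_query(query)          # two fields
wrong = unquote(query).split("&")     # decoding first invents a third field
```

Decoding before splitting turns the encoded "&" inside the first
value into a delimiter, which is exactly the confusion the text warns
against.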
Special care should be taken when the URI path interpretation process
involves the use of a back-end file system or related system
functions. File systems typically assign an operational meaning to
special characters, such as the "/", "\", ":", "[", and "]"
characters, and to special device names like ".", "..", "...", "aux",
"lpt", etc. In some cases, merely testing for the existence of such
a name will cause the operating system to pause or invoke unrelated
system calls, leading to significant security concerns regarding
denial of service and unintended data transfer. It would be
impossible for this specification to list all such significant
characters and device names. Implementers should research the
reserved names and characters for the types of storage device that
may be attached to their applications and restrict the use of data
obtained from URI components accordingly.
7.4. Rare IP Address Formats
Although the URI syntax for IPv4address only allows the common
dotted-decimal form of IPv4 address literal, many implementations
that process URIs make use of platform-dependent system routines,
such as gethostbyname() and inet_aton(), to translate the string
literal to an actual IP address. Unfortunately, such system routines
often allow and process a much larger set of formats than those
described in Section 3.2.2.
For example, many implementations allow dotted forms of three
numbers, wherein the last part is interpreted as a 16-bit quantity
and placed in the right-most two bytes of the network address (e.g.,
a Class B network). Likewise, a dotted form of two numbers means
that the last part is interpreted as a 24-bit quantity and placed in
the right-most three bytes of the network address (Class A), and a
single number (without dots) is interpreted as a 32-bit quantity and
stored directly in the network address. Adding further to the
confusion, some implementations allow each dotted part to be
interpreted as decimal, octal, or hexadecimal, as specified in the C
language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0
implies octal; otherwise, the number is interpreted as decimal).
These additional IP address formats are not allowed in the URI syntax
due to differences between platform implementations. However, they
can become a security concern if an application attempts to filter
access to resources based on the IP address in string literal format.
If this filtering is performed, literals should be converted to
numeric form and filtered based on the numeric value, and not on a
prefix or suffix of the string form.
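The sketch below shows one way to do this without relying on
platform routines: it parses the legacy forms described above into a
32-bit value and filters on that value. The function names and the
blocked network 127.0.0.0/8 are assumptions for the example, and
actual platform parsers may differ in detail.

```python
import ipaddress


def _parse_part(text: str) -> int:
    if not (text.isascii() and text.isalnum()):
        raise ValueError(f"bad address part: {text!r}")
    if text[:2].lower() == "0x":
        return int(text[2:], 16)         # leading 0x or 0X: hexadecimal
    if len(text) > 1 and text[0] == "0":
        return int(text[1:], 8)          # leading 0: octal
    return int(text, 10)


def legacy_ipv4_to_int(literal: str) -> int:
    parts = [_parse_part(p) for p in literal.split(".")]
    if not 1 <= len(parts) <= 4:
        raise ValueError("too many parts")
    *head, last = parts
    last_bits = 8 * (5 - len(parts))     # 4 parts: 8 bits ... 1 part: 32 bits
    if any(p > 0xFF for p in head) or last >= 1 << last_bits:
        raise ValueError("part out of range")
    value = 0
    for p in head:
        value = (value << 8) | p
    return (value << last_bits) | last


def is_blocked(literal: str, network: str = "127.0.0.0/8") -> bool:
    address = ipaddress.ip_address(legacy_ipv4_to_int(literal))
    return address in ipaddress.ip_network(network)
```

A string-prefix test for "127." would miss "0x7f.1" and
"017700000001", whereas the numeric test catches both.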
7.5. Sensitive Information
URI producers should not provide a URI that contains a username or
password that is intended to be secret. URIs are frequently
displayed by browsers, stored in clear text bookmarks, and logged by
user agent history and intermediary applications (proxies). A
password appearing within the userinfo component is deprecated and
should be considered an error (or simply ignored) except in those
rare cases where the 'password' parameter is intended to be public.
7.6. Semantic Attacks
Because the userinfo subcomponent is rarely used and appears before
the host in the authority component, it can be used to construct a
URI intended to mislead a human user by appearing to identify one
(trusted) naming authority while actually identifying a different
authority hidden behind the noise. For example
might lead a human user to assume that the host is 'cnn.example.com',
whereas it is actually '10.0.0.1'. Note that a misleading userinfo
subcomponent could be much longer than the example above.
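A user agent can separate the two parts before display. The sketch
below uses Python's urllib.parse.urlsplit; the URI in it is a
hypothetical one built for illustration from the host names mentioned
above, and the function name authority_parts is likewise
hypothetical.

```python
from urllib.parse import urlsplit


def authority_parts(uri: str):
    """Return (userinfo name, host) so that each can be rendered distinctly."""
    parts = urlsplit(uri)
    return parts.username, parts.hostname


# Hypothetical misleading URI: the text before "@" is userinfo, not the host.
user, host = authority_parts("http://cnn.example.com@10.0.0.1/top_story.htm")
```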
A misleading URI, such as that above, is an attack on the user's
preconceived notions about the meaning of a URI rather than an attack
on the software itself. User agents may be able to reduce the impact
of such attacks by distinguishing the various components of the URI
when they are rendered, such as by using a different color or tone to
render userinfo if any is present, though there is no panacea. More
information on URI-based semantic attacks can be found in [Siedzik].
8. IANA Considerations
URI scheme names, as defined by <scheme> in Section 3.1, form a
registered namespace that is managed by IANA according to the
procedures defined in [BCP35]. No IANA actions are required by this
document.
This specification is derived from RFC 2396 [RFC2396], RFC 1808
[RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those
documents still apply. It also incorporates the update (with
corrections) for IPv6 literals in the host syntax, as defined by
Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in
[RFC2732]. In addition, contributions by Gisle Aas, Reese Anschultz,
Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll,
Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin
Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond,
Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael
Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew
Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert,
Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai
Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne,
Stuart Williams, and Henry Zongaro are gratefully acknowledged.
10.1. Normative References
[ASCII] American National Standards Institute, "Coded Character
Set -- 7-bit American Standard Code for Information
Interchange", ANSI X3.4, 1986.
[RFC2234] Crocker, D. and P. Overell, "Augmented BNF for Syntax
Specifications: ABNF", RFC 2234, November 1997.
[STD63] Yergeau, F., "UTF-8, a transformation format of
ISO 10646", STD 63, RFC 3629, November 2003.
[UCS] International Organization for Standardization,
"Information Technology - Universal Multiple-Octet Coded
Character Set (UCS)", ISO/IEC 10646:2003, December 2003.
10.2. Informative References
[BCP19] Freed, N. and J. Postel, "IANA Charset Registration
Procedures", BCP 19, RFC 2978, October 2000.
[BCP35] Petke, R. and I. King, "Registration Procedures for URL
Scheme Names", BCP 35, RFC 2717, November 1999.
[RFC0952] Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
host table specification", RFC 952, October 1985.
[RFC1034] Mockapetris, P., "Domain names - concepts and facilities",
STD 13, RFC 1034, November 1987.
[RFC1123] Braden, R., "Requirements for Internet Hosts - Application
and Support", STD 3, RFC 1123, October 1989.
[RFC1535] Gavron, E., "A Security Problem and Proposed Correction
With Widely Deployed DNS Software", RFC 1535,
[RFC1630] Berners-Lee, T., "Universal Resource Identifiers in WWW: A
Unifying Syntax for the Expression of Names and Addresses
of Objects on the Network as used in the World-Wide Web",
RFC 1630, June 1994.
[RFC1736] Kunze, J., "Functional Recommendations for Internet
Resource Locators", RFC 1736, February 1995.
[RFC1737] Sollins, K. and L. Masinter, "Functional Requirements for
Uniform Resource Names", RFC 1737, December 1994.
[RFC1738] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
Resource Locators (URL)", RFC 1738, December 1994.
[RFC1808] Fielding, R., "Relative Uniform Resource Locators",
RFC 1808, June 1995.
[RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail
Extensions (MIME) Part Two: Media Types", RFC 2046,
[RFC2141] Moats, R., "URN Syntax", RFC 2141, May 1997.
[RFC2396] Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
Resource Identifiers (URI): Generic Syntax", RFC 2396,
[RFC2518] Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D.
Jensen, "HTTP Extensions for Distributed Authoring --
WEBDAV", RFC 2518, February 1999.
[RFC2557] Palme, J., Hopmann, A., and N. Shelness, "MIME
Encapsulation of Aggregate Documents, such as HTML
(MHTML)", RFC 2557, March 1999.
[RFC2718] Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke,
"Guidelines for new URL Schemes", RFC 2718, November 1999.
[RFC2732] Hinden, R., Carpenter, B., and L. Masinter, "Format for
Literal IPv6 Addresses in URL's", RFC 2732, December 1999.
[RFC3305] Mealling, M. and R. Denenberg, "Report from the Joint
W3C/IETF URI Planning Interest Group: Uniform Resource
Identifiers (URIs), URLs, and Uniform Resource Names
(URNs): Clarifications and Recommendations", RFC 3305,
[RFC3490] Faltstrom, P., Hoffman, P., and A. Costello,
"Internationalizing Domain Names in Applications (IDNA)",
RFC 3490, March 2003.
[RFC3513] Hinden, R. and S. Deering, "Internet Protocol Version 6
(IPv6) Addressing Architecture", RFC 3513, April 2003.
[Siedzik] Siedzik, R., "Semantic Attacks: What's in a URL?",
April 2001, <http://www.giac.org/practical/gsec/
12 3 4 5 6 7 8 9
The numbers in the second line above are only to assist readability;
they indicate the reference points for each subexpression (i.e., each
paired parenthesis). We refer to the value matched for subexpression
<n> as $<n>. For example, matching the above expression to
results in the following subexpression matches:
$1 = http:
$2 = http
$3 = //www.ics.uci.edu
$4 = www.ics.uci.edu
$5 = /pub/ietf/uri/
$6 = <undefined>
$7 = <undefined>
$8 = #Related
$9 = Related
where <undefined> indicates that the component is not present, as is
the case for the query component in the above example. Therefore, we
can determine the value of the five components as
scheme = $2
authority = $4
path = $5
query = $7
fragment = $9
Going in the opposite direction, we can recreate a URI reference from
its components by using the algorithm of Section 5.3.
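The subexpression matches listed above can be reproduced with the
following Python sketch. Two things in it are assumptions: the
pattern, which is the widely used nine-group expression for URI
references and is consistent with the $1 to $9 values shown, and the
example reference, which is reassembled from those values. The
function name split_uri_reference is hypothetical.

```python
import re

URI_REFERENCE = re.compile(
    r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?"
)


def split_uri_reference(reference: str) -> dict:
    m = URI_REFERENCE.match(reference)
    return {
        "scheme": m.group(2),
        "authority": m.group(4),
        "path": m.group(5),
        "query": m.group(7),       # None when the component is not present
        "fragment": m.group(9),
    }


match = URI_REFERENCE.match("http://www.ics.uci.edu/pub/ietf/uri/#Related")
components = split_uri_reference("http://www.ics.uci.edu/pub/ietf/uri/#Related")
```

In Python, a group that did not participate in the match is reported
as None, which plays the role of <undefined> above.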
Appendix C. Delimiting a URI in Context
URIs are often transmitted through formats that do not provide a
clear context for their interpretation. For example, there are many
occasions when a URI is included in plain text; examples include text
sent in email, USENET news, and on printed paper. In such cases, it
is important to be able to delimit the URI from the rest of the text,
and in particular from punctuation marks that might be mistaken for
part of the URI.
In practice, URIs are delimited in a variety of ways, but usually
within double-quotes "http://example.com/", angle brackets
<http://example.com/>, or just by using whitespace:
These wrappers do not form part of the URI.
In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
have to be added to break a long URI across lines. The whitespace
should be ignored when the URI is extracted.
No whitespace should be introduced after a hyphen ("-") character.
Because some typesetters and printers may (erroneously) introduce a
hyphen at the end of line when breaking it, the interpreter of a URI
containing a line break immediately after a hyphen should ignore all
whitespace around the line break and should be aware that the hyphen
may or may not actually be part of the URI.
Using <> angle brackets around each URI is especially recommended as
a delimiting style for a reference that contains embedded whitespace.
The prefix "URL:" (with or without a trailing space) was formerly
recommended as a way to help distinguish a URI from other bracketed
designators, though it is not commonly used in practice and is no
longer recommended.
For robustness, software that accepts user-typed URIs should attempt
to recognize and strip both delimiters and embedded whitespace.
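A minimal Python sketch of such clean-up follows. It handles only the
wrappers named in this appendix (double-quotes, angle brackets, and
the "URL:" prefix) and does not attempt the hyphen heuristics; the
function name clean_typed_uri is hypothetical.

```python
import re


def clean_typed_uri(text: str) -> str:
    s = text.strip()
    # Strip one layer of <...> or "..." wrapping.
    if len(s) >= 2 and ((s[0] == "<" and s[-1] == ">") or s[0] == s[-1] == '"'):
        s = s[1:-1]
    # Strip the formerly recommended "URL:" prefix, if present.
    if s[:4].upper() == "URL:":
        s = s[4:]
    # Remove embedded whitespace introduced by line breaking.
    return re.sub(r"\s+", "", s)
```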
For example, the text
Yes, Jim, I found it under "http://www.w3.org/Addressing/",
but you can probably pick it up from <ftp://foo.example.
com/rfc/>. Note the warning in <http://www.ics.uci.edu/pub/
contains the URI references
Appendix D. Changes from RFC 2396
An ABNF rule for URI has been introduced to correspond to one common
usage of the term: an absolute URI with optional fragment.
IPv6 (and later) literals have been added to the list of possible
identifiers for the host portion of an authority component, as
described by [RFC2732], with the addition of "[" and "]" to the
reserved set and a version flag to anticipate future versions of IP
literals. Square brackets are now specified as reserved within the
authority component and are not allowed outside their use as
delimiters for an IP literal within host. In order to make this
change without changing the technical definition of the path, query,
and fragment components, those rules were redefined to directly
specify the characters allowed.
As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
address, which, unfortunately, lacks an ABNF description of
IPv6address, we created a new ABNF rule for IPv6address that matches
the text representations defined by Section 2.2 of [RFC3513].
Likewise, the definition of IPv4address has been improved in order to
limit each decimal octet to the range 0-255.
Section 6, on URI normalization and comparison, has been completely
rewritten and extended by using input from Tim Bray and discussion
within the W3C Technical Architecture Group.
The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
[RFC2234]. This change required all rule names that formerly
included underscore characters to be renamed with a dash instead. In
addition, a number of syntax rules have been eliminated or simplified
to make the overall grammar more comprehensible. Specifications that
refer to the obsolete grammar rules may be understood by replacing
those rules according to the following table:
+---------------+---------------------------------------------+
| obsolete rule | translation                                 |
+---------------+---------------------------------------------+
| absoluteURI   | absolute-URI                                |
| relativeURI   | relative-part [ "?" query ]                 |
| hier_part     | ( "//" authority path-abempty /             |
|               |     path-absolute ) [ "?" query ]           |
|               |                                             |
| opaque_part   | path-rootless [ "?" query ]                 |
| net_path      | "//" authority path-abempty                 |
| abs_path      | path-absolute                               |
| rel_path      | path-rootless                               |
| rel_segment   | segment-nz-nc                               |
| reg_name      | reg-name                                    |
| server        | authority                                   |
| hostport      | host [ ":" port ]                           |
| hostname      | reg-name                                    |
| path_segments | path-abempty                                |
| param         | *<pchar excluding ";">                      |
|               |                                             |
| uric          | unreserved / pct-encoded / ";" / "?" / ":"  |
|               |   / "@" / "&" / "=" / "+" / "$" / "," / "/" |
|               |                                             |
| uric_no_slash | unreserved / pct-encoded / ";" / "?" / ":"  |
|               |   / "@" / "&" / "=" / "+" / "$" / ","       |
|               |                                             |
| mark          | "-" / "_" / "." / "!" / "~" / "*" / "'"     |
|               |   / "(" / ")"                               |
|               |                                             |
| escaped       | pct-encoded                                 |
| hex           | HEXDIG                                      |
| alphanum      | ALPHA / DIGIT                               |
+---------------+---------------------------------------------+
Use of the above obsolete rules for the definition of scheme-specific
syntax is deprecated.
Section 2, on characters, has been rewritten to explain what
characters are reserved, when they are reserved, and why they are
reserved, even when they are not used as delimiters by the generic
syntax. The mark characters that are typically unsafe to decode,
including the exclamation mark ("!"), asterisk ("*"), single-quote
("'"), and open and close parentheses ("(" and ")"), have been moved
to the reserved set in order to clarify the distinction between
reserved and unreserved and, hopefully, to answer the most common
question of scheme designers. Likewise, the section on
percent-encoded characters has been rewritten, and URI normalizers
are now given license to decode any percent-encoded octets
corresponding to unreserved characters. In general, the terms
"escaped" and "unescaped" have been replaced with "percent-encoded"
and "decoded", respectively, to reduce confusion with other forms of
escape mechanisms.
The ABNF for URI and URI-reference has been redesigned to make them
more friendly to LALR parsers and to reduce complexity. As a result,
the layout form of syntax description has been removed, along with
the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path,
path_segments, rel_segment, and mark rules. All references to
"opaque" URIs have been replaced with a better description of how the
path component may be opaque to hierarchy. The relativeURI rule has
been replaced with relative-ref to avoid unnecessary confusion over
whether they are a subset of URI. The ambiguity regarding the
parsing of URI-reference as a URI or a relative-ref with a colon in
the first segment has been eliminated through the use of five
separate path matching rules.
The fragment identifier has been moved back into the section on
generic syntax components and within the URI and relative-ref rules,
though it remains excluded from absolute-URI. The number sign ("#")
character has been moved back to the reserved set as a result of
reintegrating the fragment syntax.
The ABNF has been corrected to allow the path component to be empty.
This also allows an absolute-URI to consist of nothing after the
"scheme:", as is present in practice with the "dav:" namespace
[RFC2518] and with the "about:" scheme used internally by many WWW
browser implementations. The ambiguity regarding the boundary
between authority and path has been eliminated through the use of
five separate path matching rules.
Registry-based naming authorities that use the generic syntax are now
defined within the host rule. This change allows current
implementations, where whatever name provided is simply fed to the
local name resolution mechanism, to be consistent with the
specification. It also removes the need to re-specify DNS name
formats here. Furthermore, it allows the host component to contain
percent-encoded octets, which is necessary to enable
internationalized domain names to be provided in URIs, processed in
their native character encodings at the application layers above URI
processing, and passed to an IDNA library as a registered name in the
UTF-8 character encoding. The server, hostport, hostname,
domainlabel, toplabel, and alphanum rules have been removed.
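The handling of a percent-encoded registered name described above might be sketched as follows. This is an illustration under stated assumptions rather than a required procedure: the function name host_to_ascii is hypothetical, the percent-encoded octets are assumed to be UTF-8, and Python's built-in "idna" codec (which implements the IDNA 2003 rules) stands in for "an IDNA library".

```python
from urllib.parse import unquote


def host_to_ascii(host: str) -> str:
    """Convert a percent-encoded registered name to its ASCII (IDNA) form.

    Assumes the percent-encoded octets represent UTF-8; decoding fails
    with an exception if they do not.
    """
    # Step 1: undo percent-encoding, interpreting the octets as UTF-8.
    unicode_host = unquote(host, encoding="utf-8", errors="strict")
    # Step 2: hand the Unicode name to the IDNA codec (stdlib, IDNA 2003).
    return unicode_host.encode("idna").decode("ascii")


if __name__ == "__main__":
    print(host_to_ascii("b%C3%BCcher.example"))
    print(host_to_ascii("example.com"))
```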
The resolving relative references algorithm of [RFC2396] has been
rewritten with pseudocode for this revision to improve clarity and
fix the following issues:
o [RFC2396] section 5.2, step 6a, failed to account for a base URI
with no path.
o Restored the behavior of [RFC1808] where, if the reference
contains an empty path and a defined query component, the target
URI inherits the base URI's path component.
o The determination of whether a URI reference is a same-document
reference has been decoupled from the URI parser, simplifying the
URI processing interface within applications in a way consistent
with the internal architecture of deployed URI processing
implementations. The determination is now based on comparison to
the base URI after transforming a reference to absolute form,
rather than on the format of the reference itself. This change
may result in more references being considered "same-document"
under this specification than there would be under the rules given
in RFC 2396, especially when normalization is used to reduce
aliases. However, it does not change the status of existing
same-document references.
o Separated the path merge routine into two routines: merge, for
describing combination of the base URI path with a relative-path
reference, and remove_dot_segments, for describing how to remove
the special "." and ".." segments from a composed path. The
remove_dot_segments algorithm is now applied to all URI reference
paths in order to match common implementations and to improve the
normalization of URIs in practice. This change only impacts the
parsing of abnormal references and same-scheme references wherein
the base URI has a non-hierarchical path.
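The changes listed above can be sketched in code. The following is a minimal illustration, not the normative pseudocode: merge and remove_dot_segments follow the routines named in the text, while split, resolve, and is_same_document are hypothetical helper names. It assumes strict parsing and a well-formed base URI, and it reuses the component-splitting regular expression from Appendix B.

```python
import re

URI_RE = re.compile(r"^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?")


def split(uri):
    """Return (scheme, authority, path, query, fragment); None = undefined."""
    m = URI_RE.match(uri)
    return m.group(2), m.group(4), m.group(5), m.group(7), m.group(9)


def remove_dot_segments(path):
    """Remove "." and ".." segments from a composed path."""
    output = []  # completed segments, each with its leading "/" if any
    while path:
        if path.startswith("../"):
            path = path[3:]
        elif path.startswith("./"):
            path = path[2:]
        elif path.startswith("/./"):
            path = path[2:]
        elif path == "/.":
            path = "/"
        elif path.startswith("/../") or path == "/..":
            path = "/" + path[4:]  # keep the slash that follows ".."
            if output:
                output.pop()       # drop the segment that ".." cancels
        elif path in (".", ".."):
            path = ""
        else:
            cut = path.find("/", 1)  # end of the first segment
            cut = len(path) if cut == -1 else cut
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)


def merge(base_has_authority, base_path, ref_path):
    """Combine the base path with a relative-path reference."""
    if base_has_authority and base_path == "":
        return "/" + ref_path  # base URI with an authority and no path
    return base_path[: base_path.rfind("/") + 1] + ref_path


def resolve(base, ref):
    """Transform a reference into a target URI relative to base."""
    b_scheme, b_auth, b_path, b_query, _ = split(base)
    r_scheme, r_auth, r_path, r_query, r_frag = split(ref)
    if r_scheme is not None:
        scheme, auth, path, query = r_scheme, r_auth, remove_dot_segments(r_path), r_query
    elif r_auth is not None:
        scheme, auth, path, query = b_scheme, r_auth, remove_dot_segments(r_path), r_query
    elif r_path == "":
        # Empty path: inherit the base path, and the base query only if
        # the reference has no query of its own.
        scheme, auth, path = b_scheme, b_auth, b_path
        query = r_query if r_query is not None else b_query
    else:
        if not r_path.startswith("/"):
            r_path = merge(b_auth is not None, b_path, r_path)
        scheme, auth, path, query = b_scheme, b_auth, remove_dot_segments(r_path), r_query
    out = "" if scheme is None else scheme + ":"
    out += "" if auth is None else "//" + auth
    out += path
    out += "" if query is None else "?" + query
    out += "" if r_frag is None else "#" + r_frag
    return out


def is_same_document(base, ref):
    """Compare the resolved target to the base, ignoring fragments."""
    def strip(uri):
        return uri.split("#", 1)[0]
    return strip(resolve(base, ref)) == strip(base)
```

With the base URI "http://a/b/c/d;p?q", the reference "?y" keeps the base path, a base of "http://a" with reference "g" yields a path of "/g", and a reference such as "./d;p?q#s" is treated as same-document because the decision is made after resolution.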
World Wide Web Consortium
Massachusetts Institute of Technology
77 Massachusetts Avenue
Cambridge, MA 02139
Roy T. Fielding
5251 California Ave., Suite 110
Irvine, CA 92617
Adobe Systems Incorporated
345 Park Ave
San Jose, CA 95110
Full Copyright Statement
Copyright (C) The Internet Society (2005).
This document is subject to the rights, licenses and restrictions
contained in BCP 78, and except as set forth therein, the authors
retain all their rights.
This document and the information contained herein are provided on an
"AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.
The IETF takes no position regarding the validity or scope of any
Intellectual Property Rights or other rights that might be claimed to
pertain to the implementation or use of the technology described in
this document or the extent to which any license under such rights
might or might not be available; nor does it represent that it has
made any independent effort to identify any such rights. Information
on the IETF's procedures with respect to rights in IETF Documents can
be found in BCP 78 and BCP 79.
Copies of IPR disclosures made to the IETF Secretariat and any
assurances of licenses to be made available, or the result of an
attempt made to obtain a general license or permission for the use of
such proprietary rights by implementers or users of this
specification can be obtained from the IETF on-line IPR repository at
The IETF invites any interested party to bring to its attention any
copyrights, patents or patent applications, or other proprietary
rights that may cover technology that may be required to implement
this standard. Please address the information to the IETF at ietf-
Funding for the RFC Editor function is currently provided by the