3.6. Possibilities for Registration
Possibilities for registration of subtags or information about
subtags include:
o Primary language subtags for languages not listed in ISO 639 that
are not variants of any listed or registered language MAY be
registered. At the time this document was created, there were no
examples of this form of subtag. Before attempting to register a
language subtag, there MUST be an attempt to register the language
with ISO 639. Subtags MUST NOT be registered for languages
defined by codes that exist in ISO 639-1, ISO 639-2, or ISO 639-3;
that are under consideration by the ISO 639 registration
authorities; or that have never been attempted for registration
with those authorities. If ISO 639 has previously rejected a
language for registration, it is reasonable to assume that there
must be additional, very compelling evidence of need before it
will be registered as a primary language subtag in the IANA
registry (to the extent that it is very unlikely that any subtags
will be registered of this type).
o Dialect or other divisions or variations within a language, its
orthography, writing system, regional or historical usage,
transliteration or other transformation, or distinguishing
variation MAY be registered as variant subtags. An example is the
'rozaj' subtag (the Resian dialect of Slovenian).
o The addition or maintenance of fields (generally of an
informational nature) in tag or subtag records as described in
Section 3.1 is allowed. Such changes are subject to the stability
provisions in Section 3.4. This includes 'Description',
'Comments', 'Deprecated', and 'Preferred-Value' fields for
obsolete or withdrawn codes, or the addition of 'Suppress-Script'
or 'Macrolanguage' fields to primary language subtags, as well as
other changes permitted by this document, such as the addition of
an appropriate 'Prefix' field to a variant subtag.
o The addition of records and related field value changes necessary
to reflect assignments made by ISO 639, ISO 15924, ISO 3166-1, and
UN M.49 as described in Section 3.4 is allowed.
Subtags proposed for registration that would cause all or part of a
grandfathered tag to become redundant but whose meaning conflicts
with or alters the meaning of the grandfathered tag MUST be rejected.
This document leaves the decision on what subtags or changes to
subtags are appropriate (or not) to the registration process
described in Section 3.5.
Note: Four-character primary language subtags are reserved to allow
for the possibility of alpha4 codes in some future addition to the
ISO 639 family of standards.
ISO 639 defines a registration authority for additions to and changes
in the list of languages in ISO 639. This agency is:
International Information Centre for Terminology (Infoterm)
Aichholzgasse 6/12, AT-1120
Phone: +43 1 26 75 35 Ext. 312 Fax: +43 1 216 32 72
ISO 639-2 defines a registration authority for additions to and
changes in the list of languages in ISO 639-2. This agency is:
Library of Congress
Network Development and MARC Standards Office
Washington, DC 20540, USA
Phone: +1 202 707 6237 Fax: +1 202 707 0115
ISO 639-3 defines a registration authority for additions to and
changes in the list of languages in ISO 639-3. This agency is:
ISO 639-3 Registrar
7500 W. Camp Wisdom Rd.
Dallas, TX 75236, USA
Phone: +1 972 708 7400, ext. 2293
Fax: +1 972 708 7546
ISO 639-5 defines a registration authority for additions to and
changes in the list of languages in ISO 639-5. This agency is the
same as for ISO 639-2 and is:
Library of Congress
Network Development and MARC Standards Office
Washington, DC 20540, USA
Phone: +1 202 707 6237
Fax: +1 202 707 0115
The maintenance agency for ISO 3166-1 (country codes) is:
ISO 3166 Maintenance Agency
c/o International Organization for Standardization
Case postale 56
CH-1211 Geneva 20, Switzerland
Phone: +41 22 749 72 33 Fax: +41 22 749 73 49
The registration authority for ISO 15924 (script codes) is:
Mountain View, CA 94039-1476, USA
The Statistics Division of the United Nations Secretariat maintains
the Standard Country or Area Codes for Statistical Use and can be
reached at:
Statistical Services Branch
United Nations, Room DC2-1620
New York, NY 10017, USA
URL: http://unstats.un.org/unsd/methods/m49/m49alpha.htm
3.7. Extensions and the Extensions Registry
Extension subtags are those introduced by single-character subtags
("singletons") other than 'x'. They are reserved for the generation
of identifiers that contain a language component and are compatible
with applications that understand language tags.
The structure and form of extensions are defined by this document so
that implementations can be created that are forward compatible with
applications that might be created using singletons in the future.
In addition, defining a mechanism for maintaining singletons will
lend stability to this document by reducing the likely need for
future revisions or updates.
Single-character subtags are assigned by IANA using the "IETF Review"
policy defined by [RFC5226]. This policy requires the development of
an RFC, which SHALL define the name, purpose, processes, and
procedures for maintaining the subtags. The maintaining or
registering authority, including name, contact email, discussion list
email, and URL location of the registry, MUST be indicated clearly in
the RFC. The RFC MUST specify or include each of the following:
o The specification MUST reference the specific version or revision
of this document that governs its creation and MUST reference this
section of this document.
o The specification and all subtags defined by the specification
MUST follow the ABNF and other rules for the formation of tags and
subtags as defined in this document. In particular, it MUST
specify that case is not significant and that subtags MUST NOT
exceed eight characters in length.
o The specification MUST specify a canonical representation.
o The specification of valid subtags MUST be available over the
Internet and at no cost.
o The specification MUST be in the public domain or available via a
royalty-free license acceptable to the IETF and specified in the
RFC.
o The specification MUST be versioned, and each version of the
specification MUST be numbered, dated, and stable.
o The specification MUST be stable. That is, extension subtags,
once defined by a specification, MUST NOT be retracted or change
in meaning in any substantial way.
o The specification MUST include, in a separate section, the
registration form reproduced in this section (below) to be used in
registering the extension upon publication as an RFC.
o IANA MUST be informed of changes to the contact information and
URL for the specification.
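The syntactic constraints above (a singleton other than 'x', case not significant, subtags of at most eight characters) can be checked mechanically. The following is a minimal sketch, not a normative validator. The function name is hypothetical, and the lower bound of two characters per extension subtag is an assumption taken from the ABNF elsewhere in this document rather than from this section.

```python
import re

# Singleton: one ASCII letter or digit other than 'x' (case is not significant).
# Each following subtag: 2 to 8 ASCII alphanumerics. The lower bound of 2 is
# an assumption based on the ABNF defined elsewhere in this document.
EXTENSION_RE = re.compile(r"[0-9a-wy-z](-[0-9a-z]{2,8})+", re.IGNORECASE)


def is_well_formed_extension(sequence: str) -> bool:
    """Return True if `sequence` is a singleton followed by valid subtags."""
    return EXTENSION_RE.fullmatch(sequence) is not None


if __name__ == "__main__":
    for candidate in ("r-extended-sequence", "x-private", "a-toolongsubtag"):
        print(candidate, is_well_formed_extension(candidate))
```

The check covers only the extension portion of a tag (for example, the "r-extended-sequence" part of "en-US-r-extended-sequence"); whether the singleton has actually been allocated is a registry question that this sketch does not address.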
IANA will maintain a registry of allocated single-character
(singleton) subtags. This registry MUST use the record-jar format
described by the ABNF in Section 3.1.1. Upon publication of an
extension as an RFC, the maintaining authority defined in the RFC
MUST forward this registration form to <firstname.lastname@example.org>, who MUST
forward the request to <email@example.com>. The maintaining authority of
the extension MUST maintain the accuracy of the record by sending an
updated full copy of the record to <firstname.lastname@example.org> with the subject
line "LANGUAGE TAG EXTENSION UPDATE" whenever content changes. Only
the 'Comments', 'Contact_Email', 'Mailing_List', and 'URL' fields MAY
be modified in these updates.
Failure to maintain this record, maintain the corresponding registry,
or meet other conditions imposed by this section of this document MAY
be appealed to the IESG [RFC2028] under the same rules as other IETF
decisions (see [RFC2026]) and MAY result in the authority to maintain
the extension being withdrawn or reassigned by the IESG.
Figure 6: Format of Records in the Language Tag Extensions Registry
'Identifier' contains the single-character subtag (singleton)
assigned to the extension. The Internet-Draft submitted to define
the extension SHOULD specify which letter or digit to use, although
the IESG MAY change the assignment when approving the RFC.
'Description' contains the name and description of the extension.
'Comments' is an OPTIONAL field and MAY contain a broader description
of the extension.
'Added' contains the date the extension's RFC was published in the
"full-date" format specified in [RFC3339]. For example: 2004-06-28
represents June 28, 2004, in the Gregorian calendar.
'RFC' contains the RFC number assigned to the extension.
'Authority' contains the name of the maintaining authority for the
extension.
'Contact_Email' contains the email address used to contact the
maintaining authority.
'Mailing_List' contains the URL or subscription email address of the
mailing list used by the maintaining authority.
'URL' contains the URL of the registry for this extension.
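As an illustration of the field descriptions above, the sketch below parses one record written as "Field: value" lines and reports obvious problems. It is a simplified reading, not the registry's parser: it does not handle folded continuation lines, it assumes every field except 'Comments' is required, and the sample record (including the RFC number 0000 and the example.org addresses) is entirely hypothetical.

```python
import re
from datetime import date

# Assumption: every field except the OPTIONAL 'Comments' must be present.
REQUIRED_FIELDS = ("Identifier", "Description", "Added", "RFC", "Authority",
                   "Contact_Email", "Mailing_List", "URL")


def parse_record(text: str) -> dict:
    """Split 'Field: value' lines into a dict (no folded-line support)."""
    fields = {}
    for line in text.strip().splitlines():
        name, sep, value = line.partition(":")  # split at the first colon only
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def check_record(fields: dict) -> list:
    """Return a list of human-readable problems; empty if none were found."""
    problems = [f"missing field: {name}"
                for name in REQUIRED_FIELDS if not fields.get(name)]
    # 'Identifier' is a single letter or digit other than 'x'.
    if not re.fullmatch(r"[0-9A-WY-Za-wy-z]", fields.get("Identifier", "")):
        problems.append("Identifier must be one letter or digit other than 'x'")
    # 'Added' uses the RFC 3339 full-date form, e.g. 2004-06-28.
    added = fields.get("Added", "")
    try:
        if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", added):
            raise ValueError(added)
        date.fromisoformat(added)  # rejects impossible dates such as 2004-13-01
    except ValueError:
        problems.append("Added must be a full-date such as 2004-06-28")
    return problems


# Hypothetical record, for illustration only.
SAMPLE_RECORD = """
Identifier: r
Description: Example extension
Added: 2004-06-28
RFC: 0000
Authority: Example Authority
Contact_Email: contact@example.org
Mailing_List: list@example.org
URL: https://example.org/registry
"""

if __name__ == "__main__":
    print(check_record(parse_record(SAMPLE_RECORD)))
```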
The determination of whether an Internet-Draft meets the above
conditions and the decision to grant or withhold such authority rests
solely with the IESG and is subject to the normal review and appeals
process associated with the RFC process.
Extension authors are strongly cautioned that many (including most
well-formed) processors will be unaware of any special relationships
or meaning inherent in the order of extension subtags. Extension
authors SHOULD avoid subtag relationships or canonicalization
mechanisms that interfere with matching or with length restrictions
that sometimes exist in common protocols where the extension is used.
In particular, applications MAY truncate the subtags in doing
matching or in fitting into limited lengths, so it is RECOMMENDED
that the most significant information be in the most significant
(left-most) subtags and that the specification gracefully handle
truncated subtags.
When a language tag is to be used in a specific, known protocol, it
is RECOMMENDED that the language tag not contain extensions not
supported by that protocol. In addition, note that some protocols
MAY impose upper limits on the length of the strings used to store or
transport the language tag.
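The kind of truncation that extension authors need to anticipate can be sketched as follows. This is one plausible strategy, assumed for illustration only: whole subtags are dropped from the right until the tag fits, and a singleton left dangling at the end is dropped as well. The function name and the length limits used in the example are hypothetical.

```python
def truncate_tag(tag: str, max_len: int) -> str:
    """Drop right-most subtags until the tag fits within max_len characters."""
    subtags = tag.split("-")
    # Remove whole subtags from the right; never split inside a subtag.
    while len(subtags) > 1 and len("-".join(subtags)) > max_len:
        subtags.pop()
    # A trailing single-character subtag would introduce nothing, so drop it.
    while len(subtags) > 1 and len(subtags[-1]) == 1:
        subtags.pop()
    return "-".join(subtags)


if __name__ == "__main__":
    print(truncate_tag("en-US-r-extended-sequence", 12))  # en-US
    print(truncate_tag("zh-Hant-TW", 8))                  # zh-Hant
```

Because the left-most subtags survive, a tag whose most significant information comes first degrades to a broader but still meaningful tag. If the first subtag alone exceeds the limit, this sketch returns it unchanged rather than producing an empty string.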
3.8. Update of the Language Subtag Registry
After the adoption of this document, the IANA Language Subtag
Registry needed an update so that it would contain the complete set
of subtags valid in a language tag. [RFC5645] describes the process
used to create this update.
Registrations that are in process under the rules defined in
[RFC4646] when this document is adopted MUST be completed under the
rules contained in this document.
3.9. Applicability of the Subtag Registry
The Language Subtag Registry is the source of data elements used to
construct language tags, following the rules described in this
document. Language tags are designed for indicating linguistic
attributes of various content, including not only text but also most
media formats, such as video or audio. They also form the basis for
language and locale negotiation in various protocols and APIs.
The registry is therefore applicable to many applications that need
some form of language identification, with these limitations:
o It is not designed to be the sole data source in the creation of a
language-selection user interface. For example, the registry does
not contain translations for subtag descriptions or for tags
composed from the subtags. Sources for localized data based on
the registry are generally available, notably [CLDR]. Nor does
the registry indicate which subtag combinations are particularly
useful or relevant.
o It does not provide information indicating relationships between
different languages, such as might be used in a user interface to
select language tags hierarchically, regionally, or on some other
basis.
o It does not supply information about potential overlap between
different language tags, as the notion of what constitutes a
language is not precise: several different language tags might be
reasonable choices for the same given piece of content.
o It does not contain information about appropriate fallback choices
when performing language negotiation. A good fallback language
might be linguistically unrelated to the specified language. The
fact that one language is often used as a fallback language for
another is usually a result of outside factors, such as geography,
history, or culture -- factors that might not apply in all cases.
For example, most people who use Breton (a Celtic language used in
the Northwest of France) would probably prefer to be served French
(a Romance language) if Breton isn't available.
4. Formation and Processing of Language Tags
This section addresses how to use the information in the registry
with the tag syntax to choose, form, and process language tags.
4.1. Choice of Language Tag
The guiding principle in forming language tags is to "tag content
wisely." Sometimes there is a choice between several possible tags
for the same content. The choice of which tag to use depends on the
content and application in question, and some amount of judgment
might be necessary when selecting a tag.
Interoperability is best served when the same language tag is used
consistently to represent the same language. If an application has
requirements that make the rules here inapplicable, then that
application risks damaging interoperability. It is strongly
RECOMMENDED that users not define their own rules for language tag
choice.
Standards, protocols, and applications that reference this document
normatively but apply different rules to the ones given in this
section MUST specify how language tag selection varies from the
guidelines given here.
To ensure consistent backward compatibility, this document contains
several provisions to account for potential instability in the
standards used to define the subtags that make up language tags.
These provisions mean that no valid language tag can become invalid,
nor will a language tag have a narrower scope in the future (it may
have a broader scope). The most appropriate language tag for a given
application or content item might evolve over time, but once applied,
the tag itself cannot become invalid or have its meaning wholly
change.
A subtag SHOULD only be used when it adds useful distinguishing
information to the tag. Extraneous subtags interfere with the
meaning, understanding, and processing of language tags. In
particular, users and implementations SHOULD follow the 'Prefix' and
'Suppress-Script' fields in the registry (defined in Section 3.1):
these fields provide guidance on when specific additional subtags
SHOULD be used or avoided in a language tag.
The choice of subtags used to form a language tag SHOULD follow these
guidelines:
1. Use as precise a tag as possible, but no more specific than is
justified. Avoid using subtags that are not important for
distinguishing content in an application.
* For example, 'de' might suffice for tagging an email written
in German, while "de-CH-1996" is probably unnecessarily
precise for such a task.
* Note that some subtag sequences might not represent the
language a casual user might expect. For example, the Swiss
German (Schweizerdeutsch) language is represented by "gsw-CH"
and not by "de-CH". This latter tag represents German ('de')
as used in Switzerland ('CH'), also known as Swiss High German
(Schweizer Hochdeutsch). Both are real languages, and
distinguishing between them could be important to an
application.
2. The script subtag SHOULD NOT be used to form language tags unless
the script adds some distinguishing information to the tag.
Script subtags were first formally defined in [RFC4646]. Their
use can affect matching and subtag identification for
implementations of [RFC1766] or [RFC3066] (which are obsoleted by
this document), as these subtags appear between the primary
language and region subtags. Some applications can benefit from
the use of script subtags in language tags, as long as the use is
consistent for a given context. Script subtags are never
appropriate for unwritten content (such as audio recordings).
The field 'Suppress-Script' in the primary or extended language
record in the registry indicates script subtags that do not add
distinguishing information for most applications; this field
defines when users SHOULD NOT include a script subtag with a
particular primary language subtag.
For example, if an implementation selects content using Basic
Filtering [RFC4647] (originally described in Section 14.4 of
[RFC2616]) and the user requested the language range "en-US",
content labeled "en-Latn-US" will not match the request and thus
not be selected. Therefore, it is important to know when script
subtags will customarily be used and when they ought not be used.
* The subtag 'Latn' should not be used with the primary language
'en' because nearly all English documents are written in the
Latin script and it adds no distinguishing information.
However, if a document were written in English mixing Latin
script with another script such as Braille ('Brai'), then it
might be appropriate to choose to indicate both scripts to aid
in content selection, such as the application of a style
sheet.
* When labeling content that is unwritten (such as a recording
of human speech), the script subtag should not be used, even
if the language is customarily written in several scripts.
Thus, the subtitles to a movie might use the tag "uz-Arab"
(Uzbek, Arabic script), but the audio track for the same
language would be tagged simply "uz". (The tag "uz-Zxxx"
could also be used where content is not written, as the subtag
'Zxxx' represents the "Code for unwritten documents".)
3. If a tag or subtag has a 'Preferred-Value' field in its registry
entry, then the value of that field SHOULD be used to form the
language tag in preference to the tag or subtag in which the
preferred value appears.
* For example, use 'jbo' for Lojban in preference to the
grandfathered tag "art-lojban".
4. Use subtags or sequences of subtags for individual languages in
preference to subtags for language collections. A "language
collection" is a group of languages that are descended from a
common ancestor, are spoken in the same geographical area, or are
otherwise related. Certain language collections are assigned
codes by [ISO639-5] (and some of these [ISO639-5] codes are also
defined as collections in [ISO639-2]). These codes are included
as primary language subtags in the registry. Subtags for a
language collection in the registry have a 'Scope' field with a
value of 'collection'. A subtag for a language collection is
always preferred to less specific alternatives such as 'mul' and
'und' (see below), and a subtag representing a language
collection MAY be used when more specific language information is
not available. However, most users and implementations do not
know there is a relationship between the collection and its
individual languages. In addition, the relationship between the
individual languages in the collection is not well defined; in
particular, the languages are usually not mutually intelligible.
Since the subtags are different, a request for the collection
will typically only produce items tagged with the collection's
subtag, not items tagged with subtags for the individual
languages contained in the collection.
* For example, collections are interpreted inclusively, so the
subtag 'gem' (Germanic languages) could, but SHOULD NOT, be
used with content that would be better tagged with "en"
(English), "de" (German), or "gsw" (Swiss German, Alemannic).
While 'gem' collects all of these (and other) languages, most
implementations will not match 'gem' to the individual
languages; thus, using the subtag will not produce the desired
result.
5. [ISO639-2] has defined several codes included in the subtag
registry that require additional care when choosing language
tags. In most of these cases, where omitting the language tag is
permitted, such omission is preferable to using these codes.
Language tags SHOULD NOT incorporate these subtags as a prefix,
unless the additional information conveys some value to the
application.
* The 'mul' (Multiple) primary language subtag identifies
content in multiple languages. This subtag SHOULD NOT be used
when a list of languages or individual tags for each content
element can be used instead. For example, the 'Content-
Language' header [RFC3282] allows a list of languages to be
used, not just a single language tag.
* The 'und' (Undetermined) primary language subtag identifies
linguistic content whose language is not determined. This
subtag SHOULD NOT be used unless a language tag is required
and language information is not available or cannot be
determined. Omitting the language tag (where permitted) is
preferred. The 'und' subtag might be useful for protocols
that require a language tag to be provided or where a primary
language subtag is required (such as in "und-Latn"). The
'und' subtag MAY also be useful when matching language tags in
certain situations.
* The 'zxx' (Non-Linguistic, Not Applicable) primary language
subtag identifies content for which a language classification
is inappropriate or does not apply. Some examples might
include instrumental or electronic music; sound recordings
consisting of nonverbal sounds; audiovisual materials with no
narration, dialog, printed titles, or subtitles; machine-
readable data files consisting of machine languages or
character codes; or programming source code.
* The 'mis' (Uncoded) primary language subtag identifies content
whose language is known but that does not currently have a
corresponding subtag. This subtag SHOULD NOT be used.
Because the addition of other codes in the future can render
its application invalid, it is inherently unstable and hence
incompatible with the stability goals of BCP 47. It is always
preferable to use other subtags: either 'und' or (with prior
agreement) private use subtags.
6. Use variant subtags sparingly and in the correct order. Most
variant subtags have one or more 'Prefix' fields in the registry
that express the list of subtags with which they are appropriate.
Variants SHOULD only be used with subtags that appear in one of
these 'Prefix' fields. If a variant lists a second variant in
one of its 'Prefix' fields, the first variant SHOULD appear
directly after the second variant in any language tag where both
occur. General purpose variants (those with no 'Prefix' fields
at all) SHOULD appear after any other variant subtags. Order any
remaining variants by placing the most significant subtag first.
If none of the subtags is more significant or no relationship can
be determined, alphabetize the subtags. Because variants are
very specialized, using many of them together generally makes the
tag so narrow as to override the additional precision gained.
Putting the subtags into another order interferes with
interoperability, as well as the overall interpretation of the
tag.
* The tag "en-scotland-fonipa" (English, Scottish dialect, IPA
phonetic transcription) is correctly ordered because
'scotland' has a 'Prefix' of "en", while 'fonipa' has no
'Prefix' field.
* The tag "sl-IT-rozaj-biske-1994" is correctly ordered: 'rozaj'
lists "sl" as its sole 'Prefix'; 'biske' lists "sl-rozaj" as
its sole 'Prefix'. The subtag '1994' has several prefixes,
including "sl-rozaj". However, it follows both 'rozaj' and
'biske' because one of its 'Prefix' fields is "sl-rozaj-biske".
7. The grandfathered tag "i-default" (Default Language) was
originally registered according to [RFC1766] to meet the needs of
[RFC2277]. It is not used to indicate a specific language, but
rather to identify the condition or content used where the
language preferences of the user cannot be established. It
SHOULD NOT be used except as a means of labeling the default
content for applications or protocols that require default
language content to be labeled with that specific tag. It MAY
also be used by an application or protocol to identify when the
default language content is being returned.
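Guideline 2 above notes that, under Basic Filtering [RFC4647], the range "en-US" does not match content labeled "en-Latn-US". The sketch below shows why: a range matches a tag only if it equals the tag or is a prefix of it that ends at a subtag boundary. The function name is hypothetical, and the code illustrates the matching rule only; it does not validate either input.

```python
def basic_filter_matches(language_range: str, tag: str) -> bool:
    """Return True if `tag` matches `language_range` under basic filtering."""
    rng, candidate = language_range.lower(), tag.lower()  # case-insensitive
    if rng == "*":
        return True  # the wildcard range matches every tag
    # The range must equal the tag or be followed directly by a hyphen.
    return candidate == rng or candidate.startswith(rng + "-")


if __name__ == "__main__":
    print(basic_filter_matches("en-US", "en-Latn-US"))  # False
    print(basic_filter_matches("en-US", "en-US"))       # True
    print(basic_filter_matches("en", "en-Latn-US"))     # True
```

An unnecessary script subtag therefore hides content from users whose ranges include a region, which is the practical reason for following 'Suppress-Script'.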
4.1.1. Tagging Encompassed Languages
Some primary language records in the registry have a 'Macrolanguage'
field (Section 3.1.10) that contains a mapping from each "encompassed
language" to its macrolanguage. The 'Macrolanguage' mapping doesn't
define what the relationship between the encompassed language and its
macrolanguage is, nor does it define how languages encompassed by the
same macrolanguage are related to each other. Two different
languages encompassed by the same macrolanguage may differ from one
another more than, say, French and Spanish do.
A few specific macrolanguages, such as Chinese ('zh') and Arabic
('ar'), are handled differently. See Section 4.1.2.
The more specific encompassed language subtag SHOULD be used to form
the language tag, although either the macrolanguage's primary
language subtag or the encompassed language's subtag MAY be used.
This means, for example, tagging Plains Cree with 'crk' rather than
'cr' (Cree), and so forth.
Each macrolanguage subtag's scope, by definition, includes all of its
encompassed languages. Since the relationship between encompassed
languages varies, users cannot assume that the macrolanguage subtag
means any particular encompassed language, nor that any given pair of
encompassed languages are mutually intelligible or otherwise
interchangeable.
Applications MAY use macrolanguage information to improve matching or
language negotiation. For example, the information that 'sr'
(Serbian) and 'hr' (Croatian) share a macrolanguage expresses a
closer relation between those languages than between, say, 'sr'
(Serbian) and 'mk' (Macedonian). However, this relationship is not
guaranteed nor is it exclusive. For example, Romanian ('ro') and
Moldavian ('mo') do not share a macrolanguage, but are far more
closely related to each other than Cantonese ('yue') and Wu ('wuu'),
which do share a macrolanguage.
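An application that wants to use macrolanguage information as a matching hint might proceed along these lines. The table is a small hand-copied excerpt of 'Macrolanguage' field values and is an assumption of this sketch; a real implementation would load the values from the current registry. The function name is hypothetical.

```python
# Excerpt of 'Macrolanguage' mappings (encompassed language -> macrolanguage).
# Hand-copied for illustration; verify against the registry before relying on it.
MACROLANGUAGE = {
    "sr": "sh", "hr": "sh", "bs": "sh",
    "cmn": "zh", "yue": "zh", "wuu": "zh",
    "crk": "cr",
}


def share_macrolanguage(first: str, second: str) -> bool:
    """Return True if both primary language subtags map to one macrolanguage."""
    macro_first = MACROLANGUAGE.get(first.lower())
    macro_second = MACROLANGUAGE.get(second.lower())
    return macro_first is not None and macro_first == macro_second


if __name__ == "__main__":
    print(share_macrolanguage("sr", "hr"))    # True
    print(share_macrolanguage("yue", "wuu"))  # True
    print(share_macrolanguage("ro", "mo"))    # False
```

As the text cautions, a True result is only a hint: 'yue' and 'wuu' share a macrolanguage yet are not mutually intelligible, while 'ro' and 'mo' share none and are closely related.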
4.1.2. Using Extended Language Subtags
To accommodate language tag forms used prior to the adoption of this
document, language tags provide a special compatibility mechanism:
the extended language subtag. Selected languages have been provided
with both primary and extended language subtags. These include
macrolanguages, such as Malay ('ms') and Uzbek ('uz'), that have a
specific dominant variety that is generally synonymous with the
macrolanguage. Other languages, such as the Chinese ('zh') and
Arabic ('ar') macrolanguages and the various sign languages ('sgn'),
have traditionally used their primary language subtag, possibly
coupled with various region subtags or as part of a registered
grandfathered tag, to indicate the language.
With the adoption of this document, specific ISO 639-3 subtags became
available to identify the languages contained within these diverse
language families or groupings. This presents a choice of language
tags where previously none existed:
o Each encompassed language's subtag SHOULD be used as the primary
language subtag. For example, a document in Mandarin Chinese
would be tagged "cmn" (the subtag for Mandarin Chinese) in
preference to "zh" (Chinese).
o If compatibility is desired or needed, the encompassed subtag MAY
be used as an extended language subtag. For example, a document
in Mandarin Chinese could be tagged "zh-cmn" instead of either
"cmn" or "zh".
o The macrolanguage or prefixing subtag MAY still be used to form
the tag instead of the more specific encompassed language subtag.
That is, tags such as "zh-HK" or "sgn-RU" are still valid.
Chinese ('zh') provides a useful illustration of this. In the past,
various content has used tags beginning with the 'zh' subtag, with
application-specific meaning being associated with region codes,
private use sequences, or grandfathered registered values. This is
because historically only the macrolanguage subtag 'zh' was available
for forming language tags. However, the languages encompassed by the
Chinese subtag 'zh' are, in the main, not mutually intelligible when
spoken, and the written forms of these languages also show wide
variation in form and usage.
To provide compatibility, Chinese languages encompassed by the 'zh'
subtag are in the registry both as primary language subtags and as
extended language subtags. For example, the ISO 639-3 code for
Cantonese is 'yue'. Content in Cantonese might historically have
used a tag such as "zh-HK" (since Cantonese is commonly spoken in
Hong Kong), although that tag actually means any type of Chinese as
used in Hong Kong. With the availability of ISO 639-3 codes in the
registry, content in Cantonese can be directly tagged using the 'yue'
subtag. The content can use it as a primary language subtag, as in
the tag "yue-HK" (Cantonese, Hong Kong). Or it can use an extended
language subtag with 'zh', as in the tag "zh-yue-Hant" (Chinese,
Cantonese, Traditional script).
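The two equivalent spellings described above (encompassed language as primary subtag, or as extended language subtag after its prefix) can be converted into one another mechanically. The sketch below assumes a tiny excerpt of extended language subtags and their prefixes; the table and both function names are hypothetical, and a real implementation would read the 'Prefix' field of each extlang record in the registry.

```python
# Excerpt: extended language subtag -> its required prefix (illustrative only).
EXTLANG_PREFIX = {"cmn": "zh", "yue": "zh", "wuu": "zh", "arb": "ar", "ase": "sgn"}


def to_primary_form(tag: str) -> str:
    """Rewrite 'zh-yue-Hant' as 'yue-Hant'; leave other tags unchanged."""
    subtags = tag.split("-")
    if (len(subtags) >= 2
            and EXTLANG_PREFIX.get(subtags[1].lower()) == subtags[0].lower()):
        subtags = subtags[1:]  # drop the macrolanguage prefix
    return "-".join(subtags)


def to_extlang_form(tag: str) -> str:
    """Rewrite 'yue-HK' as 'zh-yue-HK'; leave other tags unchanged."""
    subtags = tag.split("-")
    prefix = EXTLANG_PREFIX.get(subtags[0].lower())
    if prefix is not None:
        subtags.insert(0, prefix)  # put the macrolanguage in front
    return "-".join(subtags)


if __name__ == "__main__":
    print(to_primary_form("zh-yue-Hant"))  # yue-Hant
    print(to_extlang_form("cmn"))          # zh-cmn
    print(to_primary_form("zh-HK"))        # zh-HK
```

A tag such as "zh-HK" passes through untouched because it names no encompassed language; nothing in the tag says which variety of Chinese is meant.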
As noted above, applications can choose to use the macrolanguage
subtag to form the tag instead of using the more specific encompassed
language subtag. For example, an application with large quantities
of data already using tags with the 'zh' (Chinese) subtag might
continue to use this more general subtag even for new data, even
though the content could be more precisely tagged with 'cmn'
(Mandarin), 'yue' (Cantonese), 'wuu' (Wu), and so on. Similarly, an
application already using tags that start with the 'ar' (Arabic)
subtag might continue to use this more general subtag even for new
data, which could be more precisely tagged with 'arb' (Standard
Arabic).
In some cases, the encompassed languages had tags registered for them
during the RFC 3066 era. Those grandfathered tags not already
deprecated or rendered redundant were deprecated in the registry upon
adoption of this document. As grandfathered values, they remain
valid for use, and some content or applications might use them. As
with other grandfathered tags, since implementations might not be
able to associate the grandfathered tags with the encompassed
language subtag equivalents that are recommended by this document,
implementations are encouraged to canonicalize tags for comparison
purposes. Some examples of this include the tags "zh-hakka" (Hakka)
and "zh-guoyu" (Mandarin or Standard Chinese).
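Canonicalizing for comparison can be as simple as replacing a deprecated grandfathered tag with its 'Preferred-Value' before comparing. The sketch below uses a three-entry excerpt of such mappings, copied by hand for illustration; the function names are hypothetical, and an implementation would read the values from the registry.

```python
# Excerpt: grandfathered tag (lowercased) -> its 'Preferred-Value'.
# Hand-copied for illustration; the registry is the authoritative source.
PREFERRED_VALUE = {
    "art-lojban": "jbo",
    "zh-hakka": "hak",
    "zh-guoyu": "cmn",
}


def canonicalize(tag: str) -> str:
    """Replace a whole grandfathered tag with its preferred value, if any."""
    return PREFERRED_VALUE.get(tag.lower(), tag)


def same_language_tag(first: str, second: str) -> bool:
    """Compare two tags after canonicalization, ignoring case."""
    return canonicalize(first).lower() == canonicalize(second).lower()


if __name__ == "__main__":
    print(canonicalize("zh-guoyu"))               # cmn
    print(same_language_tag("zh-hakka", "hak"))   # True
    print(same_language_tag("zh-hakka", "cmn"))   # False
```

Only whole-tag replacement is shown here. Subtag-level 'Preferred-Value' mappings and extended language forms such as "zh-cmn" would need additional steps.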
Sign languages share a mode of communication rather than a linguistic
heritage. There are many sign languages that have developed
independently, and the subtag 'sgn' indicates only the presence of a
sign language. A number of sign languages also had grandfathered
tags registered for them during the RFC 3066 era. For example, the
grandfathered tag "sgn-US" was registered to represent 'American Sign
Language' specifically, without reference to the United States. This
is still valid, but deprecated: a document in American Sign Language
can be labeled either "ase" or "sgn-ase" (the 'ase' subtag is for the
language called 'American Sign Language').
4.2. Meaning of the Language Tag
The meaning of a language tag is related to the meaning of the
subtags that it contains. Each subtag, in turn, implies a certain
range of expectations one might have for related content, although it
is not a guarantee. For example, the use of a script subtag such as
'Arab' (Arabic script) does not mean that the content contains only
Arabic characters. It does mean that the language involved is
predominantly in the Arabic script. Thus, a language tag and its
subtags can encompass a very wide range of variation and yet remain
appropriate in each particular instance.
Validity of a tag is not the only factor determining its usefulness.
While every valid tag has a meaning, it might not represent any real-
world language usage. This is unavoidable in a system in which
subtags can be combined freely. For example, tags such as
"ar-Cyrl-CO" (Arabic, Cyrillic script, as used in Colombia) or "tlh-
Kore-AQ-fonipa" (Klingon, Korean script, as used in Antarctica, IPA
phonetic transcription) are both valid and unlikely to represent a
useful combination of language attributes.
The meaning of a given tag doesn't depend on the context in which it
appears. The relationship between a tag's meaning and the
information objects to which that tag is applied, however, can vary.
o For a single information object, the associated language tags
might be interpreted as the set of languages that is necessary for
a complete comprehension of the complete object. Example: Plain
text documents.
o For an aggregation of information objects, the associated language
tags could be taken as the set of languages used inside components
of that aggregation. Examples: Document stores and libraries.
o For information objects whose purpose is to provide alternatives,
the associated language tags could be regarded as a hint that the
content is provided in several languages and that one has to
inspect each of the alternatives in order to find its language or
languages. In this case, the presence of multiple tags might not
mean that one needs to be multilingual to get complete
understanding of the document. Example: MIME multipart/
o For markup languages, such as HTML and XML, language information
can be added to each part of the document identified by the markup
structure (including the whole document itself). For example, one
could write <span lang="fr">C'est la vie.</span> inside a German
document; the German-speaking user could then access a French-
German dictionary to find out what the marked section meant. If
the user were listening to that document through a speech
synthesis interface, this information could be used to signal the
synthesizer to appropriately apply French text-to-speech
pronunciation rules to that span of text, instead of applying the
inappropriate German rules.
o For markup languages and document formats that allow the audience
to be identified, a language tag could indicate the audience(s)
appropriate for that document. For example, the same HTML
document described in the preceding bullet might have an HTTP
header "Content-Language: de" to indicate that the intended
audience for the file is German (even though three words appear
and are identified as being in French within it).
o For systems and APIs, language tags form the basis for most
implementations of locale identifiers. For example, see Unicode's
CLDR (Common Locale Data Repository) project (UTS #35 [UTS35]).
Language tags are related when they contain a similar sequence of
subtags. For example, if a language tag B contains language tag A as
a prefix, then B is typically "narrower" or "more specific" than A.
Thus, "zh-Hant-TW" is more specific than "zh-Hant".
This relationship is not guaranteed in all cases: specifically,
languages that begin with the same sequence of subtags are NOT
guaranteed to be mutually intelligible, although they might be. For
example, the tag "az" shares a prefix with both "az-Latn"
(Azerbaijani written using the Latin script) and "az-Cyrl"
(Azerbaijani written using the Cyrillic script). A person fluent in
one script might not be able to read the other, even though the
linguistic content (e.g., what would be heard if both texts were read
aloud) might be identical. Content tagged as "az" most probably is
written in just one script and thus might not be intelligible to a
reader familiar with the other script.
Similarly, not all subtags specify an actual distinction in language.
For example, the tags "en-US" and "en-CA" mean, roughly, English with
features generally thought to be characteristic of the United States
and Canada, respectively. They do not imply that a significant
dialectal boundary exists between any arbitrarily selected point in
the United States and any arbitrarily selected point in Canada.
Neither does a particular region subtag imply that linguistic
distinctions do not exist within that region.
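As an illustration only, the prefix relationship described above can
be tested by comparing whole subtags in a case-insensitive manner.
The following Python sketch is not part of this specification, and
the function name 'is_narrower' is hypothetical. It detects only the
syntactic relationship between two tags; as noted above, it says
nothing about whether the content they label is mutually
intelligible.

```python
def is_narrower(tag_b: str, tag_a: str) -> bool:
    """Return True if tag_a is a proper, subtag-wise prefix of tag_b."""
    a = tag_a.lower().split("-")
    b = tag_b.lower().split("-")
    # Compare whole subtags so that "zh-Han" is not treated as a
    # prefix of "zh-Hant".
    return len(b) > len(a) and b[:len(a)] == a


print(is_narrower("zh-Hant-TW", "zh-Hant"))  # True
print(is_narrower("az-Latn", "az"))          # True
print(is_narrower("zh-Hant-TW", "zh-Han"))   # False
```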
4.3. Lists of Languages
In some applications, a single content item might best be associated
with more than one language tag. Examples of such a usage include:
o Content items that contain multiple, distinct varieties. Often
this is used to indicate an appropriate audience for a given
content item when multiple choices might be appropriate. Examples
of this could include:
* Metadata about the appropriate audience for a movie title. For
example, a DVD might label its individual audio tracks 'de'
(German), 'fr' (French), and 'es' (Spanish), but the overall
title would list "de, fr, es" as its overall audience.
* A French/English, English/French dictionary tagged as both "en"
and "fr" to specify that it applies equally to French and
English.
* A side-by-side or interlinear translation of a document, as is
commonly done with classical works in Latin or Greek.
o Content items that contain a single language but that require
multiple levels of specificity. For example, a library might wish
to classify a particular work as both Norwegian ('no') and as
Nynorsk ('nn') for audiences capable of appreciating the
distinction or needing to select content more narrowly.
4.4. Length Considerations
There is no defined upper limit on the size of language tags. While
historically most language tags have consisted of language and region
subtags with a combined total length of up to six characters, larger
tags have always been possible and have actually appeared in use.
Neither the language tag syntax nor other requirements in this
document impose a fixed upper limit on the number of subtags in a
language tag (and thus an upper bound on the size of a tag). The
language tag syntax suggests that, depending on the specific
language, more subtags (and thus a longer tag) are sometimes
necessary to completely identify the language for certain
applications; thus, it is possible to envision long or complex subtag
sequences.
4.4.1. Working with Limited Buffer Sizes
Some applications and protocols are forced to allocate fixed buffer
sizes or otherwise limit the length of a language tag. A conformant
implementation or specification MAY refuse to support the storage of
language tags that exceed a specified length. Any such limitation
SHOULD be clearly documented, and such documentation SHOULD include
what happens to longer tags (for example, whether an error value is
generated or the language tag is truncated). A protocol that allows
tags to be truncated at an arbitrary limit, without giving any
indication of what that limit is, has the potential to cause harm by
changing the meaning of tags in substantial ways.
In practice, most language tags do not require more than a few
subtags and will not approach reasonably sized buffer limitations;
see Section 4.1.
Some specifications or protocols have limits on tag length but do not
have a fixed length limitation. For example, [RFC2231] has no
explicit length limitation: the length available for the language tag
is constrained by the length of other header components (such as the
charset's name) coupled with the 76-character limit in [RFC2047].
Thus, the "limit" might be 50 or more characters, but it could
potentially be quite small.
The considerations for assigning a buffer limit are:
Implementations SHOULD NOT truncate language tags unless the
meaning of the tag is purposefully being changed, or unless the
tag does not fit into a limited buffer size specified by a
protocol for storage or transmission.
Implementations SHOULD warn the user when a tag is truncated since
truncation changes the semantic meaning of the tag.
Implementations of protocols or specifications that are space
constrained but do not have a fixed limit SHOULD use the longest
possible tag in preference to truncation.
Protocols or specifications that specify limited buffer sizes for
language tags MUST allow for language tags of at least 35
characters. Note that [RFC4646] recommended a minimum field size
of 42 characters because it included all three elements of the
'extlang' production. Two of these are now permanently reserved,
so a registered primary language subtag of the maximum length of 8
characters is now longer than the longest language-extlang
combination. Protocols or specifications that commonly use
extensions or private use subtags might wish to reserve or
recommend a longer "minimum buffer" size.
The following illustration shows how the 35-character recommendation
was derived:
language      =  8 ; longest allowed registered value
                   ;   longer than primary+extlang
                   ;   which requires 7 characters
script        =  5 ; if not suppressed: see Section 4.1
region        =  4 ; UN M.49 numeric region code
                   ;   ISO 3166-1 codes require 3
variant1      =  9 ; needs 'language' as a prefix
variant2      =  9 ; very rare, as it needs
                   ;   'language-variant1' as a prefix
total         = 35 characters
Figure 7: Derivation of the Limit on Tag Length
4.4.2. Truncation of Language Tags
Truncation of a language tag alters the meaning of the tag, and thus
SHOULD be avoided. However, truncation of language tags is sometimes
necessary due to limited buffer sizes. Such truncation MUST NOT
permit a subtag to be chopped off in the middle or the formation of
invalid tags (for example, one ending with the "-" character).
This means that applications or protocols that truncate tags MUST do
so by progressively removing subtags along with their preceding "-"
from the right side of the language tag until the tag is short enough
for the given buffer. If the resulting tag ends with a single-
character subtag, that subtag and its preceding "-" MUST also be
removed. For example:
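Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
1. zh-Latn-CN-variant1-a-extend1-x-wadegile
2. zh-Latn-CN-variant1-a-extend1
3. zh-Latn-CN-variant1
4. zh-Latn-CN
5. zh-Latn
6. zh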
Figure 8: Example of Tag Truncation
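The truncation rule above can be sketched in Python as follows. This
is a minimal, non-normative illustration; the function name
'truncate_tag' and the 'max_len' parameter are hypothetical, and the
sketch assumes its input is a well-formed tag. Whole subtags are
removed from the right, and a singleton left at the end is removed
as well, so the result never ends in "-" or in a single-character
subtag.

```python
def truncate_tag(tag: str, max_len: int) -> str:
    """Shorten a language tag to at most max_len characters."""
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > max_len:
        subtags.pop()  # drop the rightmost subtag and its "-"
        # A singleton left at the end introduces nothing; drop it too.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    # An empty string means that no part of the tag fits.
    return "-".join(subtags)


tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
print(truncate_tag(tag, 35))  # zh-Latn-CN-variant1-a-extend1
print(truncate_tag(tag, 10))  # zh-Latn-CN
```

Because truncation changes the meaning of the tag, a caller would
normally also report that truncation took place, for example by
comparing the returned value with the input.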
4.5. Canonicalization of Language Tags
Since a particular language tag can be used by many processes,
language tags SHOULD always be created or generated in canonical
form.
A language tag is in 'canonical form' when the tag is well-formed
according to the rules in Sections 2.1 and 2.2 and it has been
canonicalized by applying each of the following steps in order, using
data from the IANA registry (see Section 3.1):
1. Extension sequences are ordered into case-insensitive ASCII order
by singleton subtag.
* For example, the subtag sequence '-a-babble' comes before
2. Redundant or grandfathered tags are replaced by their 'Preferred-
Value', if there is one.
* The field-body of the 'Preferred-Value' for grandfathered and
redundant tags is an "extended language range" [RFC4647] and
might consist of more than one subtag.
* 'Preferred-Value' fields in the registry provide mappings from
deprecated tags to modern equivalents. Many of these were
created before the adoption of this document (such as the
mapping of "no-nyn" to "nn" or "i-klingon" to "tlh"). Others
are the result of later registrations or additions to the
registry as permitted or required by this document (for
example, "zh-hakka" was deprecated in favor of the ISO 639-3
code 'hak' when this document was adopted).
3. Subtags are replaced by their 'Preferred-Value', if there is one.
For extlangs, the original primary language subtag is also
replaced if there is a primary language subtag in the
'Preferred-Value'.
* The field-body of the 'Preferred-Value' for extlangs is an
"extended language range" and typically maps to a primary
language subtag. For example, the subtag sequence "zh-hak"
(Chinese, Hakka) is replaced with the subtag 'hak' (Hakka).
* Most of the non-extlang subtags are either Region subtags
where the country name or designation has changed or clerical
corrections to ISO 639-1.
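The three steps above can be sketched in Python as follows. This is
an illustration under stated assumptions, not a complete
implementation: the three lookup tables are a hypothetical excerpt
holding only mappings mentioned in this section, whereas a real
implementation would load the IANA registry; subtag types are
inferred from position and length only; and the input is assumed to
be well-formed.

```python
# Hypothetical excerpt of registry data (assumption of this sketch).
TAG_PREFERRED = {"i-klingon": "tlh", "no-nyn": "nn", "zh-hakka": "hak"}
REGION_PREFERRED = {"bu": "MM"}
EXTLANG = {"hak": {"prefix": "zh", "preferred": "hak"}}


def canonicalize(tag: str) -> str:
    subtags = tag.split("-")
    if subtags[0].lower() == "x":
        return tag  # a tag that is entirely private use is left alone

    # Locate the first singleton after the primary language subtag.
    i = 1
    while i < len(subtags) and len(subtags[i]) != 1:
        i += 1
    main, rest = subtags[:i], subtags[i:]

    # Split the remainder into extension sequences and a private use
    # sequence; everything from 'x' onward is private use.
    extensions, private = [], []
    j = 0
    while j < len(rest):
        if rest[j].lower() == "x":
            private = rest[j:]
            break
        k = j + 1
        while k < len(rest) and len(rest[k]) != 1:
            k += 1
        extensions.append(rest[j:k])
        j = k

    # Step 1: order extension sequences by singleton, ignoring case.
    extensions.sort(key=lambda seq: seq[0].lower())
    tail = [s for seq in extensions for s in seq] + private

    # Step 2: replace a whole grandfathered or redundant tag.
    preferred = TAG_PREFERRED.get("-".join(main + tail).lower())
    if preferred is not None:
        main, tail = preferred.split("-"), []

    # Step 3: replace individual subtags that have a Preferred-Value.
    out = [main[0]]
    for pos, subtag in enumerate(main[1:], start=1):
        low = subtag.lower()
        ext = EXTLANG.get(low) if pos == 1 and len(subtag) == 3 else None
        if ext is not None and main[0].lower() == ext["prefix"]:
            out = [ext["preferred"]]  # replaces the primary language too
        elif len(subtag) == 2 and low in REGION_PREFERRED:
            out.append(REGION_PREFERRED[low])
        else:
            out.append(subtag)
    return "-".join(out + tail)


print(canonicalize("en-b-ccc-bbb-a-aaa-x-xyz"))  # en-a-aaa-b-ccc-bbb-x-xyz
print(canonicalize("zh-hak-CN"))                 # hak-CN
print(canonicalize("en-BU"))                     # en-MM
```

The sketch leaves the case of the input subtags untouched and uses
lowercase only for lookups, since canonical form does not depend on
case.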
The canonical form contains no 'extlang' subtags. There is an
alternate 'extlang form' that maintains or reinstates extlang
subtags. This form can be useful in environments where the presence
of the 'Prefix' subtag is considered beneficial in matching or
selection (see Section 4.1.2).
A language tag is in 'extlang form' when the tag is well-formed
according to the rules in Sections 2.1 and 2.2 and it has been
processed by applying each of the following two steps in order, using
data from the IANA registry:
1. The language tag is first transformed into canonical form, as
described above.
2. If the language tag starts with a primary language subtag that is
also an extlang subtag, then the language tag is prepended with
the extlang's 'Prefix'.
* For example, "hak-CN" (Hakka, China) has the primary language
subtag 'hak', which in turn has an 'extlang' record with a
'Prefix' 'zh' (Chinese). The extlang form is "zh-hak-CN"
(Chinese, Hakka, China).
* Note that Step 2 (prepending a prefix) can restore a subtag
that was removed by Step 1 (canonicalizing).
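Step 2 of the extlang form can be sketched in Python as shown below.
The sketch is illustrative only: it assumes that the caller has
already produced the canonical form (Step 1), and the table
'EXTLANG_PREFIX' is a hypothetical one-entry excerpt of the extlang
records in the registry.

```python
# Hypothetical excerpt: extlang subtag -> its 'Prefix' field.
EXTLANG_PREFIX = {"hak": "zh"}


def to_extlang_form(canonical_tag: str) -> str:
    """Prepend the extlang 'Prefix' when the primary language has one."""
    primary = canonical_tag.split("-", 1)[0]
    prefix = EXTLANG_PREFIX.get(primary.lower())
    return f"{prefix}-{canonical_tag}" if prefix else canonical_tag


print(to_extlang_form("hak-CN"))  # zh-hak-CN
print(to_extlang_form("en-US"))   # en-US
```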
Example: The language tag "en-a-aaa-b-ccc-bbb-x-xyz" is in canonical
form, while "en-b-ccc-bbb-a-aaa-X-xyz" is well-formed and potentially
valid (extensions 'a' and 'b' are not defined as of the publication
of this document) but not in canonical form (the extensions are not
in alphabetical order).
Example: Although the tag "en-BU" (English as used in Burma)
maintains its validity, the language tag "en-BU" is not in canonical
form because the 'BU' subtag has a canonical mapping to 'MM'
(Myanmar).
Canonicalization of language tags does not imply anything about the
use of upper- or lowercase letters when processing or comparing
subtags (and as described in Section 2.1). All comparisons MUST be
performed in a case-insensitive manner.
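A minimal Python sketch of case-insensitive comparison follows,
together with an optional helper that regularizes case for
presentation. Both function names are hypothetical. The casing
convention used by 'regularize_case' is an assumption of this sketch:
every subtag is lowercase, except that two-letter subtags are
uppercase and four-letter subtags are titlecase when they are neither
the first subtag nor located after a singleton.

```python
def tags_equal(tag_a: str, tag_b: str) -> bool:
    """Compare two language tags without regard to case."""
    return tag_a.lower() == tag_b.lower()


def regularize_case(tag: str) -> str:
    """Apply conventional casing to a well-formed language tag."""
    out = []
    after_singleton = False
    for i, subtag in enumerate(tag.split("-")):
        if i == 0 or after_singleton:
            out.append(subtag.lower())
        elif len(subtag) == 2:
            out.append(subtag.upper())       # e.g., region 'CA'
        elif len(subtag) == 4:
            out.append(subtag.capitalize())  # e.g., script 'Latn'
        else:
            out.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(out)


print(tags_equal("mn-Cyrl-MN", "MN-cYRL-mn"))  # True
print(regularize_case("EN-ca-X-CA"))           # en-CA-x-ca
```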
When performing canonicalization of language tags, processors MAY
regularize the case of the subtags (that is, this process is
OPTIONAL), following the case used in the registry (see
If more than one variant appears within a tag, processors MAY reorder
the variants to obtain better matching behavior or more consistent
presentation. Reordering of the variants SHOULD follow the
recommendations for variant ordering in Section 4.1.
If the field 'Deprecated' appears in a registry record without an
accompanying 'Preferred-Value' field, then that tag or subtag is
deprecated without a replacement. These values are canonical when
they appear in a language tag. However, tags that include these
values SHOULD NOT be selected by users or generated by
implementations.
An extension MUST define any relationships that exist between the
various subtags in the extension and thus MAY define an alternate
canonicalization scheme for the extension's subtags. Extensions MAY
define how the order of the extension's subtags is interpreted. For
example, an extension could define that its subtags are in canonical
order when the subtags are placed into ASCII order: that is, "en-a-
aaa-bbb-ccc" instead of "en-a-ccc-bbb-aaa". Another extension might
define that the order of the subtags influences their semantic
meaning (so that "en-b-ccc-bbb-aaa" has a different value from "en-b-
aaa-bbb-ccc"). However, extension specifications SHOULD be designed
so that they are tolerant of the typical processes described in
4.6. Considerations for Private Use Subtags
Private use subtags, like all other subtags, MUST conform to the
format and content constraints in the ABNF. Private use subtags have
no meaning outside the private agreement between the parties that
intend to use or exchange language tags that employ them. The same
subtags MAY be used with a different meaning under a separate private
agreement. They SHOULD NOT be used where alternatives exist and
SHOULD NOT be used in content or protocols intended for general use.
Private use subtags are simply useless for information exchange
without prior arrangement. The value and semantic meaning of private
use tags and of the subtags used within such a language tag are not
defined by this document.
Private use sequences introduced by the 'x' singleton are completely
opaque to users or implementations outside of the private use
agreement. So, in addition to private use subtag sequences
introduced by the singleton subtag 'x', the Language Subtag Registry
provides private use language, script, and region subtags derived
from the private use codes assigned by the underlying standards.
These subtags are valid for use in forming language tags; they are
RECOMMENDED over the 'x' singleton private use subtag sequences
because they convey more information via their linkage to the
language tag's inherent structure.
For example, the region subtags 'AA', 'ZZ', and those in the ranges
'QM'-'QZ' and 'XA'-'XZ' (derived from the ISO 3166-1 private use
codes) can be used to form a language tag. A tag such as
"zh-Hans-XQ" conveys a great deal of public, interchangeable
information about the language material (that it is Chinese in the
simplified Chinese script and is suitable for some geographic region
'XQ'). While the precise geographic region is not known outside of
private agreement, the tag conveys far more information than an
opaque tag such as "x-somelang" or even "zh-Hans-x-xq" (where the
'xq' subtag's meaning is entirely opaque).
However, in some cases content tagged with private use subtags can
interact with other systems in a different and possibly unsuitable
manner compared to tags that use opaque, privately defined subtags,
so the choice of the best approach sometimes depends on the
particular domain in question.
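For illustration, an implementation could recognize the private use
region subtags listed above with a check such as the following
Python sketch. The function name is hypothetical, and the sketch
covers only region subtags, not the private use language or script
subtags.

```python
def is_private_use_region(subtag: str) -> bool:
    """True for 'AA', 'ZZ', 'QM'-'QZ', and 'XA'-'XZ' (any case)."""
    s = subtag.upper()
    if len(s) != 2 or not (s.isascii() and s.isalpha()):
        return False
    return (
        s in ("AA", "ZZ")
        or (s[0] == "Q" and "M" <= s[1] <= "Z")
        or (s[0] == "X" and "A" <= s[1] <= "Z")
    )


print(is_private_use_region("XQ"))  # True
print(is_private_use_region("US"))  # False
```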
5. IANA Considerations
This section deals with the processes and requirements necessary for
IANA to maintain the subtag and extension registries as defined by
this document and in accordance with the requirements of [RFC5226].
The impact on the IANA maintainers of the two registries defined by
this document will be a small increase in the frequency of new
entries or updates. IANA also is required to create a new mailing
list (described below in Section 5.1) to announce registry changes
and updates.
5.1. Language Subtag Registry
IANA updated the registry using instructions and content provided in
a companion document [RFC5645]. The criteria and process for
selecting the updated set of records are described in that document.
The updated set of records represents no impact on IANA, since the
work to create it will be performed externally.
Future work on the Language Subtag Registry includes the following
activities:
o Inserting or replacing whole records. These records are
preformatted for IANA by the Language Subtag Reviewer, as
described in Section 3.3.
o Archiving and making publicly available the registration forms.
o Announcing each updated version of the registry on the
"email@example.com" mailing list.
Each registration form sent to IANA contains a single record for
incorporation into the registry. The form will be sent to
<firstname.lastname@example.org> by the Language Subtag Reviewer. It will have a
subject line indicating whether the enclosed form represents an
insertion of a new record (indicated by the word "INSERT" in the
subject line) or a replacement of an existing record (indicated by
the word "MODIFY" in the subject line). At no time can a record be
deleted from the registry.
IANA will extract the record from the form and place the inserted or
modified record into the appropriate section of the Language Subtag
Registry, grouping the records by their 'Type' field. Inserted
records can be placed anywhere within the appropriate section; there
is no guarantee that the registry's records will be placed in any
particular order except that they will always be grouped by 'Type'.
Modified records overwrite the record they replace.
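The insertion and replacement behavior described above can be
sketched in Python as follows. The data layout is an assumption made
for illustration: the registry is modeled as a dictionary that maps
each 'Type' value to a list of records, each record is a dictionary
of field names to values, and a record is identified by its 'Subtag'
or 'Tag' field. The function name 'apply_registration' is
hypothetical.

```python
def _key(record: dict) -> str:
    # A record carries either a 'Subtag' or a 'Tag' field.
    return (record.get("Subtag") or record.get("Tag")).lower()


def apply_registration(registry: dict, action: str, record: dict) -> None:
    """Apply one INSERT or MODIFY request to the registry in place."""
    section = registry.setdefault(record["Type"], [])
    matches = [i for i, r in enumerate(section) if _key(r) == _key(record)]
    if action == "INSERT":
        if matches:
            raise ValueError("record already exists; use MODIFY")
        section.append(record)  # position within the section is arbitrary
    elif action == "MODIFY":
        if not matches:
            raise ValueError("no existing record to replace")
        section[matches[0]] = record  # overwrite the whole record
    else:
        # Records are never deleted, so no other action is accepted.
        raise ValueError(f"unsupported action: {action}")


registry = {}
apply_registration(registry, "INSERT",
                   {"Type": "variant", "Subtag": "rozaj"})
apply_registration(registry, "MODIFY",
                   {"Type": "variant", "Subtag": "rozaj",
                    "Comments": "updated record"})
print(len(registry["variant"]))  # 1
```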
Whenever an entry is created or modified in the registry, the 'File-
Date' record at the start of the registry is updated to reflect the
most recent modification date. The date format SHALL be the "full-
date" format of [RFC3339]. The date SHALL be the date on which that
version of the registry was first published by IANA. There SHALL be
at most one version of the registry published in a day. A 'File-
Date' record is also included in each request to IANA to insert or
modify records, indicating the acceptance date of the records in the
request.
The updated registry file MUST use the UTF-8 character encoding, and
IANA MUST check the registry file for proper encoding. Non-ASCII
characters can be sent to IANA by attaching the registration form to
the email message or by using various encodings in the mail message
body (UTF-8 is recommended). IANA will verify any unclear or
corrupted characters with the Language Subtag Reviewer prior to
posting the updated registry.
IANA will also archive and make publicly available from
http://www.iana.org each registration form. Note that multiple
registrations can pertain to the same record in the registry.
Developers who are dependent upon the Language Subtag Registry
sometimes would like to be informed of changes in the registry so
that they can update their implementations. When any change is made
to the Language Subtag Registry, IANA will send an announcement
message to <email@example.com> (a self-
subscribing list to which only IANA can post).
5.2. Extensions Registry
The Language Tag Extensions Registry can contain at most 35 records,
and thus changes to this registry are expected to be very infrequent.
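The figure of 35 can be checked with a short calculation. The Python
sketch below assumes that a singleton is any single ASCII letter or
digit, compared without regard to case, and that 'x' is excluded
because it is reserved for private use.

```python
import string

# 26 letters + 10 digits, minus the private use singleton 'x'.
singletons = [c for c in string.ascii_lowercase + string.digits
              if c != "x"]
print(len(singletons))  # 35
```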
Future work by IANA on the Language Tag Extensions Registry is
limited to two cases. First, the IESG MAY request that new records
be inserted into this registry from time to time. These requests
MUST include the record to insert in the exact format described in
Section 3.7. In addition, there MAY be occasional requests from the
maintaining authority for a specific extension to update the contact
information or URLs in the record. These requests MUST include the
complete, updated record. IANA is not responsible for validating the
information provided, only that it is properly formatted. IANA
SHOULD take reasonable steps to ascertain that the request comes from
the maintaining authority named in the record present in the