
RFC 7940


Representing Label Generation Rulesets Using XML

6.  Whole Label and Context Evaluation

6.1.  Basic Concepts

   The "rules" element contains the specification of both context-based
   and whole label rules.  Collectively, these are known as Whole Label
   Evaluation (WLE) rules (Section 6.3).  The "rules" element also
   contains the character classes (Section 6.2) that they depend on, and
   any actions (Section 7) that assign dispositions to labels based on
   rules or variant mappings.

   A whole label rule is applied to the whole label.  It is used to
   validate both original labels and any variant labels computed
   from them.

   A rule implementing a conditional context as discussed in Section 5.2
   does not necessarily apply to the whole label but may be specific to
   the context around a single code point or code point sequence.
   Certain code points in a label sometimes need to satisfy
   context-based rules -- for example, for the label to be considered
   valid, or to satisfy the context for a variant mapping (see the
   description of the "when" attribute in Section 6.4).

   For example, if a rule is referenced in the "when" attribute of a
   variant mapping, it is used to describe the conditional context under
   which the particular variant mapping is defined to exist.

   Each rule is defined in a "rule" element.  A rule may contain the
   following as child elements:

   o  literal code points or code point sequences

   o  character classes, which define sets of code points to be used for
      context comparisons

   o  context operators, which define when character classes and
      literals may appear

   o  nested rules, whether defined in place or invoked by reference

   Collectively, these are called "match operators" and are listed in
   Section 6.3.2.  An LGR containing rules or match operators that

   1.  are incorrectly defined or nested,

   2.  have invalid attributes, or

   3.  have invalid or undefined attribute values

   MUST be rejected.  Note that not all of the constraints defined here
   are validated by the schema.

6.2.  Character Classes

   Character classes are sets of characters that often share a
   particular property.  While they function like sets in every way,
   even supporting the usual set operators, they are called "character
   classes" here in a nod to the use of that term in regular expression
   syntax.  (This also avoids confusion with the term "character set" in
   the sense of character encoding.)

   Character classes can be specified in several ways:

   o  by defining the class via matching a tag in the code point data.
      All characters with the same "tag" attribute are part of the same
      class;

   o  by referencing a value of one of the Unicode character properties
      defined in the Unicode Character Database;

   o  by explicitly listing all the code points in the class; or

   o  by defining the class as a set combination of any number of other
      classes.

6.2.1.  Declaring and Invoking Named Classes

   A character class has an OPTIONAL "name" attribute consisting of a
   single identifier not containing spaces.  All names for classes must
   be unique.  If the "name" attribute is omitted, the class is
   anonymous and exists only inside the rule or combined class where it
   is defined.  A named character class is defined independently and can
   be referenced by name from within any rules or as part of other
   character class definitions.

       <class name="example" comment="an example class definition">
           0061 4E00
       </class>

       <class by-ref="example" />

   An empty "class" element with a "by-ref" attribute is a reference to
   an existing named class.  The "by-ref" attribute MUST NOT be used in
   the same "class" element with any of these attributes: "name",
   "from-tag", "property", or "ref".  The "name" attribute MUST be
   present if and only if the class is a direct child element of the
   "rules" element.  It is an error to reference a named class for which
   the definition has not been seen.

6.2.2.  Tag-Based Classes

   The "char" or "range" elements that are child elements of the "data"
   element MAY contain a "tag" attribute that consists of one or more
   space-separated tag values; for example:

       <char cp="0061" tag="letter lower"/>
       <char cp="4E00" tag="letter"/>

   This defines two tags for use with code point U+0061, the tag
   "letter" and the tag "lower".  Use

       <class name="letter" from-tag="letter" />
       <class name="lower" from-tag="lower" />

   to define two named character classes, "letter" and "lower",
   containing all code points with the respective tags, the first with
   0061 and 4E00 as elements, and the latter with 0061 but not 4E00 as
   an element.  The "name" attribute may be omitted for an anonymous
   in-place definition of a nested, tag-based class.

   Tag values are typically identifiers, with the addition of a few
   punctuation symbols, such as a colon.  Formally, they MUST correspond
   to the XML 1.0 Nmtoken production.  While a "tag" attribute may
   contain a list of tag values, the "from-tag" attribute MUST always
   contain a single tag value.

   If the document contains no "char" or "range" elements with a
   corresponding tag, the character class represents the empty set.
   This is valid, to allow a common "rules" element to be shared across
   files.  However, it is RECOMMENDED that implementations allow for a
   warning to ensure that referring to an undefined tag in this way is
   intentional.
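
   As a rough illustration of how tag-based classes could be derived,
   the following Python sketch collects code points by tag from a small
   "data" fragment.  The fragment omits the LGR namespace, the "digit"
   range is invented for the example, and skipping tagged sequences is
   an assumption of this sketch, not a requirement stated here.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment; a real LGR file declares a namespace.
DATA = """
<data>
  <char cp="0061" tag="letter lower"/>
  <char cp="4E00" tag="letter"/>
  <range first-cp="0030" last-cp="0039" tag="digit"/>
</data>
"""

def class_from_tag(data_root, tag):
    """Return the set of code points whose "tag" attribute lists `tag`."""
    members = set()
    for elem in data_root:
        if tag not in elem.get("tag", "").split():
            continue
        if elem.tag == "char":
            cps = elem.get("cp").split()
            if len(cps) == 1:  # assumption: tagged sequences are skipped
                members.add(int(cps[0], 16))
        elif elem.tag == "range":
            first = int(elem.get("first-cp"), 16)
            last = int(elem.get("last-cp"), 16)
            members.update(range(first, last + 1))
    return members

root = ET.fromstring(DATA)
letter = class_from_tag(root, "letter")
lower = class_from_tag(root, "lower")
undefined = class_from_tag(root, "no-such-tag")  # empty set, which is valid
```

   With this input, "letter" contains 0061 and 4E00, "lower" contains
   only 0061, and the unknown tag yields the empty set.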

6.2.3.  Unicode Property-Based Classes

   A class is defined in terms of Unicode properties by giving the
   Unicode property alias and the property value or property value
   alias, separated by a colon.

       <class name="virama" property="ccc:9" />

   The example above selects all code points for which the Unicode
   Canonical Combining Class (ccc) value is 9.  This value of the ccc is
   assigned to all code points that encode viramas.

   Unicode property values MUST be designated via a composite of the
   attribute name and value as defined for the property value in
   [UAX42], separated by a colon.  Loose matching of property values and
   names as described in [UAX44] is not appropriate for an XML schema
   and is not supported; it is likewise not supported in the XML
   representation [UAX42] of the Unicode Character Database itself.

   A property-based class MAY be anonymous, or, when defined as an
   immediate child of the "rules" element, it MAY be named to relate a
   formal property definition to its usage, such as the use of the value
   9 for ccc to designate a virama (or halant) in various scripts.

   Unicode properties may, in principle, change between versions of the
   Unicode Standard.  However, the values assigned for a given version
   are fixed.  If Unicode properties are used, a Unicode version MUST be
   declared in the "unicode-version" element in the header.  (Note: Some
   Unicode properties are by definition stable across versions and do
   not change once assigned; see [Unicode-Stability].)

   All implementations processing LGR files SHOULD provide support for
   the following minimal set of Unicode properties:

   o  General Category (gc)

   o  Script (sc)

   o  Canonical Combining Class (ccc)

   o  Bidi Class (bc)

   o  Arabic Joining Type (jt)

   o  Indic Syllabic Category (InSC)

   o  Deprecated (Dep)

   The short name for each property is given in parentheses.

   If a program that is using an LGR to determine the validity of a
   label encounters a property that it does not support, it MUST abort
   with an error.
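
   A minimal sketch of property-based selection, using Python's
   standard unicodedata module.  It handles only "ccc" and "gc", treats
   a one-letter gc value as a prefix of the two-letter categories, and
   raises an error for any other property.  The function name and the
   candidate set are illustrative assumptions.  The module reflects the
   Unicode version shipped with the interpreter, which may differ from
   the version an LGR declares.

```python
import unicodedata

# Unicode version behind the lookups below; compare it with the LGR's
# declared "unicode-version" before trusting the results.
UNICODE_VERSION = unicodedata.unidata_version

def property_class(spec, candidates):
    """Select the code points in `candidates` that match "alias:value"."""
    alias, _, value = spec.partition(":")
    if alias == "ccc":
        wanted = int(value)
        return {cp for cp in candidates
                if unicodedata.combining(chr(cp)) == wanted}
    if alias == "gc":
        # Simplification: "L" is treated as a prefix of Lu, Ll, Lt, Lm, Lo.
        return {cp for cp in candidates
                if unicodedata.category(chr(cp)).startswith(value)}
    # Unsupported property: abort with an error instead of guessing.
    raise ValueError(f"unsupported Unicode property: {alias}")

# a, DEVANAGARI LETTER KA, DEVANAGARI SIGN VIRAMA, ARABIC-INDIC DIGIT NINE
candidates = {0x0061, 0x0915, 0x094D, 0x0669}
virama = property_class("ccc:9", candidates)
digits = property_class("gc:Nd", candidates)
```

   Here "virama" selects only 094D and "digits" selects only 0669.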

6.2.4.  Explicitly Declared Classes

   A class of code points may also be declared by listing all code
   points that are members of the class.  This is useful when tagging
   cannot be used because code points are not listed individually as
   part of the eligible set of code points for the given LGR -- for
   example, because they only occur in code point sequences.

   To define a class in terms of an explicit list of code points, use a
   space-separated list of hexadecimal code point values:

       <class name="abcd">0061 0062 0063 0064</class>

   This defines a class named "abcd" containing the code points for
   characters "a", "b", "c", and "d".  The ordering of the code points
   is not material, but it is RECOMMENDED to list them in ascending
   order; not doing so makes it unnecessarily difficult for users to
   detect errors such as duplicates or to compare and review these
   classes against other specifications.

   In a class definition, ranges of code points are represented by a
   hexadecimal start and end value separated by a hyphen.  The following
   declaration is equivalent to the preceding:

       <class name="abcd">0061-0064</class>

   Range and code point declarations can be freely intermixed:

       <class name="abcd">0061 0062-0063 0064</class>
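
   The three declarations above denote the same set.  The following
   Python sketch parses such class content into a set of integers; the
   function name is hypothetical, and rejecting a range whose end
   precedes its start is an assumption of the sketch.

```python
def parse_class_content(text):
    """Parse e.g. "0061 0062-0063 0064" into a set of integer code points."""
    members = set()
    for token in text.split():
        first, sep, last = token.partition("-")
        low = int(first, 16)
        high = int(last, 16) if sep else low
        if high < low:  # assumption: reversed ranges are treated as errors
            raise ValueError(f"range end precedes start: {token}")
        members.update(range(low, high + 1))
    return members

listed = parse_class_content("0061 0062 0063 0064")
ranged = parse_class_content("0061-0064")
mixed = parse_class_content("0061 0062-0063 0064")
```

   Because the result is a set, the order of the tokens does not affect
   the outcome, and duplicates collapse silently.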

   The contents of a class differ from a repertoire in that the latter
   MAY contain sequences as elements, while the former MUST NOT.
   Instead, they closely resemble character classes as found in regular
   expressions.

6.2.5.  Combined Classes

   Classes may be combined using operators for set complement, union,
   intersection, difference (elements of the first class that are not in
   the second), and symmetric difference (elements in either class but
   not both).  Because classes fundamentally function like sets, the
   union of several character classes is itself a class, for example.

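   +-------------------+----------------------------------------------+
   | Logical Operation | Example                                      |
   +-------------------+----------------------------------------------+
   | Complement        | <complement>                                 |
   |                   |    <class by-ref="xxx"/>                     |
   |                   | </complement>                                |
   +-------------------+----------------------------------------------+
   | Union             | <union>                                      |
   |                   |    <class by-ref="class-1"/>                 |
   |                   |    <class by-ref="class-2"/>                 |
   |                   |    <class by-ref="class-3"/>                 |
   |                   | </union>                                     |
   +-------------------+----------------------------------------------+
   | Intersection      | <intersection>                               |
   |                   |    <class by-ref="class-1"/>                 |
   |                   |    <class by-ref="class-2"/>                 |
   |                   | </intersection>                              |
   +-------------------+----------------------------------------------+
   | Difference        | <difference>                                 |
   |                   |    <class by-ref="class-1"/>                 |
   |                   |    <class by-ref="class-2"/>                 |
   |                   | </difference>                                |
   +-------------------+----------------------------------------------+
   | Symmetric         | <symmetric-difference>                       |
   | Difference        |    <class by-ref="class-1"/>                 |
   |                   |    <class by-ref="class-2"/>                 |
   |                   | </symmetric-difference>                      |
   +-------------------+----------------------------------------------+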

                               Set Operators

   The elements from this table may be arbitrarily nested inside each
   other, subject to the following restriction: a "complement" element
   MUST contain precisely one "class" or one of the operator elements,
   while an "intersection", "symmetric-difference", or "difference"
   element MUST contain precisely two, and a "union" element MUST
   contain two or more of these elements.
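
   The operand-count restriction and the nesting of operators can be
   sketched in Python with ordinary sets.  The "combine" function and
   its "universe" parameter are hypothetical; in particular, this
   section does not say what a complement is taken relative to, so the
   caller supplies that set explicitly.

```python
BINARY_OPERATORS = {
    "intersection": lambda a, b: a & b,
    "difference": lambda a, b: a - b,
    "symmetric-difference": lambda a, b: a ^ b,
}

def combine(operator, operands, universe=None):
    """Combine already-evaluated classes (Python sets) with one operator."""
    if operator == "complement":
        if len(operands) != 1:
            raise ValueError("complement takes precisely one operand")
        if universe is None:  # assumption: the caller defines the universe
            raise ValueError("complement needs a universe to subtract from")
        return set(universe) - operands[0]
    if operator == "union":
        if len(operands) < 2:
            raise ValueError("union takes two or more operands")
        return set().union(*operands)
    if operator in BINARY_OPERATORS:
        if len(operands) != 2:
            raise ValueError(f"{operator} takes precisely two operands")
        return BINARY_OPERATORS[operator](*operands)
    raise ValueError(f"unknown set operator: {operator}")

class_1 = {0x61, 0x62, 0x63}
class_2 = {0x62, 0x63, 0x64}
class_3 = {0x65}

# Operators nest because every result is itself a class.
nested = combine("difference", [combine("union", [class_1, class_2, class_3]),
                                combine("intersection", [class_1, class_2])])
```

   In this example, "nested" holds 0061, 0064, and 0065: the union of
   all three classes minus the intersection of the first two.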

   An anonymous combined class can be defined directly inside a rule or
   any of the match operator elements that allow child elements (see
   Section 6.3.2) by using the set combination as the outer element.

       <rule>
           <union>
               <class by-ref="xxx"/>
               <class by-ref="yyy"/>
           </union>
       </rule>

   The example shows the definition of an anonymous combined class that
   represents the union of classes "xxx" and "yyy".  There is no need to
   wrap this union inside another "class" element, and, in fact, set
   combination elements MUST NOT be nested inside a "class" element.

   Lastly, to create a named combined class that can be referenced in
   other classes or in rules as <class by-ref="xxxyyy"/>, add a "name"
   attribute to the set combination element -- for example,
   <union name="xxxyyy" /> -- and place it at the top level immediately
   below the "rules" element (see Section 6.2.1).

          <union name="xxxyyy">
              <class by-ref="xxx"/>
              <class by-ref="yyy"/>
          </union>

   Because (as for ordinary sets) a combination of classes is itself a
   class, no matter by what combinations of set operators a combined
   class is created, a reference to it always uses the "class" element
   as described in Section 6.2.1.  That is, a named class is always
   referenced via an empty "class" element using the "by-ref" attribute
   containing the name of the class to be referenced.

6.3.  Whole Label and Context Rules

   Each rule comprises a series of matching operators that must be
   satisfied in order to determine whether a label meets a given
   condition.  Rules may reference other rules or character classes
   defined elsewhere in the table.

6.3.1.  The "rule" Element

   A matching rule is defined by a "rule" element, the child elements of
   which are one of the match operators from Section 6.3.2.  In
   evaluating a rule, each child element is matched in order.  "rule"
   elements MAY be nested inside each other and inside certain match
   operators.

   A simple rule to match a label where all characters are members of
   some class called "preferred-codepoint":

       <rule name="preferred-label">
           <start />
           <class by-ref="preferred-codepoint" count="1+"/>
           <end />
       </rule>

   Rules are paired with explicit and implied actions, triggering these
   actions when a rule matches a label.  For example, a simple explicit
   action for the rule shown above would be:

       <action disp="allocatable" match="preferred-label" />

   The rule in this example would have the effect of setting the policy
   disposition for a label made up entirely of preferred code points to
   "allocatable".  Explicit actions are further discussed in Section 7
   and implicit actions in Section 7.5.  Another use of rules is in
   defining conditional contexts for code points and variants as
   discussed in Sections 5.2 and 5.3.5.

   A rule that is an immediate child element of the "rules" element MUST
   be named using a "name" attribute containing a single identifier
   string with no spaces.  A named rule may be incorporated into another
   rule by reference and may also be referenced by an "action" element,
   "when" attribute, or "not-when" attribute.  If the "name" attribute
   is omitted, the rule is anonymous and MUST be nested inside another
   rule or match operator.

6.3.2.  The Match Operators

   The child elements of a rule are a series of match operators, which
   are listed here by type and name and with a basic example or two.

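   +------------+-------------+------------------------------------+
   | Type       | Operator    | Examples                           |
   +------------+-------------+------------------------------------+
   | logical    | any         | <any />                            |
   |            +-------------+------------------------------------+
   |            | choice      | <choice>                           |
   |            |             |  <rule by-ref="alternative1"/>     |
   |            |             |  <rule by-ref="alternative2"/>     |
   |            |             | </choice>                          |
   +------------+-------------+------------------------------------+
   | positional | start       | <start />                          |
   |            +-------------+------------------------------------+
   |            | end         | <end />                            |
   +------------+-------------+------------------------------------+
   | literal    | char        | <char cp="0061 0062 0063" />       |
   +------------+-------------+------------------------------------+
   | set        | class       | <class by-ref="class1" />          |
   |            |             | <class>0061 0064-0065</class>      |
   +------------+-------------+------------------------------------+
   | group      | rule        | <rule by-ref="rule1" />            |
   |            |             | <rule><any /></rule>               |
   +------------+-------------+------------------------------------+
   | contextual | anchor      | <anchor />                         |
   |            +-------------+------------------------------------+
   |            | look-ahead  | <look-ahead><any /></look-ahead>   |
   |            +-------------+------------------------------------+
   |            | look-behind | <look-behind><any /></look-behind> |
   +------------+-------------+------------------------------------+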

                              Match Operators

   Any element defining an anonymous class can be used as a match
   operator, including any of the set combination operators (see
   Section 6.2.5) as well as references to named classes.

   All match operators shown as empty elements in the Examples column of
   the table above do not support child elements of their own;
   otherwise, match operators MAY be nested.  In particular, anonymous
   "rule" elements can be used for grouping.

6.3.3.  The "count" Attribute

   The OPTIONAL "count" attribute, when present, specifies the minimum
   required or maximum permitted number of times a match operator is
   used to match input.  If the "count" attribute is

   n    the match operator matches the input exactly n times, where n is
        1 or greater.

   n+   the match operator matches the input at least n times, where n
        is 0 or greater.

   n:m  the match operator matches the input at least n times, where n
        is 0 or greater, but matches the input up to m times in total,
        where m > n.  If m = n and n > 0, the match operator matches the
        input exactly n times.

   If there is no "count" attribute, the match operator matches the
   input exactly once.

   In matching, greedy evaluation is used in the sense defined for
   regular expressions: beyond the required number of times, the input
   is matched as many times as possible, but not so often as to prevent
   a match of the remainder of the rule.
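
   The three forms of the "count" attribute map naturally onto a
   (minimum, maximum) pair.  The Python sketch below parses them and,
   purely as an illustration, converts the pair to a regular expression
   quantifier to show greedy matching.  The function names are
   hypothetical, and accepting m = n in the "n:m" form follows the
   sentence that treats that case as an exact count.

```python
import re

_COUNT = re.compile(r"(\d+)(\+|:(\d+))?")

def parse_count(value):
    """Return (minimum, maximum); a maximum of None means unbounded."""
    if value is None:
        return (1, 1)            # no attribute: match exactly once
    m = _COUNT.fullmatch(value)
    if m is None:
        raise ValueError(f"malformed count: {value!r}")
    n = int(m.group(1))
    if m.group(2) is None:       # form "n"
        if n < 1:
            raise ValueError("an exact count must be 1 or greater")
        return (n, n)
    if m.group(2) == "+":        # form "n+"
        return (n, None)
    upper = int(m.group(3))      # form "n:m"
    if upper < n or upper == 0:
        raise ValueError(f"invalid upper bound in count: {value!r}")
    return (n, upper)

def to_regex_quantifier(value):
    """Render a count as a greedy regular expression quantifier."""
    low, high = parse_count(value)
    return "{%d,%s}" % (low, "" if high is None else high)

# Greedy, but not so greedy as to prevent the trailing "a" from matching.
pattern = "[a-z]" + to_regex_quantifier("1+") + "a"
greedy_match = re.fullmatch(pattern, "banana")
```

   The quantifier first consumes all of "banana" and then gives back
   one code point so that the final literal can match, which is the
   behavior described above.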

   A "count" attribute MUST NOT be applied to any element that contains
   a "name" attribute but MAY be applied to operators such as "class"
   that declare anonymous classes (including combined classes) or invoke
   any predefined classes by reference.  The "count" attribute MUST NOT
   be applied to any "class" element, or element defining a combined
   class, when it is nested inside a combined class.

   A "count" attribute MUST NOT be applied to match operators of type
   "start", "end", "anchor", "look-ahead", or "look-behind" or to any
   operators, such as "rule" or "choice", that contain a nested instance
   of them.  This limitation applies recursively and irrespective of
   whether a "rule" element containing these nested instances is
   declared in place or used by reference.

   However, the "count" attribute MAY be applied to any other instances
   of either an anonymous "rule" element or a "choice" element,
   including those instances nested inside other match operators.  It
   MAY also be applied to the elements "any" and "char", when used as
   match operators.

6.3.4.  The "name" and "by-ref" Attributes

   Like classes (see Section 6.2.1), rules declared as immediate child
   elements of the "rules" element MUST be named using a unique "name"
   attribute, and all other instances MUST NOT be named.  Anonymous
   rules and classes can be nested directly inside other match
   operators, while named rules and classes are nested by reference.

   To reference a named rule or class inside a rule or match operator,
   use a "rule" or "class" element with an OPTIONAL "by-ref" attribute
   containing the name of the referenced element.  It is an error to
   reference a rule or class for which the complete definition has not
   been seen.  In other words, it is explicitly not possible to define
   recursive rules or class definitions.  The "by-ref" attribute
   MUST NOT appear in the same element as the "name" attribute or in an
   element that has any child elements.

   The example shows several named classes and a named rule referencing
   some of them by name.

       <class name="letter" property="gc:L"/>
       <class name="combining-mark" property="gc:M"/>
       <class name="digit" property="gc:Nd" />
       <rule name="letter-grapheme">
          <class by-ref="letter" count="1+"/>
          <class by-ref="combining-mark" count="0+"/>
       </rule>

6.3.5.  The "choice" Element

   The "choice" element is used to represent a list of two or more
   alternatives:

       <rule name="ldh">
          <choice count="1+">
              <class by-ref="letter"/>
              <class by-ref="digit"/>
              <char cp="002D" comment="literal HYPHEN"/>
          </choice>
       </rule>

   Each child element of a "choice" element represents one alternative.
   The first matching alternative determines the match for the
   "choice" element.  To express a choice where an alternative itself
   consists of a sequence of elements, the sequence must be wrapped in
   an anonymous rule.

6.3.6.  Literal Code Point Sequences

   A literal code point sequence matches a single code point or a
   sequence.  It is defined by a "char" element, with the code point or
   sequence to be matched given by the "cp" attribute.  When used as a
   literal, a "char" element MAY contain a "count" attribute in addition
   to the "cp" attribute and OPTIONAL "comment" or "ref" attributes.  No
   other attributes or child elements are permitted.

6.3.7.  The "any" Element

   The "any" element is an empty element that matches any single code
   point.  It MAY have a "count" attribute.  For an example, see
   Section 6.3.9.

   Unlike a literal, the "any" element MUST NOT have a "ref" attribute.

6.3.8.  The "start" and "end" Elements

   To match the beginning or end of a label, use the "start" or "end"
   element.  An empty label would match this rule:

       <rule name="empty-label">
           <start />
           <end />
       </rule>

   Conceptually, whole label rules evaluate the label as a whole, but in
   practice, many rules do not actually need to be specified to match
   the entire label.  For example, to express a requirement of not
   starting a label with a digit, a rule needs to describe only the
   initial part of a label.

   This example uses the previously defined rules, together with "start"
   and "end" elements, to define a rule that requires that an entire
   label be well-formed.  For this example, that means that it must
   start with a letter and that it contains no leading digits or
   combining marks nor combining marks placed on digits.

       <rule name="leading-letter" >
         <start />
         <rule by-ref="letter-grapheme" count="1"/>
         <choice count="0+">
           <rule by-ref="letter-grapheme" count="0+"/>
           <class by-ref="digit" count="0+"/>
         </choice>
         <end />
       </rule>

   Each "start" or "end" element occurs at most once in a rule, except
   if nested inside a "choice" element in such a way that in matching
   each alternative at most one occurrence of each is encountered.
   Otherwise, the result is an error, as is any case where a "start" or
   "end" element is not encountered as the first or last element to be
   matched, respectively, in matching a rule.  "start" and "end"
   elements are empty elements that do not have a "count" attribute or
   any other attribute other than "comment".  It is an error for any
   match operator enclosing a nested "start" or "end" element to have a
   "count" attribute.

6.3.9.  Example Context Rule from IDNA Specification

   This is an example of the WLE rule from [RFC5892] forbidding the
   mixture of the Arabic-Indic and extended Arabic-Indic digits in the
   same label.  It is implemented as a whole label rule associated with
   the code point ranges using the "not-when" attribute, which defines
   an impermissible context.  The example also demonstrates several
   instances of the use of anonymous rules for grouping.

          <range first-cp="0660" last-cp="0669" not-when="mixed-digits"
                 tag="arabic-indic-digits" />
          <range first-cp="06F0" last-cp="06F9" not-when="mixed-digits"
                 tag="extended-arabic-indic-digits" />
          <rule name="mixed-digits">
             <choice>
                <rule>
                   <class from-tag="arabic-indic-digits"/>
                   <any count="0+"/>
                   <class from-tag="extended-arabic-indic-digits"/>
                </rule>
                <rule>
                   <class from-tag="extended-arabic-indic-digits"/>
                   <any count="0+"/>
                   <class from-tag="arabic-indic-digits"/>
                </rule>
             </choice>
          </rule>

   As specified in the example, a label containing a code point from
   either of the two digit ranges is invalid if the label matches the
   "mixed-digits" rule, that is, any time that a code point from the
   other range is also present.  Note that invalidating the label is not
   the same as invalidating the definition of the "range" elements; in
   particular, the definition of the tag values does not depend on the
   "not-when" attribute.

6.4.  Parameterized Context or When Rules

   To recap: When a rule is intended to provide a context for evaluating
   the validity of a code point or variant mapping, it is invoked by the
   "when" or "not-when" attributes described in Section 5.2.  For "char"
   and "range" elements, an action implied by a context rule always has
   a disposition of "invalid" whenever the rule given by the "when"
   attribute is not matched (see Section 7.5).  Conversely, a "not-when"
   attribute results in a disposition of "invalid" whenever the rule is
   matched.  When a rule is used in this way, it is called a context or
   "when" rule.

   The example in the previous section shows a whole label rule used as
   a context rule, essentially making the whole label the context.  The
   next sections describe several match operators that can be used to
   provide a more specific specification of a context, allowing a
   parameterized context rule.  See Section 7 for an alternative method
   of defining an invalid disposition for a label not matching a whole
   label rule.
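
   The relationship between a context rule and the implied "invalid"
   disposition can be sketched in Python.  The whole label rule below
   is a simplified stand-in for the "mixed-digits" rule of the previous
   section (digits from both ranges occur in the label, in either
   order), and both function names are hypothetical.

```python
ARABIC_INDIC = range(0x0660, 0x066A)            # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)   # U+06F0..U+06F9

def mixed_digits(label):
    """Whole label rule: digits from both ranges occur, in either order."""
    cps = [ord(ch) for ch in label]
    return (any(cp in ARABIC_INDIC for cp in cps)
            and any(cp in EXTENDED_ARABIC_INDIC for cp in cps))

def implied_invalid(rule_matched, attribute):
    """True if the context rule outcome implies an "invalid" disposition."""
    if attribute == "when":
        return not rule_matched      # required context is missing
    if attribute == "not-when":
        return rule_matched          # forbidden context is present
    raise ValueError(f"unexpected attribute: {attribute}")

mixed = implied_invalid(mixed_digits("\u0661\u06F2"), "not-when")
uniform = implied_invalid(mixed_digits("\u0661\u0662"), "not-when")
```

   Here "mixed" is True (the label is invalid) and "uniform" is False,
   because only one of the two digit ranges occurs in the second label.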

6.4.1.  The "anchor" Element

   Such parameterized context rules are rules that contain a special
   placeholder represented by an "anchor" element.  As each When Rule is
   evaluated, if an "anchor" element is present, it is replaced by a
   literal corresponding to the "cp" attribute of the element containing
   the "when" (or "not-when") attribute.  The match to the "anchor"
   element must be at the same position in the label as the code point
   or variant mapping triggering the When Rule.

   For example, the Greek lower numeral sign is invalid if not
   immediately preceding a character in the Greek script.  This is most
   naturally addressed with a parameterized When Rule using
   "look-ahead":

       <char cp="0375" when="preceding-greek"/>
       <class name="greek-script" property="sc:Grek"/>
       <rule name="preceding-greek">
           <anchor />
           <look-ahead>
               <class by-ref="greek-script"/>
           </look-ahead>
       </rule>

   In evaluating this rule, the "anchor" element is treated as if it was
   replaced by a literal

       <char cp="0375"/>

   but only the instance of U+0375 at the given position is evaluated.
   If a label had two instances of U+0375 with the first one matching
   the rule and the second not, then evaluating the When Rule MUST
   succeed for the first instance and fail for the second.
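
   Position-specific evaluation can be sketched as follows, assuming
   the rule amounts to an anchor followed by a look-ahead for a single
   Greek-script code point.  The "is_greek" helper is a crude stand-in
   for the Script property that covers only the basic Greek letters,
   and all names are hypothetical.

```python
GREEK_LOWER_NUMERAL_SIGN = 0x0375

def is_greek(cp):
    # Crude stand-in for sc:Grek: basic capital and small letters only.
    return 0x0391 <= cp <= 0x03A9 or 0x03B1 <= cp <= 0x03C9

def when_preceding_greek(cps, pos):
    """Evaluate the context for the code point at index `pos` only."""
    if cps[pos] != GREEK_LOWER_NUMERAL_SIGN:
        raise ValueError("anchor position does not hold U+0375")
    # Look-ahead: the next code point must exist and be Greek.
    return pos + 1 < len(cps) and is_greek(cps[pos + 1])

# U+0375, GREEK SMALL LETTER ALPHA, U+0375, LATIN SMALL LETTER A
label = [0x0375, 0x03B1, 0x0375, 0x0061]
results = [when_preceding_greek(label, i)
           for i, cp in enumerate(label) if cp == GREEK_LOWER_NUMERAL_SIGN]
```

   The rule succeeds for the first instance of U+0375 and fails for the
   second, so "results" is [True, False].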

   Unlike other rules, rules containing an "anchor" element MUST only be
   invoked via the "when" or "not-when" attributes on code points or
   variants; otherwise, their "anchor" elements cannot be evaluated.
   However, it is possible to invoke rules not containing an "anchor"
   element from a "when" or "not-when" attribute.  (See Section 6.4.3.)

   The "anchor" element is an empty element, with no attributes
   permitted except "comment".

6.4.2.  The "look-behind" and "look-ahead" Elements

   Context rules use the "look-behind" and "look-ahead" elements to
   define context before and after the code point sequence matched by
   the "anchor" element.  If the "anchor" element is omitted, neither
   the "look-behind" nor the "look-ahead" element may be present in
   a rule.

   Here is an example of a rule that defines an "initial" context for an
   Arabic code point:

       <class name="transparent" property="jt:T"/>
       <class name="right-joining" property="jt:R"/>
       <class name="left-joining" property="jt:L"/>
       <class name="dual-joining" property="jt:D"/>
       <class name="non-joining" property="jt:U"/>
       <rule name="Arabic-initial">
         <look-behind>
           <choice>
             <start />
             <rule>
               <class by-ref="transparent" count="0+"/>
               <class by-ref="non-joining"/>
             </rule>
           </choice>
         </look-behind>
         <anchor />
         <look-ahead>
           <class by-ref="transparent" count="0+" />
           <choice>
             <class by-ref="right-joining" />
             <class by-ref="dual-joining" />
           </choice>
         </look-ahead>
       </rule>

   A "when" rule (or context rule) is a named rule that contains any
   combination of "look-behind", "anchor", and "look-ahead" elements, in
   that order.  Each of these elements occurs at most once, except if
   nested inside a "choice" element in such a way that in matching each
   alternative at most one occurrence of each is encountered.
   Otherwise, the result is undefined.  None of these elements takes a
   "count" attribute, nor does any enclosing match operator; otherwise,
   the result is undefined.  If a context rule contains a "look-ahead"
   or "look-behind" element, it MUST contain an "anchor" element.  If,
   because of a "choice" element, a required anchor is not actually
   encountered, the results are undefined.

6.4.3.  Omitting the "anchor" Element

   If the "anchor" element is omitted, the evaluation of the context
   rule is not tied to the position of the code point or sequence
   associated with the "when" attribute.

   According to [RFC5892], the Katakana middle dot is invalid in any
   label not containing at least one Japanese character anywhere in the
   label.  Because this requirement is independent of the position of
   the middle dot, the rule does not require an "anchor" element.

       <char cp="30FB" when="japanese-in-label"/>
       <rule name="japanese-in-label">
           <union>
               <class property="sc:Hani"/>
               <class property="sc:Kata"/>
               <class property="sc:Hira"/>
           </union>
       </rule>

   The Katakana middle dot is used only with Han, Katakana, or Hiragana.
   The corresponding When Rule requires that at least one code point in
   the label be in one of these scripts, but the position of that code
   point is independent of the location of the middle dot; therefore, no
   anchor is required.  (Note that the Katakana middle dot itself is of
   script Common, that is, "sc:Zyyy".)
