RFC 7564

PRECIS Framework: Preparation, Enforcement, and Comparison of Internationalized Strings in Application Protocols

Pages: 40
Obsoletes: 3454
Obsoleted by: 8264

6.  Applications

6.1.  How to Use PRECIS in Applications

   Although PRECIS has been designed with applications in mind,
   internationalization is not suddenly made easy through the use of
   PRECIS.  Application developers still need to give some thought to
   how they will use the PRECIS string classes, or profiles thereof, in
   their applications.  This section provides some guidelines to
   application developers (and to expert reviewers of application
   protocol specifications).

   o  Don't define your own profile unless absolutely necessary (see
      Section 5.1).  Existing profiles have been designed for wide
      reuse.  It is highly likely that an existing profile will meet
      your needs, especially given the ability to specify further
      excluded characters (Section 6.2) and to build application-layer
      constructs (see Section 6.3).

   o  Do specify:

      *  Exactly which entities are responsible for preparation,
         enforcement, and comparison of internationalized strings (e.g.,
         servers or clients).

      *  Exactly when those entities need to complete their tasks (e.g.,
         a server might need to enforce the rules of a profile before
         allowing a client to gain network access).

      *  Exactly which protocol slots need to be checked against which
         profiles (e.g., checking the address of a message's intended
         recipient against the UsernameCaseMapped profile
         [PRECIS-Users-Pwds] of the IdentifierClass, or checking the
         password of a user against the OpaqueString profile
         [PRECIS-Users-Pwds] of the FreeformClass).

      See [PRECIS-Users-Pwds] and [XMPP-Addr-Format] for definitions of
      these matters for several applications.

6.2.  Further Excluded Characters

   An application protocol that uses a profile MAY specify particular
   code points that are not allowed in relevant slots within that
   application protocol, above and beyond those excluded by the string
   class or profile.

   That is, an application protocol MAY do either of the following:

   1.  Exclude specific code points that are allowed by the relevant
       string class.

   2.  Exclude characters matching certain Unicode properties (e.g.,
       math symbols) that are included in the relevant PRECIS string
       class.

   As a result of such exclusions, code points that are defined as valid
   for the PRECIS string class or profile will be defined as disallowed
   for the relevant protocol slot.

   Typically, such exclusions are defined for the purpose of backward
   compatibility with legacy formats within an application protocol.
   These are defined for application protocols, not profiles, in order
   to prevent multiplication of profiles beyond necessity (see
   Section 5.1).
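
   As a rough illustration of the two kinds of exclusion listed above,
   the following Python sketch checks a string that is assumed to have
   already passed enforcement against its profile.  The function name
   find_excluded, its parameters, and the example exclusions are
   hypothetical; they are not defined by this document or by any
   registered profile.

```python
import unicodedata


def find_excluded(s, excluded_code_points=frozenset(),
                  excluded_categories=frozenset()):
    """Return the characters of s that the application protocol excludes.

    Assumes s was already enforced against the relevant PRECIS profile;
    this check only layers protocol-specific exclusions on top of it.
    """
    offending = []
    for ch in s:
        if ord(ch) in excluded_code_points:
            offending.append(ch)   # kind 1: a specific code point
        elif unicodedata.category(ch) in excluded_categories:
            offending.append(ch)   # kind 2: a Unicode property match
    return offending


# Hypothetical legacy slot that forbids '&' and '"' (specific code points).
LEGACY_EXCLUDED = frozenset({0x0026, 0x0022})

print(find_excluded("a&b", LEGACY_EXCLUDED))
# Excluding all math symbols (General_Category = Sm) catches '+'.
print(find_excluded("a+b", excluded_categories={"Sm"}))
```

   Keeping the exclusions as data supplied by the application protocol,
   rather than baking them into the profile logic, mirrors the point
   that such exclusions belong to the protocol and not to the profile.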

6.3.  Building Application-Layer Constructs

   Sometimes, an application-layer construct does not map in a
   straightforward manner to one of the base string classes or a profile
   thereof.  Consider, for example, the "simple user name" construct in
   the Simple Authentication and Security Layer (SASL) [RFC4422].
   Depending on the deployment, a simple user name might take the form
   of a user's full name (e.g., the user's personal name followed by a
   space and then the user's family name).  Such a simple user name
   cannot be defined as an instance of the IdentifierClass or a profile
   thereof, since space characters are not allowed in the
   IdentifierClass; however, it could be defined using a space-separated
   sequence of IdentifierClass instances, as in the following ABNF
   [RFC5234] from [PRECIS-Users-Pwds]:

      username   = userpart *(1*SP userpart)
      userpart   = 1*(idbyte)
                   ;
                   ; an "idbyte" is a byte used to represent a
                   ; UTF-8 encoded Unicode code point that can be
                   ; contained in a string that conforms to the
                   ; PRECIS "IdentifierClass"
                   ;

   Similar techniques could be used to define many application-layer
   constructs, say of the form "user@domain" or "/path/to/file".
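
   The ABNF above can be approximated in code as a split on runs of
   U+0020 followed by a per-part check.  The sketch below works on
   decoded strings rather than on UTF-8 bytes, and the predicate
   is_valid_userpart is a caller-supplied stand-in for "conforms to
   the IdentifierClass"; both choices, and the name split_username,
   are assumptions made for illustration.

```python
import re

# username = userpart *(1*SP userpart); SP is U+0020 only.
_USERNAME_SHAPE = re.compile(r"[^ ]+(?: +[^ ]+)*")


def split_username(username, is_valid_userpart):
    """Split a simple user name into userparts, or raise ValueError.

    is_valid_userpart is a hypothetical predicate standing in for a
    real IdentifierClass (or profile) check on a single userpart.
    """
    if not _USERNAME_SHAPE.fullmatch(username):
        raise ValueError("empty name, or leading/trailing space")
    parts = re.split(r" +", username)
    for part in parts:
        if not is_valid_userpart(part):
            raise ValueError(f"userpart not allowed: {part!r}")
    return parts


# Toy predicate for demonstration only; it is NOT the IdentifierClass.
print(split_username("Juliet Capulet", str.isalnum))
```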

7.  Order of Operations

   To ensure proper comparison, the rules specified for a particular
   string class or profile MUST be applied in the following order:

   1.  Width Mapping Rule

   2.  Additional Mapping Rule

   3.  Case Mapping Rule

   4.  Normalization Rule

   5.  Directionality Rule

   6.  Behavioral rules for determining whether a code point is valid,
       allowed under a contextual rule, disallowed, or unassigned

   As already described, the width mapping, additional mapping, case
   mapping, normalization, and directionality rules are specified for
   each profile, whereas the behavioral rules are specified for each
   string class.  Some of the logic behind this order is provided under
   Section 5.2.1 (see also the PRECIS mappings document
   [PRECIS-Mappings]).
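
   One way to keep this order fixed in an implementation is to apply
   the rules from a single function, as in the following Python sketch.
   The Profile container, the enforce and map_width functions, and the
   example rule choices (case folding via str.casefold, NFC
   normalization, a behavioral check that merely rejects spaces) are
   all hypothetical and chosen only to make the sketch runnable.

```python
import unicodedata
from dataclasses import dataclass
from typing import Callable


def _identity(s: str) -> str:
    return s


def _accept(s: str) -> bool:
    return True


@dataclass
class Profile:
    # Hypothetical container: one field per rule, in the required order.
    width_mapping: Callable[[str], str] = _identity
    additional_mapping: Callable[[str], str] = _identity
    case_mapping: Callable[[str], str] = _identity
    normalization: Callable[[str], str] = _identity
    directionality: Callable[[str], bool] = _accept
    behavioral: Callable[[str], bool] = _accept


def map_width(s: str) -> str:
    """Map fullwidth and halfwidth characters to their decomposition mappings."""
    out = []
    for ch in s:
        decomp = unicodedata.decomposition(ch)
        if decomp.startswith(("<wide>", "<narrow>")):
            out.append("".join(chr(int(h, 16)) for h in decomp.split()[1:]))
        else:
            out.append(ch)
    return "".join(out)


def enforce(s: str, profile: Profile) -> str:
    # Steps 1-4 transform the string, strictly in this order.
    for rule in (profile.width_mapping, profile.additional_mapping,
                 profile.case_mapping, profile.normalization):
        s = rule(s)
    # Steps 5-6 check the mapped string and reject it on failure.
    if not profile.directionality(s):
        raise ValueError("directionality rule violated")
    if not profile.behavioral(s):
        raise ValueError("string contains a code point that is not allowed")
    return s


example = Profile(
    width_mapping=map_width,
    case_mapping=str.casefold,
    normalization=lambda s: unicodedata.normalize("NFC", s),
    behavioral=lambda s: " " not in s,   # toy stand-in for the class rules
)
print(enforce("\uFF21\uFF22c", example))   # fullwidth "AB" followed by "c"
```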

8.  Code Point Properties

   In order to implement the string classes described above, this
   document does the following:

   1.  Reviews and classifies the collections of code points in the
       Unicode character set by examining various code point properties.

   2.  Defines an algorithm for determining a derived property value,
       which can vary depending on the string class being used by the
       relevant application protocol.

   This document is not intended to specify precisely how derived
   property values are to be applied in protocol strings.  That
   information is the responsibility of the protocol specification that
   uses or profiles a PRECIS string class from this document.  The value
   of the property is to be interpreted as follows.

   PROTOCOL VALID  Those code points that are allowed to be used in any
      PRECIS string class (currently, IdentifierClass and
      FreeformClass).  The abbreviated term "PVALID" is used to refer to
      this value in the remainder of this document.

   SPECIFIC CLASS PROTOCOL VALID  Those code points that are allowed to
      be used in specific string classes.  In the remainder of this
      document, the abbreviated term *_PVAL is used, where * = (ID |
      FREE), i.e., either "FREE_PVAL" or "ID_PVAL".  In practice, the
      derived property ID_PVAL is not used in this specification, since
      every ID_PVAL code point is PVALID.

   CONTEXTUAL RULE REQUIRED  Some characteristics of the character, such
      as its being invisible in certain contexts or problematic in
      others, require that it not be used in labels unless specific
      other characters or properties are present.  As in IDNA2008, there
      are two subdivisions of CONTEXTUAL RULE REQUIRED -- the first for
      Join_controls (called "CONTEXTJ") and the second for other
      characters (called "CONTEXTO").  A character with the derived
      property value CONTEXTJ or CONTEXTO MUST NOT be used unless an
      appropriate rule has been established and the context of the
      character is consistent with that rule.  The most notable of the
      CONTEXTUAL RULE REQUIRED characters are the Join Control
      characters U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH
      NON-JOINER, which have a derived property value of CONTEXTJ.  See
      Appendix A of [RFC5892] for more information.

   DISALLOWED  Those code points that are not permitted in any PRECIS
      string class.

   SPECIFIC CLASS DISALLOWED  Those code points that are not to be
      included in one of the string classes but that might be permitted
      in others.  In the remainder of this document, the abbreviated
      term *_DIS is used, where * = (ID | FREE), i.e., either "FREE_DIS"
      or "ID_DIS".  In practice, the derived property FREE_DIS is not
      used in this specification, since every FREE_DIS code point is
      DISALLOWED.

   UNASSIGNED  Those code points that are not designated (i.e., are
      unassigned) in the Unicode Standard.

   The algorithm to calculate the value of the derived property is as
   follows (implementations MUST NOT modify the order of operations
   within this algorithm, since doing so would cause inconsistent
   results across implementations):

   If .cp. .in. Exceptions Then Exceptions(cp);
   Else If .cp. .in. BackwardCompatible Then BackwardCompatible(cp);
   Else If .cp. .in. Unassigned Then UNASSIGNED;
   Else If .cp. .in. ASCII7 Then PVALID;
   Else If .cp. .in. JoinControl Then CONTEXTJ;
   Else If .cp. .in. OldHangulJamo Then DISALLOWED;
   Else If .cp. .in. PrecisIgnorableProperties Then DISALLOWED;
   Else If .cp. .in. Controls Then DISALLOWED;
   Else If .cp. .in. HasCompat Then ID_DIS or FREE_PVAL;
   Else If .cp. .in. LetterDigits Then PVALID;
   Else If .cp. .in. OtherLetterDigits Then ID_DIS or FREE_PVAL;
   Else If .cp. .in. Spaces Then ID_DIS or FREE_PVAL;
   Else If .cp. .in. Symbols Then ID_DIS or FREE_PVAL;
   Else If .cp. .in. Punctuation Then ID_DIS or FREE_PVAL;
   Else DISALLOWED;

   The value of the derived property calculated can depend on the string
   class; for example, if an identifier used in an application protocol
   is defined as profiling the PRECIS IdentifierClass then a space
   character such as U+0020 would be assigned to ID_DIS, whereas if an
   identifier is defined as profiling the PRECIS FreeformClass then the
   character would be assigned to FREE_PVAL.  For the sake of brevity,
   the designation "FREE_PVAL" is used herein, instead of the longer
   designation "ID_DIS or FREE_PVAL".  In practice, the derived
   properties ID_PVAL and FREE_DIS are not used in this specification,
   since every ID_PVAL code point is PVALID and every FREE_DIS code
   point is DISALLOWED.

   Use of the name of a rule (such as "Exceptions") implies the set of
   code points that the rule defines, whereas the same name as a
   function call (such as "Exceptions(cp)") implies the value that the
   code point has in the Exceptions table.
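
   The algorithm above translates fairly directly into an ordered chain
   of tests.  The following Python sketch is a non-authoritative
   approximation: it relies on the Unicode version shipped with the
   Python interpreter, takes the Exceptions and BackwardCompatible
   tables as caller-supplied dictionaries (empty by default), treats
   Control(cp) as General_Category = Cc, writes OldHangulJamo out as
   code point ranges, and uses a deliberately incomplete excerpt of
   Default_Ignorable_Code_Point because Python's unicodedata module
   does not expose that property.  All names in the sketch are
   hypothetical.

```python
import unicodedata

# General_Category sets for categories A (RFC 5892, Section 2.1), R, O, P.
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}
OTHER_LETTER_DIGITS = {"Lt", "Nl", "No", "Me"}
SYMBOLS = {"Sm", "Sc", "Sk", "So"}
PUNCTUATION = {"Pc", "Pd", "Ps", "Pe", "Pi", "Pf", "Po"}
# Assumption: Hangul_Syllable_Type in {L, V, T}, written out as ranges.
OLD_HANGUL_JAMO = ((0x1100, 0x11FF), (0xA960, 0xA97C),
                   (0xD7B0, 0xD7C6), (0xD7CB, 0xD7FB))
# Assumption: an INCOMPLETE excerpt of Default_Ignorable_Code_Point.
DEFAULT_IGNORABLE_EXCERPT = frozenset({0x00AD, 0x034F, 0x200B, 0x2060, 0xFEFF})


def is_noncharacter(cp):
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def derived_property(cp, string_class, exceptions=None,
                     backward_compatible=None):
    """Derived property of code point cp for string_class "ID" or "FREE"."""
    if string_class not in ("ID", "FREE"):
        raise ValueError("string_class must be 'ID' or 'FREE'")
    exceptions = exceptions or {}
    backward_compatible = backward_compatible or {}
    ch = chr(cp)
    gc = unicodedata.category(ch)
    per_class = "ID_DIS" if string_class == "ID" else "FREE_PVAL"

    if cp in exceptions:                                    # Exceptions (F)
        return exceptions[cp]
    if cp in backward_compatible:                           # (G)
        return backward_compatible[cp]
    if gc == "Cn" and not is_noncharacter(cp):              # Unassigned (J)
        return "UNASSIGNED"
    if 0x0021 <= cp <= 0x007E:                              # ASCII7 (K)
        return "PVALID"
    if cp in (0x200C, 0x200D):                              # JoinControl (H)
        return "CONTEXTJ"
    if any(lo <= cp <= hi for lo, hi in OLD_HANGUL_JAMO):   # (I)
        return "DISALLOWED"
    if cp in DEFAULT_IGNORABLE_EXCERPT or is_noncharacter(cp):   # (M)
        return "DISALLOWED"
    if gc == "Cc":                                          # Controls (L)
        return "DISALLOWED"
    if unicodedata.normalize("NFKC", ch) != ch:             # HasCompat (Q)
        return per_class
    if gc in LETTER_DIGITS:                                 # LetterDigits (A)
        return "PVALID"
    # R, N, O, and P all yield the same value, so they share one branch.
    if (gc in OTHER_LETTER_DIGITS or gc == "Zs"
            or gc in SYMBOLS or gc in PUNCTUATION):
        return per_class
    return "DISALLOWED"


print(derived_property(0x0020, "ID"), derived_property(0x0020, "FREE"))
```

   With empty Exceptions and BackwardCompatible tables the results are
   not authoritative for code points listed in those tables; a real
   implementation would load them from [RFC5892] and its registry.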

   The mechanisms described here allow determination of the value of the
   property for future versions of Unicode (including characters added
   after Unicode 5.2 or 7.0 depending on the category, since some
   categories mentioned in this document are simply pointers to IDNA2008
   and therefore were defined at the time of Unicode 5.2).  Changes in
   Unicode properties that do not affect the outcome of this process
   therefore do not affect this framework.  For example, a character can
   have its Unicode General_Category value (at the time of this writing,
   see Chapter 4 of [Unicode7.0]) change from So to Sm, or from Lo to
   Ll, without affecting the algorithm results.  Moreover, even if such
   changes were to result, the BackwardCompatible list (Section 9.7) can
   be adjusted to ensure the stability of the results.

9.  Category Definitions Used to Calculate Derived Property

   The derived property obtains its value based on a two-step procedure:

   1.  Characters are placed in one or more character categories either
       (1) based on core properties defined by the Unicode Standard or
       (2) by treating the code point as an exception and addressing the
       code point based on its code point value.  These categories are
       not mutually exclusive.

   2.  Set operations are used with these categories to determine the
       values for a property specific to a given string class.  These
       operations are specified under Section 8.

      Note: Unicode property names and property value names might have
      short abbreviations, such as "gc" for the General_Category
      property and "Ll" for the Lowercase_Letter property value of the
      gc property.

   In the following specification of character categories, the operation
   that returns the value of a particular Unicode character property for
   a code point is designated by using the formal name of that property
   (from the Unicode PropertyAliases.txt file [PropertyAliases]) followed
   by "(cp)" for "code point".  For example, the value of the
   General_Category property for a code point is indicated by
   General_Category(cp).

   The first ten categories (A-J) shown below were previously defined
   for IDNA2008 and are referenced from [RFC5892] to ease the
   understanding of how PRECIS handles various characters.  Some of
   these categories are reused in PRECIS, and some of them are not;
   however, the lettering of categories is retained to prevent overlap
   and to ease implementation of both IDNA2008 and PRECIS in a single
   software application.  The next eight categories (K-R) are specific
   to PRECIS.

9.1.  LetterDigits (A)

   This category is defined in Section 2.1 of [RFC5892] and is included
   by reference for use in PRECIS.

9.2.  Unstable (B)

   This category is defined in Section 2.2 of [RFC5892].  However, it is
   not used in PRECIS.

9.3.  IgnorableProperties (C)

   This category is defined in Section 2.3 of [RFC5892].  However, it is
   not used in PRECIS.

   Note: See the PrecisIgnorableProperties ("M") category below for a
   more inclusive category used in PRECIS identifiers.

9.4.  IgnorableBlocks (D)

   This category is defined in Section 2.4 of [RFC5892].  However, it is
   not used in PRECIS.

9.5.  LDH (E)

   This category is defined in Section 2.5 of [RFC5892].  However, it is
   not used in PRECIS.

   Note: See the ASCII7 ("K") category below for a more inclusive
   category used in PRECIS identifiers.

9.6.  Exceptions (F)

   This category is defined in Section 2.6 of [RFC5892] and is included
   by reference for use in PRECIS.

9.7.  BackwardCompatible (G)

   This category is defined in Section 2.7 of [RFC5892] and is included
   by reference for use in PRECIS.

   Note: Management of this category is handled via the processes
   specified in [RFC5892].  At the time of this writing (and also at the
   time that RFC 5892 was published), this category consisted of the
   empty set; however, that is subject to change as described in
   RFC 5892.

9.8.  JoinControl (H)

   This category is defined in Section 2.8 of [RFC5892] and is included
   by reference for use in PRECIS.

9.9.  OldHangulJamo (I)

   This category is defined in Section 2.9 of [RFC5892] and is included
   by reference for use in PRECIS.

9.10.  Unassigned (J)

   This category is defined in Section 2.10 of [RFC5892] and is included
   by reference for use in PRECIS.

9.11.  ASCII7 (K)

   This PRECIS-specific category consists of all printable, non-space
   characters from the 7-bit ASCII range.  By applying this category,
   the algorithm specified under Section 8 exempts these characters from
   other rules that might be applied during PRECIS processing, on the
   assumption that these code points are in such wide use that
   disallowing them would be counter-productive.

   K: cp is in {0021..007E}

9.12.  Controls (L)

   This PRECIS-specific category consists of all control characters.

   L: Control(cp) = True

9.13.  PrecisIgnorableProperties (M)

   This PRECIS-specific category is used to group code points that are
   discouraged from use in PRECIS string classes.

   M: Default_Ignorable_Code_Point(cp) = True or
      Noncharacter_Code_Point(cp) = True

   The definition for Default_Ignorable_Code_Point can be found in the
   DerivedCoreProperties.txt file [DerivedCoreProperties].
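
   Because some runtime libraries do not expose
   Default_Ignorable_Code_Point directly, an implementation might build
   category M from the data file itself.  The Python sketch below
   parses lines in the usual Unicode Character Database layout
   ("range ; property # comment") and combines the result with the
   fixed set of noncharacters.  The function names and the three-line
   sample are illustrative assumptions; a real implementation would
   read the complete file for its target Unicode version.

```python
def parse_property_ranges(lines, property_name):
    """Collect (first, last) code point ranges for one property."""
    ranges = []
    for line in lines:
        data = line.split("#", 1)[0].strip()   # drop trailing comment
        if not data:
            continue
        fields = [f.strip() for f in data.split(";")]
        if len(fields) < 2 or fields[1] != property_name:
            continue
        first, _, last = fields[0].partition("..")
        ranges.append((int(first, 16), int(last or first, 16)))
    return ranges


def is_noncharacter(cp):
    # U+FDD0..U+FDEF plus the last two code points of every plane.
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def make_precis_ignorable(dicp_ranges):
    """Return a predicate for category M given parsed DICP ranges."""
    def in_category_m(cp):
        return is_noncharacter(cp) or any(
            lo <= cp <= hi for lo, hi in dicp_ranges)
    return in_category_m


# Tiny sample in the data file's layout, for demonstration only.
SAMPLE = """\
0041..005A    ; Alphabetic # L&  [26] LATIN CAPITAL LETTER A..Z
00AD          ; Default_Ignorable_Code_Point # Cf       SOFT HYPHEN
200B..200F    ; Default_Ignorable_Code_Point # Cf   [5] ZERO WIDTH SPACE..RIGHT-TO-LEFT MARK
""".splitlines()

ranges = parse_property_ranges(SAMPLE, "Default_Ignorable_Code_Point")
in_category_m = make_precis_ignorable(ranges)
print(ranges, in_category_m(0x200B), in_category_m(0x0041))
```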

9.14.  Spaces (N)

   This PRECIS-specific category is used to group code points that are
   space characters.

   N: General_Category(cp) is in {Zs}

9.15.  Symbols (O)

   This PRECIS-specific category is used to group code points that are
   symbols.

   O: General_Category(cp) is in {Sm, Sc, Sk, So}

9.16.  Punctuation (P)

   This PRECIS-specific category is used to group code points that are
   punctuation characters.

   P: General_Category(cp) is in {Pc, Pd, Ps, Pe, Pi, Pf, Po}

9.17.  HasCompat (Q)

   This PRECIS-specific category is used to group code points that have
   compatibility equivalents as explained in the Unicode Standard (at
   the time of this writing, see Chapters 2 and 3 of [Unicode7.0]).

   Q: toNFKC(cp) != cp

   The toNFKC() operation returns the code point in normalization
   form KC.  For more information, see Section 5 of Unicode Standard
   Annex #15 [UAX15].
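
   As a small illustration, the test for category Q can be written
   with Python's unicodedata module as shown below.  The function name
   has_compat is hypothetical, and the result depends on the Unicode
   version bundled with the interpreter.  Note that the NFKC form of a
   single code point can be a sequence of several code points, and
   that the comparison as literally written is also true for code
   points whose canonical (not only compatibility) normalization
   differs from the code point itself.

```python
import unicodedata


def has_compat(cp: int) -> bool:
    """Category Q: the NFKC form of the code point differs from it."""
    ch = chr(cp)
    return unicodedata.normalize("NFKC", ch) != ch


print(has_compat(0xFF21))   # FULLWIDTH LATIN CAPITAL LETTER A -> "A"
print(has_compat(0xFB01))   # LATIN SMALL LIGATURE FI -> "fi" (two code points)
print(has_compat(0x212B))   # ANGSTROM SIGN -> U+00C5 (canonical singleton)
print(has_compat(0x00E9))   # LATIN SMALL LETTER E WITH ACUTE is unchanged
```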

9.18.  OtherLetterDigits (R)

   This PRECIS-specific category is used to group code points that are
   letters and digits other than the "traditional" letters and digits
   grouped under the LetterDigits (A) class (see Section 9.1).

   R: General_Category(cp) is in {Lt, Nl, No, Me}

10.  Guidelines for Designated Experts

   Experience with internationalization in application protocols has
   shown that protocol designers and application developers usually do
   not understand the subtleties and tradeoffs involved with
   internationalization and that they need considerable guidance in
   making reasonable decisions with regard to the options before them.

   Therefore:

   o  Protocol designers are strongly encouraged to question the
      assumption that they need to define new profiles, since existing
      profiles are designed for wide reuse (see Section 5 for further
      discussion).

   o  Those who persist in defining new profiles are strongly encouraged
      to clearly explain their justification for doing so, and to
      publish a stable specification that provides all of the
      information described under Section 11.3.

   o  The designated experts for profile registration requests ought to
      seek answers to all of the questions provided under Section 11.3
      and to encourage applicants to provide a stable specification
      documenting the profile (even though the registration policy for
      PRECIS profiles is Expert Review and a stable specification is not
      strictly required).

   o  Developers of applications that use PRECIS are strongly encouraged
      to apply the guidelines provided under Section 6 and to seek out
      the advice of the designated experts or other knowledgeable
      individuals in doing so.

   o  All parties are strongly encouraged to help prevent the
      multiplication of profiles beyond necessity, as described under
      Section 5.1, and to use PRECIS in ways that will minimize user
      confusion and insecure application behavior.

   Internationalization can be difficult and contentious; designated
   experts, profile registrants, and application developers are strongly
   encouraged to work together in a spirit of good faith and mutual
   understanding to achieve rough consensus on profile registration
   requests and the use of PRECIS in particular applications.  They are
   also encouraged to bring additional expertise into the discussion if
   that would be helpful in adding perspective or otherwise resolving
   issues.
