This section describes the details related to creating a canonical JSON representation and how they are addressed by JCS.
Appendix F describes the RECOMMENDED way of adding JCS support to existing JSON tools.
Data to be canonically serialized is usually created by:
Parsing previously generated JSON data.
Programmatically creating data.
Irrespective of the method used, the data to be serialized MUST be adapted for I-JSON [RFC 7493] formatting, which implies the following:
JSON objects MUST NOT exhibit duplicate property names.
JSON string data MUST be expressible as Unicode [UNICODE].
JSON number data MUST be expressible as IEEE 754 [IEEE754] double-precision values. For applications needing higher precision or longer integers than offered by IEEE 754 double precision, it is RECOMMENDED to represent such numbers as JSON strings; see Appendix D for details on how this can be performed in an interoperable and extensible way.
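The precision limit can be illustrated with a short ECMAScript sketch. The property names asNumber and asString are hypothetical, and carrying the integer as a plain decimal string is only one possible convention; Appendix D remains the reference for interoperable encodings.

```javascript
// 2^53 + 1 cannot be represented as an IEEE 754 double-precision value.
const parsed = JSON.parse(
  '{"asNumber": 9007199254740993, "asString": "9007199254740993"}'
);

// The JSON number is silently rounded to the nearest double.
console.log(parsed.asNumber);                        // 9007199254740992
console.log(Number.isSafeInteger(parsed.asNumber));  // false

// The JSON string keeps every digit and can be converted when needed.
console.log(BigInt(parsed.asString));                // 9007199254740993n
```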
An additional constraint is that parsed JSON string data MUST NOT be altered during subsequent serializations. For more information, see Appendix E.
Note: Although the Unicode standard offers the possibility of rearranging certain character sequences, referred to as "Unicode Normalization" [UCNORM], JCS-compliant string processing does not take this into consideration. That is, all components involved in a scheme depending on JCS MUST preserve Unicode string data "as is".
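As a small illustration of why normalization must not be applied, the following sketch compares two strings that render identically but consist of different code points. The helper name roundTrip is hypothetical; the sketch only assumes the standard JSON.parse(), JSON.stringify(), and String.prototype.normalize() functions.

```javascript
// Two visually identical strings with different code point sequences.
const precomposed = "\u00e9";  // U+00E9
const decomposed = "e\u0301";  // U+0065 followed by U+0301

// Parsing and re-serializing must leave each sequence untouched.
const roundTrip = (s) => JSON.parse(JSON.stringify(s));

console.log(roundTrip(precomposed) === precomposed);       // true
console.log(roundTrip(decomposed) === decomposed);         // true

// The strings are distinct unless normalization is applied,
// which a JCS-based scheme must not do.
console.log(precomposed === decomposed);                   // false
console.log(precomposed === decomposed.normalize("NFC"));  // true
```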
The following subsections describe the steps required to create a canonical JSON representation of the data elaborated on in the previous section.
Appendix A shows sample code for an ECMAScript-based canonicalizer, matching the JCS specification.
Whitespace between JSON tokens MUST NOT be emitted.
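Assume the following JSON object is parsed:
{
  "numbers": [333333333.33333329, 1E30, 4.50,
              2e-3, 0.000000000000000000000000001],
  "string": "\u20ac$\u000F\u000aA'\u0042\u0022\u005c\\\"\/",
  "literals": [null, true, false]
}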
If the parsed data is subsequently serialized using a serializer compliant with ECMAScript's "JSON.stringify()", the result (with a line wrap added for display purposes only) would differ noticeably from the original data:
{"numbers":[333333333.3333333,1e+30,4.5,0.002,1e-27],"string":
"€$\u000f\nA'B\"\\\\\"/","literals":[null,true,false]}
The difference between the parsed data and its serialized counterpart is due to a wide tolerance on input data (as defined by JSON [RFC 8259]), while output data (as defined by ECMAScript) has a fixed representation. As can be seen in the example, numbers are subject to rounding as well.
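The contrast between tolerant input and fixed output can be sketched as follows. The input texts are illustrative choices rather than samples from this specification, and reserialize is a hypothetical helper name.

```javascript
// Parse a JSON text and serialize it again with JSON.stringify().
const reserialize = (text) => JSON.stringify(JSON.parse(text));

// Several valid JSON spellings of the same number.
const numberForms = ["4.50", "4.5", "45e-1", "0.45E1"];
// Two valid JSON spellings of the same two-character string.
const stringForms = ['"\\u0041\\/"', '"A/"'];

console.log(numberForms.map(reserialize).join(" "));  // 4.5 4.5 4.5 4.5
console.log(stringForms.map(reserialize).join(" "));  // "A/" "A/"

// More digits than a double can hold are rounded during parsing.
console.log(reserialize("333333333.33333329"));       // 333333333.3333333
```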
The following subsections describe the serialization of primitive JSON data types according to JCS. This part is identical to that of ECMAScript. In the (unlikely) event that a future version of ECMAScript would invalidate any of the following serialization methods, it will be up to the developer community to either stick to this specification or create a new specification.
In accordance with JSON [RFC 8259], the literals "null", "true", and "false" MUST be serialized as null, true, and false, respectively.
For JSON string data (which includes JSON object property names as well), each Unicode code point MUST be serialized as described below (see Section 24.3.2.2 of [ECMA-262]):
If the Unicode value falls within the traditional ASCII control character range (U+0000 through U+001F), it MUST be serialized using lowercase hexadecimal Unicode notation (\uhhhh) unless it is in the set of predefined JSON control characters U+0008, U+0009, U+000A, U+000C, or U+000D, which MUST be serialized as \b, \t, \n, \f, and \r, respectively.
If the Unicode value is outside of the ASCII control character range, it MUST be serialized "as is" unless it is equivalent to U+005C (\) or U+0022 ("), which MUST be serialized as \\ and \", respectively.
Finally, the resulting sequence of Unicode code points MUST be enclosed in double quotes (").
Note: Since invalid Unicode data like "lone surrogates" (e.g., U+DEAD) may lead to interoperability issues including broken signatures, occurrences of such data MUST cause a compliant JCS implementation to terminate with an appropriate error.
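The string rules above can be sketched in ECMAScript as follows. This is a minimal illustration, not the reference implementation; serializeString and SHORT_ESCAPES are hypothetical names, and the sketch assumes the input is an ordinary ECMAScript string of UTF-16 code units.

```javascript
// Code units with a predefined two-character escape, plus " and \.
const SHORT_ESCAPES = new Map([
  [0x08, "\\b"], [0x09, "\\t"], [0x0a, "\\n"], [0x0c, "\\f"], [0x0d, "\\r"],
  [0x22, '\\"'], [0x5c, "\\\\"],
]);

function serializeString(str) {
  let out = '"';
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    if (unit >= 0xd800 && unit <= 0xdbff) {
      // A high surrogate must be followed by a low surrogate.
      const next = str.charCodeAt(i + 1); // NaN when the string ends here
      if (!(next >= 0xdc00 && next <= 0xdfff)) {
        throw new Error("lone high surrogate at index " + i);
      }
      out += str[i] + str[i + 1]; // emit the pair "as is"
      i++;
    } else if (unit >= 0xdc00 && unit <= 0xdfff) {
      throw new Error("lone low surrogate at index " + i);
    } else if (SHORT_ESCAPES.has(unit)) {
      out += SHORT_ESCAPES.get(unit);
    } else if (unit < 0x20) {
      // Remaining control characters use lowercase \uhhhh notation.
      out += "\\u" + unit.toString(16).padStart(4, "0");
    } else {
      out += str[i]; // everything else is serialized "as is"
    }
  }
  return out + '"';
}

console.log(serializeString("A\n\"\u000f")); // prints "A\n\"\u000f"
```

For well-formed strings the result is expected to match the output of JSON.stringify(); the difference is that this sketch terminates with an error on lone surrogates.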
ECMAScript builds on the IEEE 754 [IEEE754] double-precision standard for representing JSON number data. Such data MUST be serialized according to Section 7.1.12.1 of [ECMA-262], including the "Note 2" enhancement.
Due to the relative complexity of this part, the algorithm itself is not included in this document. For implementers of JCS-compliant number serialization, Google's implementation in V8 [V8] may serve as a reference. Another compatible number serialization reference implementation is Ryu [RYU], which is used by the JCS open-source Java implementation mentioned in Appendix G. Appendix B holds a set of IEEE 754 sample values and their corresponding JSON serialization.
Note: Since Not a Number (NaN) and Infinity are not permitted in JSON, occurrences of NaN or Infinity MUST cause a compliant JCS implementation to terminate with an appropriate error.
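In an ECMAScript environment, the number rules can be sketched by delegating to the built-in number-to-string conversion and rejecting non-finite values first. The function name serializeNumber is hypothetical; other platforms would need their own compatible algorithm.

```javascript
function serializeNumber(value) {
  // NaN, Infinity, and -Infinity have no JSON representation.
  if (typeof value !== "number" || !Number.isFinite(value)) {
    throw new Error("NaN and Infinity are not permitted in JSON");
  }
  // String() applies the ECMAScript number-to-string algorithm.
  return String(value);
}

console.log(serializeNumber(333333333.33333329)); // 333333333.3333333
console.log(serializeNumber(1e30));               // 1e+30
console.log(serializeNumber(4.50));               // 4.5
console.log(serializeNumber(1e-27));              // 1e-27
```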
Although the previous step normalized the representation of primitive JSON data types, the result would not yet qualify as "canonical" since JSON object properties are not in lexicographic (alphabetical) order.
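Applied to the sample in Section 3.2.2, a properly canonicalized version should (with a line wrap added for display purposes only) read as:
{"literals":[null,true,false],"numbers":[333333333.3333333,
1e+30,4.5,0.002,1e-27],"string":"€$\u000f\nA'B\"\\\\\"/"}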
The rules for lexicographic sorting of JSON object properties according to JCS are as follows:
JSON object properties MUST be sorted recursively, which means that child JSON objects MUST have their properties sorted as well.
JSON array data MUST also be scanned for the presence of JSON objects (if an object is found, then its properties MUST be sorted), but array element order MUST NOT be changed.
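The two rules above can be sketched as a recursive function. This is a simplified illustration that assumes its input came from JSON.parse() (so it contains only JSON data types and finite numbers); the function name canonicalize is hypothetical, and the lone-surrogate and NaN/Infinity checks required elsewhere in this specification are omitted for brevity.

```javascript
function canonicalize(value) {
  // Primitives: rely on the ECMAScript serialization of JSON.stringify().
  if (value === null || typeof value !== "object") {
    return JSON.stringify(value);
  }
  // Arrays: visit every element, but keep the element order.
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  // Objects: sort property names (the default sort() compares
  // UTF-16 code units), then recurse into each property value.
  const members = Object.keys(value)
    .sort()
    .map((name) => JSON.stringify(name) + ":" + canonicalize(value[name]));
  return "{" + members.join(",") + "}";
}

const nested = JSON.parse('{"b":[{"z":1,"a":2},3],"a":{"d":true,"c":null}}');
console.log(canonicalize(nested));
// {"a":{"c":null,"d":true},"b":[{"a":2,"z":1},3]}
```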
When a JSON object is about to have its properties sorted, the following measures MUST be adhered to:
The sorting process is applied to property name strings in their "raw" (unescaped) form. That is, a newline character is treated as U+000A.
Property name strings to be sorted are formatted as arrays of UTF-16 [UNICODE] code units. The sorting is based on pure value comparisons, where code units are treated as unsigned integers, independent of locale settings.
Property name strings either have different values at some index that is a valid index for both strings, or their lengths are different, or both. If they have different values at one or more index positions, let k be the smallest such index; then, the string whose value at position k has the smaller value, as determined by using the "<" operator, lexicographically precedes the other string. If there is no index position at which they differ, then the shorter string lexicographically precedes the longer string.
In plain English, this means that property names are sorted in ascending order like the following:
""
"a"
"aa"
"ab"
The rationale for basing the sorting algorithm on UTF-16 code units is that it maps directly to the string type in ECMAScript (featured in web browsers and Node.js), Java, and .NET. In addition, JSON only supports escape sequences expressed as UTF-16 code units, making knowledge and handling of such data a necessity anyway. Systems using another internal representation of string data will need to convert JSON property name strings into arrays of UTF-16 code units before sorting. The conversion from UTF-8 or UTF-32 to UTF-16 is defined by the Unicode [UNICODE] standard.
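The comparison rule described above can be written out explicitly as a comparator. In ECMAScript, the default Array.prototype.sort() already compares strings by UTF-16 code units, so the explicit version below (with the hypothetical name compareUtf16) mainly serves as a model for platforms whose native string comparison behaves differently.

```javascript
function compareUtf16(a, b) {
  const shared = Math.min(a.length, b.length);
  for (let k = 0; k < shared; k++) {
    // Code units are compared as unsigned 16-bit integers.
    const diff = a.charCodeAt(k) - b.charCodeAt(k);
    if (diff !== 0) return diff;
  }
  // No differing position: the shorter string precedes the longer one.
  return a.length - b.length;
}

const names = ["ab", "aa", "", "a"];
console.log(JSON.stringify([...names].sort(compareUtf16)));
// ["","a","aa","ab"]
```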
The following JSON test data can be used for verifying the correctness of the sorting scheme in a JCS implementation:
{
  "\u20ac": "Euro Sign",
  "\r": "Carriage Return",
  "\ufb33": "Hebrew Letter Dalet With Dagesh",
  "\ud83d\ude00": "Emoji: Grinning Face",
  "\u00f6": "Latin Small Letter O With Diaeresis"
}
Expected order of the property values after sorting the property names:
"Carriage Return"
"Latin Small Letter O With Diaeresis"
"Euro Sign"
"Emoji: Grinning Face"
"Hebrew Letter Dalet With Dagesh"
Note: For the purpose of obtaining a deterministic property order, sorting of data encoded in UTF-8 or UTF-32 would also work, but the outcome for JSON data like above would differ and thus be incompatible with this specification. However, in practice, property names are rarely defined outside of 7-bit ASCII, making it possible to sort string data in UTF-8 or UTF-32 format without conversion to UTF-16 and still be compatible with JCS. Whether or not this is a viable option depends on the environment JCS is used in.
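The divergence mentioned in the note can be reproduced with a short sketch. It sorts five property names from the test data once by UTF-16 code units and once by UTF-8 bytes; compareUtf8 is a hypothetical helper, and TextEncoder is assumed to be available as it is in current browsers and Node.js.

```javascript
const testData = {
  "\u20ac": "Euro Sign",
  "\r": "Carriage Return",
  "\ufb33": "Hebrew Letter Dalet With Dagesh",
  "\ud83d\ude00": "Emoji: Grinning Face",
  "\u00f6": "Latin Small Letter O With Diaeresis",
};

const encoder = new TextEncoder();
function compareUtf8(a, b) {
  const x = encoder.encode(a);
  const y = encoder.encode(b);
  const shared = Math.min(x.length, y.length);
  for (let i = 0; i < shared; i++) {
    if (x[i] !== y[i]) return x[i] - y[i];
  }
  return x.length - y.length;
}

// The default sort() compares UTF-16 code units, as JCS requires.
const byUtf16 = Object.keys(testData).sort().map((k) => testData[k]);
const byUtf8 = Object.keys(testData).sort(compareUtf8).map((k) => testData[k]);

console.log(byUtf16.slice(3)); // Emoji first, then Hebrew letter
console.log(byUtf8.slice(3));  // Hebrew letter first, then Emoji
```

The two orders differ only for the last two names: the surrogate pair starts with code unit 0xD83D, which is smaller than 0xFB33, while the UTF-8 encoding of U+1F600 starts with byte 0xF0, which is larger than the 0xEF that starts U+FB33.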
Finally, in order to create a platform-independent representation, the result of the preceding step MUST be encoded in UTF-8.
Applied to the sample in Section 3.2.3, this should yield the following bytes, here shown in hexadecimal notation:
7b 22 6c 69 74 65 72 61 6c 73 22 3a 5b 6e 75 6c 6c 2c 74 72
75 65 2c 66 61 6c 73 65 5d 2c 22 6e 75 6d 62 65 72 73 22 3a
5b 33 33 33 33 33 33 33 33 33 2e 33 33 33 33 33 33 33 2c 31
65 2b 33 30 2c 34 2e 35 2c 30 2e 30 30 32 2c 31 65 2d 32 37
5d 2c 22 73 74 72 69 6e 67 22 3a 22 e2 82 ac 24 5c 75 30 30
30 66 5c 6e 41 27 42 5c 22 5c 5c 5c 5c 5c 22 2f 22 7d
This data is intended to be usable as input to cryptographic methods.
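The final encoding step can be sketched as follows, using a deliberately small canonical string rather than the full sample. The variable names are hypothetical, and TextEncoder (which always produces UTF-8) is assumed to be available.

```javascript
// A short canonical JSON text containing one non-ASCII character.
const canonicalText = '{"string":"\u20ac$"}';

// Encode to UTF-8 and format each byte as two lowercase hex digits.
const utf8Bytes = new TextEncoder().encode(canonicalText);
const hexDump = Array.from(utf8Bytes, (b) => b.toString(16).padStart(2, "0")).join(" ");

console.log(hexDump);
// 7b 22 73 74 72 69 6e 67 22 3a 22 e2 82 ac 24 22 7d
```

The euro sign occupies the three bytes e2 82 ac, matching the corresponding bytes in the listing above. A byte array like utf8Bytes is what would be passed to a hash or signature function.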