tech-invite   World Map     

IETF     RFCs     Groups     SIP     ABNFs    |    3GPP     Specs     Gloss.     Arch.     IMS     UICC    |    Misc.    |    search     info

RFC 7932

 
 
 

Brotli Compressed Data Format

Part 2 of 6, p. 22 to 43
Prev Section       Next Section

 


prevText      Top      ToC       Page 22 
6.  Encoding of Block-Switch Commands

   As described in Section 2, a block-switch command is a pair <block
   type, block count>.  These are encoded in the compressed data part of
   the meta-block, right before the start of each new block of a
   particular block category.

   Each block type in the compressed data is represented with a block
   type code, encoded using a prefix code over the block type code
   alphabet.  A block type symbol 0 means that the new block type is the
   same as the type of the previous block from the same block category,
   i.e., the block type that preceded the current type, while a block
   type symbol 1 means that the new block type equals the current block
   type plus one.  If the current block type is the maximal possible,
   then a block type symbol of 1 results in wrapping to a new block type
   of 0.  Block type symbols 2..257 represent block types 0..255,
   respectively.  The previous and current block types are initialized
   to 1 and 0, respectively, at the end of the meta-block header.

   Since the first block type of each block category is 0, the block
   type of the first block-switch command is not encoded in the
   compressed data.  If a block category has only one block type, the
   block count of the first block-switch command is also omitted from
   the compressed data; otherwise, it is encoded in the meta-block
   header.

   Since the end of the meta-block is detected by the number of
   uncompressed bytes produced, the block counts for any of the three
   categories need not count down to exactly zero at the end of the
   meta-block.

   The number of different block types in each block category, denoted
   by NBLTYPESL, NBLTYPESI, and NBLTYPESD for literals, insert-and-copy
   lengths, and distances, respectively, is encoded in the meta-block
   header, and it must equal to the largest block type plus one in that
   block category.  In other words, the set of literal, insert-and-copy
   length, and distance block types must be [0..NBLTYPESL-1],
   [0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively.  From this it
   follows that the alphabet size of literal, insert-and-copy length,
   and distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2, and
   NBLTYPESD + 2, respectively.

   Each block count in the compressed data is represented with a pair
   <block count code, extra bits>.  The block count code and the extra
   bits are encoded back-to-back, the block count code is encoded using
   a prefix code over the block count code alphabet, while the extra
   bits value is encoded as a fixed-width integer value.  The number of
   extra bits can be 0..24, and it is dependent on the block count code.

Top      Up      ToC       Page 23 
   The symbols of the block count code alphabet along with the number of
   extra bits and the range of block counts are as follows:

           Extra              Extra               Extra
      Code Bits Lengths  Code Bits Lengths   Code Bits Lengths
      ---- ---- -------  ---- ---- -------   ---- ---- -------
       0    2    1..4     9    4   65..80    18    7   369..496
       1    2    5..8    10    4   81..96    19    8   497..752
       2    2    9..12   11    4   97..112   20    9   753..1264
       3    2   13..16   12    5  113..144   21   10   1265..2288
       4    3   17..24   13    5  145..176   22   11   2289..4336
       5    3   25..32   14    5  177..208   23   12   4337..8432
       6    3   33..40   15    5  209..240   24   13   8433..16624
       7    3   41..48   16    6  241..304   25   24   16625..16793840
       8    4   49..64   17    6  305..368

   The first block-switch command of each block category is special in
   the sense that it is encoded in the meta-block header, and as
   described earlier, the block type code is omitted since it is an
   implicit zero.

7.  Context Modeling

   As described in Section 2, the prefix tree used to encode a literal
   byte or a distance code depends on the block type and the context ID.
   This section specifies how to compute the context ID for a particular
   literal and distance code and how to encode the context map that maps
   a <block type, context ID> pair to the index of a prefix code in the
   array of literal and distance prefix codes.

7.1.  Context Modes and Context ID Lookup for Literals

   The context for encoding the next literal is defined by the last two
   bytes in the stream (p1, p2, where p1 is the most recent byte),
   regardless of whether these bytes are produced by uncompressed meta-
   blocks, backward references, static dictionary references, or by
   literal insertions.  At the start of the stream, p1 and p2 are
   initialized to zero.

   There are four methods, called context modes, to compute the Context
   ID:

      *  LSB6, where the Context ID is the value of six least
         significant bits of p1,

      *  MSB6, where the Context ID is the value of six most significant
         bits of p1,

Top      Up      ToC       Page 24 
      *  UTF8, where the Context ID is a complex function of p1, p2,
         optimized for text compression, and

      *  Signed, where Context ID is a complex function of p1, p2,
         optimized for compressing sequences of signed integers.

   The Context ID for the UTF8 and Signed context modes is computed
   using the following lookup tables Lut0, Lut1, and Lut2.

      Lut0 :=
         0,  0,  0,  0,  0,  0,  0,  0,  0,  4,  4,  0,  0,  4,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         8, 12, 16, 12, 12, 20, 12, 16, 24, 28, 12, 12, 32, 12, 36, 12,
        44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 32, 32, 24, 40, 28, 12,
        12, 48, 52, 52, 52, 48, 52, 52, 52, 48, 52, 52, 52, 52, 52, 48,
        52, 52, 52, 52, 52, 48, 52, 52, 52, 52, 52, 24, 12, 28, 12, 12,
        12, 56, 60, 60, 60, 56, 60, 60, 60, 56, 60, 60, 60, 60, 60, 56,
        60, 60, 60, 60, 60, 56, 60, 60, 60, 60, 60, 24, 12, 28, 12,  0,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
         2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
         2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
         2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
         2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3

      Lut1 :=
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
         1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
         1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2

Top      Up      ToC       Page 25 
      Lut2 :=
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7

   The lengths and the CRC-32 check values (see Appendix C) of each of
   these tables as a sequence of bytes are as follows:

      Table    Length    CRC-32
      -----    ------    ------
      Lut0     256       0x8e91efb7
      Lut1     256       0xd01a32f4
      Lut2     256       0x0dd7a0d6

   Given p1 is the last uncompressed byte and p2 is the second-to-last
   uncompressed byte, the context IDs can be computed as follows:

      For LSB6:    Context ID = p1 & 0x3f
      For MSB6:    Context ID = p1 >> 2
      For UTF8:    Context ID = Lut0[p1] | Lut1[p2]
      For Signed:  Context ID = (Lut2[p1] << 3) | Lut2[p2]

   From the lookup tables defined above and the operations to compute
   the context IDs, we can see that context IDs for literals are in the
   range of 0..63.

   The context modes LSB6, MSB6, UTF8, and Signed are denoted by
   integers 0, 1, 2, 3.

   A context mode is defined for each literal block type and they are
   stored in a consecutive array of bits in the meta-block header,
   always two bits per block type.

Top      Up      ToC       Page 26 
7.2.  Context ID for Distances

   The context for encoding a distance code is defined by the copy
   length corresponding to the distance.  The context IDs are 0, 1, 2,
   and 3 for copy lengths 2, 3, 4, and more than 4, respectively.

7.3.  Encoding of the Context Map

   There are two context maps, one for literals and one for distances.
   The size of the context map is 64 * NBLTYPESL for literals, and 4 *
   NBLTYPESD for distances.  Each value in the context map is an integer
   between 0 and 255, indicating the index of the prefix code to be used
   when encoding the next literal or distance.

   The context maps are two-dimensional matrices, encoded as one-
   dimensional arrays:

      CMAPL[0..(64 * NBLTYPESL - 1)]
      CMAPD[0..(4 * NBLTYPESD - 1)]

   The index of the prefix code for encoding a literal or distance code
   with block type, BTYPE_x, and context ID, CIDx, is:

      index of literal prefix code = CMAPL[64 * BTYPE_L + CIDL]
      index of distance prefix code = CMAPD[4 * BTYPE_D + CIDD]

   The values of the context map are encoded with the combination of run
   length encoding for zero values and prefix coding.  Let RLEMAX denote
   the number of run length codes and NTREES denote the maximum value in
   the context map plus one.  NTREES must equal the number of different
   values in the context map; in other words, the different values in
   the context map must be the [0..NTREES-1] interval.  The alphabet of
   the prefix code has the following RLEMAX + NTREES symbols:

      0: value zero
      1: repeat a zero 2 to 3 times, read 1 bit for repeat length
      2: repeat a zero 4 to 7 times, read 2 bits for repeat length
      ...
      RLEMAX: repeat a zero (1 << RLEMAX) to (1 << (RLEMAX+1))-1
              times, read RLEMAX bits for repeat length
      RLEMAX + 1: value 1
      ...
      RLEMAX + NTREES - 1: value NTREES - 1

Top      Up      ToC       Page 27 
   If RLEMAX = 0, the run length coding is not used and the symbols of
   the alphabet are directly the values in the context map.  We can now
   define the format of the context map (the same format is used for
   literal and distance context maps):

      1..5 bits: RLEMAX, 0 is encoded with one 0 bit, and values 1..16
                 are encoded with bit pattern xxxx1 (so 01001 is 5)

      Prefix code with alphabet size NTREES + RLEMAX

      Context map size values encoded with the above prefix code and run
         length coding for zero values.  If a run length would result in
         more lengths in total than the size of the context map, then
         the stream should be rejected as invalid.

      1 bit:  IMTF bit, if set, we do an inverse move-to-front transform
              on the values in the context map to get the prefix code
              indexes.

   Note that RLEMAX may be larger than the value necessary to represent
   the longest sequence of zero values.  Also, the NTREES value is
   encoded right before the context map as described in Section 9.2.

   We define the inverse move-to-front transform used in this
   specification by the following C language function:

      void InverseMoveToFrontTransform(uint8_t* v, int v_len) {
         uint8_t mtf[256];
         int i;
         for (i = 0; i < 256; ++i) {
            mtf[i] = (uint8_t)i;
         }
         for (i = 0; i < v_len; ++i) {
            uint8_t index = v[i];
            uint8_t value = mtf[index];
            v[i] = value;
            for (; index; --index) {
               mtf[index] = mtf[index - 1];
            }
            mtf[0] = value;
         }
      }

   Note that the inverse move-to-front transform will not produce values
   outside the [0..NTREES-1] interval.

Top      Up      ToC       Page 28 
8.  Static Dictionary

   At any given point during decoding the compressed data, a reference
   to a duplicated string in the uncompressed data produced so far has a
   maximum backward distance value, which is the minimum of the window
   size and the number of uncompressed bytes produced.  However,
   decoding a distance from the compressed stream, as described in
   Section 4, can produce distances that are greater than this maximum
   allowed value.  In this case, the distance is treated as a reference
   to a word in the static dictionary given in Appendix A.  The copy
   length for a static dictionary reference must be between 4 and 24.
   The static dictionary has three parts:

      * DICT[0..DICTSIZE], an array of bytes
      * DOFFSET[0..24], an array of byte-offset values for each length
      * NDBITS[0..24], an array of bit-depth values for each length

   The number of static dictionary words for a given length is:

      NWORDS[length] = 0                       (if length < 4)
      NWORDS[length] = (1 << NDBITS[length])   (if length >= 4)

   DOFFSET and DICTSIZE are defined by the following recursion:

      DOFFSET[0] = 0
      DOFFSET[length + 1] = DOFFSET[length] + length * NWORDS[length]
      DICTSIZE = DOFFSET[24] + 24 * NWORDS[24]

   The offset of a word within the DICT array for a given length and
   index is:

      offset(length, index) = DOFFSET[length] + index * length

   Each static dictionary word has 121 different forms, given by
   applying a word transformation to a base word in the DICT array.  The
   list of word transformations is given in Appendix B.  The static
   dictionary word for a <length, distance> pair can be reconstructed as
   follows:

      word_id = distance - (max allowed distance + 1)
      index = word_id % NWORDS[length]
      base_word = DICT[offset(length, index)..offset(length, index+1)-1]
      transform_id = word_id >> NDBITS[length]

   The string copied to the uncompressed stream is computed by applying
   the transformation to the base dictionary word.  If transform_id is
   greater than 120, or the length is smaller than 4 or greater than 24,
   then the compressed stream should be rejected as invalid.

Top      Up      ToC       Page 29 
   Each word transformation has the following form:

      transform_i(word) = prefix_i + T_i(word) + suffix_i

   where the _i subscript denotes the transform_id above.  Each T_i is
   one of the following 21 elementary transforms:

      Identity, FermentFirst, FermentAll,
      OmitFirst1, ..., OmitFirst9, OmitLast1, ..., OmitLast9

   The form of these elementary transforms is as follows:

      Identity(word) = word

      FermentFirst(word) = see below

      FermentAll(word) = see below

      OmitFirstk(word) = the last (length(word) - k) bytes of word, or
                         empty string if length(word) < k

      OmitLastk(word) = the first (length(word) - k) bytes of word, or
                        empty string if length(word) < k

Top      Up      ToC       Page 30 
   We define the FermentFirst and FermentAll transforms used in this
   specification by the following C language functions:

      int Ferment(uint8_t* word, int word_len, int pos) {
         if (word[pos] < 192) {
            if (word[pos] >= 97 and word[pos] <= 122) {
               word[pos] = word[pos] ^ 32;
            }
            return 1;
         } else if (word[pos] < 224) {
            if (pos + 1 < word_len) {
               word[pos + 1] = word[pos + 1] ^ 32;
            }
            return 2;
         } else {
            if (pos + 2 < word_len) {
               word[pos + 2] = word[pos + 2] ^ 5;
            }
            return 3;
         }
      }

      void FermentFirst(uint8_t* word, int word_len) {
         if (word_len > 0) {
            Ferment(word, word_len, 0);
         }
      }

      void FermentAll(uint8_t* word, int word_len) {
         int i = 0;
         while (i < word_len) {
            i += Ferment(word, word_len, i);
         }
      }

   Appendix B contains the list of transformations by specifying the
   prefix, elementary transform and suffix components of each of them.
   Note that the OmitFirst8 elementary transform is not used in the list
   of transformations.  The strings in Appendix B are in C-string format
   with respect to escape (backslash) characters.

   The maximum number of additional bytes that a transform may add to a
   base word is 13.  Since the largest base word is 24 bytes long, a
   buffer of 38 bytes is sufficient to store any transformed words
   (counting a terminating zero byte).

Top      Up      ToC       Page 31 
9.  Compressed Data Format

   In this section, we describe the format of the compressed data set in
   terms of the format of the individual data items described in the
   previous sections.

9.1.  Format of the Stream Header

   The stream header has only the following one field:

      1..7 bits: WBITS, a value in the range 10..24, encoded with the
                 following variable-length code (as it appears in the
                 compressed data, where the bits are parsed from right
                 to left):

                      Value    Bit Pattern
                      -----    -----------
                         10        0100001
                         11        0110001
                         12        1000001
                         13        1010001
                         14        1100001
                         15        1110001
                         16              0
                         17        0000001
                         18           0011
                         19           0101
                         20           0111
                         21           1001
                         22           1011
                         23           1101
                         24           1111

                 Note that bit pattern 0010001 is invalid and must not
                 be used.

   The size of the sliding window, which is the maximum value of any
   non-dictionary reference backward distance, is given by the following
   formula:

      window size = (1 << WBITS) - 16

Top      Up      ToC       Page 32 
9.2.  Format of the Meta-Block Header

   A compliant compressed data set has at least one meta-block.  Each
   meta-block contains a header with information about the uncompressed
   length of the meta-block, and a bit signaling if the meta-block is
   the last one.  The format of the meta-block header is the following:

      1 bit:  ISLAST, set to 1 if this is the last meta-block

      1 bit:  ISLASTEMPTY, if set to 1, the meta-block is empty; this
              field is only present if ISLAST bit is set -- if it is 1,
              then the meta-block and the brotli stream ends at that
              bit, with any remaining bits in the last byte of the
              compressed stream filled with zeros (if the fill bits are
              not zero, then the stream should be rejected as invalid)

      2 bits: MNIBBLES, number of nibbles to represent the uncompressed
              length, encoded with the following fixed-length code:

                    Value    Bit Pattern
                    -----    -----------
                        0             11
                        4             00
                        5             01
                        6             10

              If MNIBBLES is 0, the meta-block is empty, i.e., it does
              not generate any uncompressed data.  In this case, the
              rest of the meta-block has the following format:

                   1 bit:  reserved, must be zero

                   2 bits: MSKIPBYTES, number of bytes to represent
                           metadata length

                   MSKIPBYTES * 8 bits: MSKIPLEN - 1, where MSKIPLEN is
                           the number of metadata bytes; this field is
                           only present if MSKIPBYTES is positive;
                           otherwise, MSKIPLEN is 0 (if MSKIPBYTES is
                           greater than 1, and the last byte is all
                           zeros, then the stream should be rejected as
                           invalid)

                   0..7 bits: fill bits until the next byte boundary,
                           must be all zeros

                   MSKIPLEN bytes of metadata, not part of the
                           uncompressed data or the sliding window

Top      Up      ToC       Page 33 
      MNIBBLES * 4 bits: MLEN - 1, where MLEN is the length of the meta-
              block uncompressed data in bytes (if MNIBBLES is greater
              than 4, and the last nibble is all zeros, then the stream
              should be rejected as invalid)

      1 bit:  ISUNCOMPRESSED, if set to 1, any bits of compressed data
              up to the next byte boundary are ignored, and the rest of
              the meta-block contains MLEN bytes of literal data; this
              field is only present if the ISLAST bit is not set (if the
              ignored bits are not all zeros, the stream should be
              rejected as invalid)

      1..11 bits: NBLTYPESL, number of literal block types, encoded with
              the following variable-length code (as it appears in the
              compressed data, where the bits are parsed from right to
              left, so 0110111 has the value 12):

                       Value    Bit Pattern
                       -----    -----------
                         1                0
                         2             0001
                       3..4           x0011
                       5..8          xx0101
                       9..16        xxx0111
                      17..32       xxxx1001
                      33..64      xxxxx1011
                      65..128    xxxxxx1101
                     129..256   xxxxxxx1111

         Prefix code over the block type code alphabet for literal block
            types, appears only if NBLTYPESL >= 2

         Prefix code over the block count code alphabet for literal
            block counts, appears only if NBLTYPESL >= 2

         Block count code + extra bits for first literal block count,
            appears only if NBLTYPESL >= 2

      1..11 bits: NBLTYPESI, number of insert-and-copy block types,
                  encoded with the same variable-length code as above

         Prefix code over the block type code alphabet for insert-and-
            copy block types, appears only if NBLTYPESI >= 2

         Prefix code over the block count code alphabet for insert-and-
            copy block counts, appears only if NBLTYPESI >= 2

Top      Up      ToC       Page 34 
         Block count code + extra bits for first insert-and-copy block
            count, appears only if NBLTYPESI >= 2

      1..11 bits: NBLTYPESD, number of distance block types, encoded
                  with the same variable-length code as above

         Prefix code over the block type code alphabet for distance
            block types, appears only if NBLTYPESD >= 2

         Prefix code over the block count code alphabet for distance
            block counts, appears only if NBLTYPESD >= 2

         Block count code + extra bits for first distance block count,
            appears only if NBLTYPESD >= 2

      2 bits: NPOSTFIX, parameter used in the distance coding

      4 bits: four most significant bits of NDIRECT, to get the actual
              value of the parameter NDIRECT, left-shift this four-bit
              number by NPOSTFIX bits

      NBLTYPESL * 2 bits: context mode for each literal block type

      1..11 bits: NTREESL, number of literal prefix trees, encoded with
                  the same variable-length code as NBLTYPESL

         Literal context map, encoded as described in Section 7.3,
            appears only if NTREESL >= 2; otherwise, the context map has
            only zero values

      1..11 bits: NTREESD, number of distance prefix trees, encoded with
                  the same variable-length code as NBLTYPESD

         Distance context map, encoded as described in Section 7.3,
            appears only if NTREESD >= 2; otherwise, the context map has
            only zero values

      NTREESL prefix codes for literals

      NBLTYPESI prefix codes for insert-and-copy lengths

      NTREESD prefix codes for distances

Top      Up      ToC       Page 35 
9.3.  Format of the Meta-Block Data

   The compressed data part of a meta-block consists of a series of
   commands.  Each command has the following format:

      Block type code for next insert-and-copy block type, appears only
         if NBLTYPESI >= 2 and the previous insert-and-copy block count
         is zero

      Block count code + extra bits for next insert-and-copy block
         count, appears only if NBLTYPESI >= 2 and the previous insert-
         and-copy block count is zero

      Insert-and-copy length, encoded as in Section 5, using the insert-
         and-copy length prefix code with the current insert-and-copy
         block type index

      Insert length number of literals, with the following format:

         Block type code for next literal block type, appears only if
            NBLTYPESL >= 2 and the previous literal block count is zero

         Block count code + extra bits for next literal block count,
            appears only if NBLTYPESL >= 2 and the previous literal
            block count is zero

         Next byte of the uncompressed data, encoded with the literal
            prefix code with the index determined by the previous two
            bytes of the uncompressed data, the current literal block
            type, and the context map, as described in Section 7.3

      Block type code for next distance block type, appears only if
         NBLTYPESD >= 2 and the previous distance block count is zero

      Block count code + extra bits for next distance block count,
         appears only if NBLTYPESD >= 2 and the previous distance block
         count is zero

      Distance code, encoded as in Section 4, using the distance prefix
         code with the index determined by the copy length, the current
         distance block type, and the distance context map, as described
         in Section 7.3, appears only if the distance code is not an
         implicit 0, as indicated by the insert-and-copy length code

Top      Up      ToC       Page 36 
   The number of commands in the meta-block is such that the sum of the
   uncompressed bytes produced (i.e., the number of literals inserted
   plus the number of bytes copied from past data or generated from the
   static dictionary) over all the commands gives the uncompressed
   length, MLEN encoded in the meta-block header.

   If the total number of uncompressed bytes produced after the insert
   part of the last command equals MLEN, then the copy length of the
   last command is ignored and will not produce any uncompressed output.
   In this case, the copy length of the last command can have any value.
   In any other case, if the number of literals to insert, the copy
   length, or the resulting dictionary word length would cause MLEN to
   be exceeded, then the stream should be rejected as invalid.

   If the last command of the last non-empty meta-block does not end on
   a byte boundary, the unused bits in the last byte must be zeros.

10.  Decoding Algorithm

   The decoding algorithm that produces the uncompressed data is as
   follows:

      read window size
      do
         read ISLAST bit
         if ISLAST
            read ISLASTEMPTY bit
            if ISLASTEMPTY
               break from loop
         read MNIBBLES
         if MNIBBLES is zero
            verify reserved bit is zero
            read MSKIPLEN
            skip any bits up to the next byte boundary
            skip MSKIPLEN bytes
            continue to the next meta-block
         else
            read MLEN
         if not ISLAST
            read ISUNCOMPRESSED bit
            if ISUNCOMPRESSED
               skip any bits up to the next byte boundary
               copy MLEN bytes of compressed data as literals
               continue to the next meta-block

Top      Up      ToC       Page 37 
         loop for each three block categories (i = L, I, D)
            read NBLTYPESi
            if NBLTYPESi >= 2
               read prefix code for block types, HTREE_BTYPE_i
               read prefix code for block counts, HTREE_BLEN_i
               read block count, BLEN_i
               set block type, BTYPE_i to 0
               initialize second-to-last and last block types to 0 and 1
            else
               set block type, BTYPE_i to 0
               set block count, BLEN_i to 16777216
         read NPOSTFIX and NDIRECT
         read array of literal context modes, CMODE[]
         read NTREESL
         if NTREESL >= 2
            read literal context map, CMAPL[]
         else
            fill CMAPL[] with zeros
         read NTREESD
         if NTREESD >= 2
            read distance context map, CMAPD[]
         else
            fill CMAPD[] with zeros
         read array of literal prefix codes, HTREEL[]
         read array of insert-and-copy length prefix codes, HTREEI[]
         read array of distance prefix codes, HTREED[]
         do
            if BLEN_I is zero
               read block type using HTREE_BTYPE_I and set BTYPE_I
                  save previous block type
               read block count using HTREE_BLEN_I and set BLEN_I
            decrement BLEN_I
            read insert-and-copy length symbol using HTREEI[BTYPE_I]
            compute insert length, ILEN, and copy length, CLEN
            loop for ILEN
               if BLEN_L is zero
                  read block type using HTREE_BTYPE_L and set BTYPE_L
                     save previous block type
                  read block count using HTREE_BLEN_L and set BLEN_L
               decrement BLEN_L
               look up context mode CMODE[BTYPE_L]
               compute context ID, CIDL from last two uncompressed bytes
               read literal using HTREEL[CMAPL[64*BTYPE_L + CIDL]]
               write literal to uncompressed stream
            if number of uncompressed bytes produced in the loop for
               this meta-block is MLEN, then break from loop (in this
               case the copy length is ignored and can have any value)

Top      Up      ToC       Page 38 
            if distance code is implicit zero from insert-and-copy code
               set backward distance to the last distance
            else
               if BLEN_D is zero
                  read block type using HTREE_BTYPE_D and set BTYPE_D
                     save previous block type
                  read block count using HTREE_BLEN_D and set BLEN_D
               decrement BLEN_D
               compute context ID, CIDD from CLEN
               read distance code using HTREED[CMAPD[4*BTYPE_D + CIDD]]
               compute distance by distance short code substitution
               if distance code is not zero,
                  and distance is not a static dictionary reference,
                  push distance to the ring buffer of last distances
            if distance is less than the max allowed distance plus one
               move backwards distance bytes in the uncompressed data,
               and copy CLEN bytes from this position to
               the uncompressed stream
            else
               look up the static dictionary word, transform the word as
               directed, and copy the result to the uncompressed stream
         while number of uncompressed bytes for this meta-block < MLEN
      while not ISLAST

   If the stream ends before the completion of the last meta-block, then
   the stream should be rejected as invalid.

   Note that a duplicated string reference may refer to a string in a
   previous meta-block, i.e., the backward distance may cross one or
   more meta-block boundaries.  However, a backward copy distance will
   not refer past the beginning of the uncompressed stream or the window
   size; any such distance is interpreted as a reference to a static
   dictionary word.  Also, note that the referenced string may overlap
   the current position, for example, if the last 2 bytes decoded have
   values X and Y, a string reference with <length = 5, distance = 2>
   adds X,Y,X,Y,X to the uncompressed stream.

11.  Considerations for Compressor Implementations

   Since the intent of this document is to define the brotli compressed
   data format without reference to any particular compression
   algorithm, the material in this section is not part of the definition
   of the format, and a compressor need not follow it in order to be
   compliant.

Top      Up      ToC       Page 39 
11.1.  Trivial Compressor

   In this section, we present a very simple algorithm that produces a
   valid brotli stream representing an arbitrary sequence of
   uncompressed bytes in the form of the following C++ language
   function.

      string BrotliCompressTrivial(const string& u) {
         if (u.empty()) {
            return string(1, 6);
         }
         int i;
         string c;
         c.append(1, 12);
         for (i = 0; i + 65535 < u.size(); i += 65536) {
            c.append(1, 248);
            c.append(1, 255);
            c.append(1, 15);
            c.append(&u[i], 65536);
         }
         if (i < u.size()) {
            int r = u.size() - i - 1;
            c.append(1, (r & 31) << 3);
            c.append(1, r >> 5);
            c.append(1, 8 + (r >> 13));
            c.append(&u[i], r + 1);
         }
         c.append(1, 3);
         return c;
      }

   Note that this simple algorithm does not actually compress data, that
   is, the brotli representation will always be bigger than the
   original, but it shows that every sequence of N uncompressed bytes
   can be represented with a valid brotli stream that is not longer than
   N + (3 * (N >> 16) + 5) bytes.

11.2.  Aligning Compressed Meta-Blocks to Byte Boundaries

   As described in Section 9, only those meta-blocks that immediately
   follow an uncompressed meta-block or a metadata meta-block are
   guaranteed to start on a byte boundary.  In some applications, it
   might be required that every non-metadata meta-block starts on a byte
   boundary.  This can be achieved by appending an empty metadata meta-
   block after every non-metadata meta-block that does not end on a byte
   boundary.

Top      Up      ToC       Page 40 
11.3.  Creating Self-Contained Parts within the Compressed Data

   In some encoder implementations, it might be required to make a
   sequence of bytes within a brotli stream self-contained, that is,
   such that they can be decompressed independently from previous parts
   of the compressed data.  This is a useful feature for three reasons.
   First, if a large compressed file is damaged, it is possible to
   recover some of the file after the damage.  Second, it is useful when
   doing differential transfer of compressed data.  If a sequence of
   uncompressed bytes is unchanged and compressed independently from
   previous data, then the compressed representation may also be
   unchanged and can therefore be transferred very cheaply.  Third, if
   sequences of uncompressed bytes are compressed independently, it
   allows for parallel compression of these byte sequences within the
   same file, in addition to parallel compression of multiple files.

   Given two sequences of uncompressed bytes, U0 and U1, we will now
   describe how to create two sequences of compressed bytes, C0 and C1,
   such that the concatenation of C0 and C1 is a valid brotli stream,
   and that C0 and C1 (together with the first byte of C0 that contains
   the window size) can be decompressed independently from each other to
   U0 and U1.

   When compressing the byte sequence U0 to produce C0, we can use any
   compressor that works on the complete set of uncompressed bytes U0,
   with the following two changes.  First, the ISLAST bit of the last
   meta-block of C0 must not be set.  Second, C0 must end at a byte-
   boundary, which can be ensured by appending an empty metadata meta-
   block to it, as in Section 11.2.

   When compressing the byte sequence U1 to produce C1, we can use any
   compressor that starts a new meta-block at the beginning of U1 within
   the U0+U1 input stream, with the following two changes.  First,
   backward distances in C1 must not refer to static dictionary words or
   uncompressed bytes in U0.  Even if a sequence of bytes in U1 would
   match a static dictionary word, or a sequence of bytes that overlaps
   U0, the compressor must represent this sequence of bytes with a
   combination of literal insertions and backward references to bytes in
   U1 instead.  Second, the ring buffer of last four distances must be
   replenished first with distances in C1 before using it to encode
   other distances in C1.  Note that both compressors producing C0 and
   C1 have to use the same window size, but the stream header is emitted
   only by the compressor that produces C0.

   Note that this method can be easily generalized to more than two
   sequences of uncompressed bytes.

Top      Up      ToC       Page 41 
12.  Security Considerations

   As with any compressed file formats, decompressor implementations
   should handle all compressed data byte sequences, not only those that
   conform to this specification, where non-conformant compressed data
   sequences should be rejected as invalid.

   A possible attack against a system containing a decompressor
   implementation (e.g., a web browser) is to exploit a buffer overflow
   triggered by invalid compressed data.  Therefore, decompressor
   implementations should perform bounds-checking for each memory access
   that result from values decoded from the compressed stream and
   derivatives thereof.

   Another possible attack against a system containing a decompressor
   implementation is to provide it (either valid or invalid) compressed
   data that can make the decompressor system's resource consumption
   (CPU, memory, or storage) to be disproportionately large compared to
   the size of the compressed data.  In addition to the size of the
   compressed data, the amount of CPU, memory, and storage required to
   decompress a single compressed meta-block within a brotli stream is
   controlled by the following two parameters: the size of the
   uncompressed meta-block, which is encoded at the start of the
   compressed meta-block, and the size of the sliding window, which is
   encoded at the start of the brotli stream.  Decompressor
   implementations in systems where memory or storage is constrained
   should perform a sanity-check on these two parameters.  The
   uncompressed meta-block size that was decoded from the compressed
   stream should be compared against either a hard limit, given by the
   system's constraints or some expectation about the uncompressed data,
   or against a certain multiple of the size of the compressed data.  If
   the uncompressed meta-block size is determined to be too high, the
   compressed data should be rejected.  Likewise, when the complete
   uncompressed stream is kept in the system containing the decompressor
   implementation, the total uncompressed size of the stream should be
   checked before decompressing each additional meta-block.  If the size
   of the sliding window that was decoded from the start of the
   compressed stream is greater than a certain soft limit, then the
   decompressor implementation should, at first, allocate a smaller
   sliding window that fits the first uncompressed meta-block, and
   afterwards, before decompressing each additional meta-block, it
   should increase the size of the sliding window until the sliding
   window size specified in the compressed data is reached.

Top      Up      ToC       Page 42 
   Correspondingly, possible attacks against a system containing a
   compressor implementation (e.g., a web server) are to exploit a
   buffer overflow or cause disproportionately large resource
   consumption by providing, e.g., uncompressible data.  As described in
   Section 11.1, an output buffer of

            S(N) = N + (3 * (N >> 16) + 5)

   bytes is sufficient to hold a valid compressed brotli stream
   representing an arbitrary sequence of N uncompressed bytes.
   Therefore, compressor implementations should allocate at least S(N)
   bytes of output buffer before compressing N bytes of data with
   unknown compressibility and should perform bounds-checking for each
   write into this output buffer.  If their output buffer is full,
   compressor implementations should revert to the trivial compression
   algorithm described in Section 11.1.  The resource consumption of a
   compressor implementation for a particular input data depends mostly
   on the algorithm used to find backward matches and on the algorithm
   used to construct context maps and prefix codes and only to a lesser
   extent on the input data itself.  If the system containing a
   compressor implementation is overloaded, a possible way to reduce
   resource usage is to switch to more simple algorithms for backward
   reference search and prefix code construction, or to fall back to the
   trivial compression algorithm described in Section 11.1.

   A possible attack against a system that sends compressed data over an
   encrypted channel is the following.  An attacker who can repeatedly
   mix arbitrary (attacker-supplied) data with secret data (passwords,
   cookies) and observe the length of the ciphertext can potentially
   reconstruct the secret data.  To protect against this kind of attack,
   applications should not mix sensitive data with non-sensitive,
   potentially attacker-supplied data in the same compressed stream.

13.  IANA Considerations

   The "HTTP Content Coding Registry" has been updated with the
   registration below:

      +-------+-------------------------------------+------------+
      | Name  | Description                         | Reference  |
      +-------+-------------------------------------+------------+
      | br    | Brotli Compressed Data Format       | RFC 7932   |
      +-------+-------------------------------------+------------+

Top      Up      ToC       Page 43 
14.  Informative References

   [HUFFMAN]  Huffman, D. A., "A Method for the Construction of Minimum
              Redundancy Codes", Proceedings of the Institute of Radio
              Engineers, September 1952, Vol. 40, No. 9, pp. 1098-1101.

   [LZ77]     Ziv, J. and A. Lempel, "A Universal Algorithm for
              Sequential Data Compression", IEEE Transactions on
              Information Theory, Vol. 23, No. 3, pp. 337-343,
              DOI 10.1109/TIT.1977.1055714, May 1977,
              <https://www.cs.duke.edu/courses/spring03/cps296.5/papers/
              ziv_lempel_1977_universal_algorithm.pdf>.

   [RFC1951]  Deutsch, P., "DEFLATE Compressed Data Format Specification
              version 1.3", RFC 1951, DOI 10.17487/RFC1951, May 1996,
              <http://www.rfc-editor.org/info/rfc1951>.

   [WOFF2]    Levantovsky, V., Ed., and R. Levien, Ed., "WOFF File
              Format 2.0", W3C Candidate Recommendation, March 2016,
              <http://www.w3.org/TR/WOFF2/>.


Next Section