Tech-invite3GPPspaceIETFspace
21222324252627282931323334353637384‑5x

Content for  TS 23.042  Word version:  19.0.0

Top   Top   Up   Prev   Next
0…   4…   5…   6…   A…   B…   C…   R…

 

5  Compressed Data Streamsp. 10

This clause provides:
  • A detailed specification of the algorithms and data structures that implement compression and decompression mechanisms.

5.1  Structurep. 10

A Compressed Data Stream (CDS) comprises three key components:
  • a Compression Header (CH) containing a variable number of octets, the content of which defines the nature of the compressed data;
  • the Compressed Data (CD) which is a bit stream of variable length;
  • a Compression Footer (CF) which is used to signal the number of bits in the last octet of the CDS that form part of the compressed data.

5.2  Compression Headerp. 11

The Compression Header (CH) comprises a variable number of octets that define the nature of the compressed data.
The compression header allows for a wide range of compression alternatives, however of these alternatives only one is defined as the basic mandatory form of compression that shall be supported by all implementations. This is the use of the basic Huffman algorithm initialized with no prior knowledge of character distribution. This case can be signalled directly by setting a single octet(octet 1) for the compression header with the value of 120 (decimal).
Up

5.2.1  Compression Header - Octet 1p. 11

The first CH octet is mandatory and is defined as follows:
7 6 5 4 3 2 1 0 Description
0There is no subsequent CH octet
1A further CH octet follows
nnnn The "Compression Language Context" this is described below
0Punctuation processing disabled
1Punctuation processing enabled
0Keyword processing disabled
1Keyword processing enabled
0Character group processing disabled
1Character group processing enabled
As noted in clause 4, the compression algorithms can be configured to operate in a variety of ways and may rely on end-to-end knowledge of "prior" information such as which key word dictionary is to be used.
A requirement that all configuration information be explicitly stated in the CH is less efficient (in terms of compression ratio) than if a default configuration is known and only variations from this need be signalled. However, a major determinant of configuration is the language in which the original message to be compressed is composed. For example, different keyword dictionaries would be required for French and opposed to German and character frequency distributions for English texts may vary greatly from those for Swedish texts. From this it can be seen that a universal "default" configuration would be of little value.
To address this, the Compression Language Context (CLC) allows a default configuration to be specified for each of the languages defined in TS 23.038 in relation to the Cell Broadcast Data Coding Scheme as follows:
  • The CLC in bits 6 to 3 of the CH specify the language as per TS 23.038 in the case where bits 7 to 4 of the Cell Broadcast Data Coding Scheme octet are set to 0000.
  • If and when required, higher order bits of the CLC can be signalled by a subsequent CH octet as described below.
  • The CLC value 1111 (language unspecified) will indicate a "default" configuration that is language independent. This is specified in Annex R and involves the basic Huffman (de-)coding with no initial character frequency distribution, see example below.
7 6 5 4 3 2 1 0 Description
01111000Basic Huffman (de-)coding only.
Up

5.2.2  Compression Header - Octets 2 to np. 12

Any second and subsequent CH octets are used to vary the configuration defaults established by the CLC. These octets all comprise a continuation bit followed by a Type, Value structure as follows:
7 6 5 4 3 2 1 0 Description
0There is no subsequent CH octet
1A further CH octet follows
nnnCH Extension Type
nnnnCH Extension Value
The bits of the semi-octet CH Extension value are interpreted left to right, MSB to LSB. If the CH contains more than one octet of the same CH Extension type, the CH Extension value of a subsequent CH octet, is interpreted as being next most significant semi-octet of the composite value being signalled.
For example if the CLC in CH octet 1 indicates that the default Huffman Initialization ID is 1 (decimal) and the required HI-ID is 37 (decimal), then the following octets (in the range 2 to n) would also be required in the CH.
7 6 5 4 3 2 1 0 Description
10110101The default HI-ID is replaced with the value 0101
00110010The current HI-ID value (0101) is extended to 0010 0101
The following values are defined for the CH Extension Type:
000
Extend CLC. The CH Extension Value contains higher order bits that are to be pre-pended to the current CLC value.
7 6 5 4 3 2 1 0 Description
10010The least significant semi-octet of the CLC is 0010
00000001The CLC value (0010) is extended to 0001 0010
001
Change Character Set. The CLC defines a default character set (UCS2 or otherwise) within which compression will operate. The Change Character Set CH Extension Type indicates that this should be overridden by the character set specified by the CH Extension Value. If a CH contains more than one Change Character Set CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits and are to be pre-pended to the value of the new character set.
The following Character Sets are defined:
0000
No character set defined. To be used where original message content is binary data and compression is solely via Huffman coding with no initial frequency training and thus there is no requirement to ensure consistent use of character set by coder and decoder.
0001
GSM 7 bit default alphabet (TS 23.038)
0010
Codepage 437
0011
Codepage 850
All other values are reserved - see clause 5.2.2.1
A Change Character Set to UCS2 codepoint is not defined here. Where the CLC indicates a character set other than UCS2 and there is a need to change to UCS2 then this is achieved using the Change UCS2 row parameter described below.
010
Change UCS2 Row. The CLC defines a default character set (UCS2 or otherwise) within which compression will operate. The Change UCS2 Row CH Extension Type indicates that this should be overridden by the use of UCS2 and the UCS2 row value for the first character in the input stream is that specified by the CH Extension Value. If a CH contains more than one Change UCS2 Row CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial UCS2 Row value and are to be pre-pended to the current value.
011
Change Huffman Initialization. The CLC defines a default set of parameters for the initialization of the Huffman (de)coder. The Change Huffman Initialization CH Extension Type indicates that this should be overridden by the set of initialization parameters identified by the Huffman Initialization ID contained in the CH Extension Value. If a CH contains more than one Change Huffman Initialization CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial Huffman Initialization ID value and are to be pre-pended to the current value.
100
Change Keyword Dictionary. The CLC defines a default set of parameters for the initialization of the Keyword (de)coder. The Change Keyword Dictionary CH Extension Type indicates that this should be overridden by the set of initialization parameters identified by the Keyword Dictionary ID contained in the CH Extension Value. If a CH contains more than one Change Keyword Dictionary CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial Keyword Dictionary ID value and are to be pre-pended to the current value.
101
Change Punctuator. The CLC defines a default set of parameters for the initialization of the punctuation (de)coder. The Change Punctuator CH Extension Type indicates that this should be overridden by the set of initialization parameters identified by the Punctuator ID contained in the CH Extension Value. If a CH contains more than one Punctuator CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial Punctuator ID value and are to be pre-pended to the current value.
110
Change Character Group. The CLC defines a default set of parameters for the initialization of the Character Group (de)coder. The Change Character Group CH Extension Type indicates that this should be overridden by the set of initialization parameters identified by the Character Group ID contained in the CH Extension Value. If a CH contains more than one Change Character Group CH Extension Type octet, the CH Extension Value contained in subsequent CH octets of this type contains higher order bits for the initial Character Group ID value and are to be pre-pended to the current value.
111
Reserved, see clause 5.2.2.1
Up

5.2.2.1  Compression Header reserved extension types and valuesp. 13

Any currently undefined values in the range 0 to 255 decimal are reserved.
Values above 255 are available for user to user requirements.

5.2.3  Identifying unique parameter setsp. 13

The four component compression algorithms (Huffman, Keywords, Character Groups and Punctuation) may all have a variety of initialization options. For each algorithm, a given set of initialization options needs to be identified for the processing of a given input stream.
Initialization and operation of the algorithms depends not only on the language in which the original source text is composed but also the character set (UCS2 or otherwise) that is to be used during processing. Thus the Huffman Initialization ID (HI-ID), Keyword Dictionary ID (KD-ID), Punctuator ID (PU-ID) and Character Group ID (CG-ID) only define unique values within the context of a given character set (the default established by the CLC or subsequently amended via Change Character Set or Change UCS2 Row CH Extension types) and within the context of the language indicated by the CLC.
Up

5.3  Compressed Datap. 14

The Compressed Data (CD) is a stream bits of variable length that represent either an encoding of the content original input stream or control information indication that the operation of some algorithm should vary in some manner.
Control information is signalled within the CD by Huffman encoded symbols (characters) whose value is greater than 255 decimal. Huffman encoded symbols in the range 0 to 255 are of course characters from the original input stream.
The following control symbols are defined:
Decimal value Significance
256New 7 bit character.
On encoding, if a character (octet) from the input stream in the range 0 to 127 does not exist in the Huffman tree, then the New 7 bit character symbol is Huffman encoded to the CD and bits 6 to 0 of the original octet are copied unchanged to the CD. The Huffman tree would then be updated to include the new character as described in the sections below.
On decoding the New 7 bit character symbol, the symbol itself is discarded and the next 7 bits of the CD are copied unchanged to bits 6-0 of the octet to be output, bit 7 of which is zero. The Huffman tree would then be updated to include the new character.
257New 8 bit character.
The operation of this is identical to that of the New 7 bit character except that on encoding, the input character is in the range 128-255 and on decoding, bit 7 of the output character is set to 1.
258Keyword.
This symbol (Huffman encoded) prefixes a sequence of bits of variable length in the CD that define a representation of characters in the uncompressed stream by an entry in a keyword dictionary.
On encoding, if a sequence of characters in the input stream can be represented by an entry in a keyword dictionary, the Keyword symbol is Huffman encoded to the CD followed by the bit sequence describing the keyword entry (this is described below).
On decoding the Keyword symbol, the symbol itself is discarded and the bit sequence describing the keyword entry is passed to the Keyword processor to recovery the original character sequence to be placed in the output stream.
259 to 265Character Group Transitions.
These symbols signal transitions between groups of characters defined within the Character Group processor. For example, if 2 groups are defined to be the lower case and upper case characters then the input stream:
"abcdefABCDEF" would become "abcdef<Change Group>abcdef"
On encoding, Character Group Transition symbols are generated by the Character Group processor and simply passed to the Huffman processor for encoding.
On decoding a Character Group Transition symbol, it is simply passed from the Huffman processor to the Character Group processor which takes the appropriate action based its current state and the group transition indicated.
266New UCS2 Row.
On encoding, if the next UCS2 character in the input stream has a "row octet" of a different value to that of the previous character in the input stream, the New UCS2 Row symbol is Huffman encoded to the CD and the 8 bit of the new row octet are copied unchanged to the CD. The new row octet is stored by the UCS2 processor as the "current row octet" and subsequent input characters within the current row are Huffman encoded as the 8 bit value of the character within the "current row".
On decoding the New UCS2 Row symbol, the symbol is discarded and the next 8 bits are read from the CD and stored by the UCS2 processor as the "current row octet". Subsequent UCS2 characters are decoded by treating the 8 bit character values decoded by the Huffman processor as characters within the "current row".
Up

5.4  Compression Footerp. 16

Although Compressed Data Stream Length (CDSL) - the total number of octets that contain the CDS - is known, the CD element of the CDS is a bit stream and therefore may not end on an octet boundary. The Compression Footer (CF) is used to indicate the end of the CD as follows:
  • Calculate the number of meaningful bits in the last octet of the CD (i.e. total CD bits modulo 8).
  • If the number of meaningful bits is >0 and <6 store the number of meaningful bits in bits 2 to 0 of the last octet. Otherwise extend the CD by adding 1 octet and store the number of meaningful bits in bits 2 to 0 of this new octet. In the case where the number of meaningful bits is 8 then bits 2 to 0 of the new octet are set to zero.
For example if there are 4 meaningful bits in the last CD octet, the CF will be constructed to occupy the shaded area in Table 7.
0 7 6 5 4 3 2 1 0
XXXXX100
Alternatively if there are 6 meaningful bits in the last CD octet, a new octet needs to be added. The CF will be constructed to occupy the shaded area in Table 8.
0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0
XXXXXXX110
If there are 8 meaningful bits in the last CD octet, a new octet needs to be added. The CF will be constructed to occupy the shaded area in Table 8a.
0 7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0
XXXXXXXXX000
In all the Tables above, the bits in the shaded area which have no bit value defined are set according to the particular bearer being used to transport compressed data. e.g. CBS. Where no particular reference is made regarding the value of those bits they may be set to any value.
Up

Up   Top   ToC