This clause contains considerations on how to use media in RTP, packetization guidelines, and other transport considerations. The use of ECN for RTP sessions is also described for speech in this clause.
The general handling of bitrate variations is described in clause 184.108.40.206. Media specific handling is described in clause 220.127.116.11 for video.
This clause describes how the speech media should be packetized during a session. It includes definitions for the cases where the access type is known, and one default operation for the case where the access type is not known.
Requirements for transmission of DTMF events are described in Annex G.
When the radio access bearer technology is not known to the MTSI client, the default encapsulation parameters defined in Table 7.1 shall be used.
The codec modes and the other codec parameters (mode-change-capability, mode-change-period, mode-change-neighbor, etc.), applicable for each session, are negotiated as described in clauses 18.104.22.168 and 22.214.171.124.
When transmitting AMR or AMR-WB encoded media, codec mode changes should be aligned to every other frame border and should be performed to one of the neighbouring codec modes in the negotiated mode set, except for an MTSI media gateway, see clause 126.96.36.199. In the transmitted media, the highest codec mode of the negotiated mode-set (or of all modes, if no mode-set was included in the SDP answer) should be used, unless it is restricted by the most recently received CMR. In the received media, codec mode changes shall be accepted at any frame border and to any codec mode within the negotiated mode set.
The bandwidth-efficient payload format should be used for AMR and AMR-WB encoded media unless the session setup determines that the octet-aligned payload format must be used.
The adaptation of codec mode, aggregation and redundancy is defined in clause 10.2.
For AMR and AMR-WB, if the highest mode within the negotiated mode-set is acceptable for media reception, then the MTSI client in terminal shall either indicate that no codec mode request is present (i.e. value 15) or shall indicate the CMR value corresponding to the highest mode within the negotiated mode-set in the CMR bits in the AMR and/or AMR-WB payload format in every outgoing RTP packet. Otherwise the highest acceptable mode within the negotiated mode-set shall be sent in CMR in each outgoing RTP packet.
For AMR and AMR-WB, the MTSI client shall accept that the remote party sends with lower modes within the negotiated mode-set than requested by the CMR. The MTSI client shall follow each received CMR and shall not use higher modes in media-encoding than indicated by the most recently received CMR, while lower modes within the negotiated mode-set are allowed any time.
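The sender-side rule above (use the highest negotiated mode unless the most recently received CMR restricts it) can be sketched as follows. This is a minimal illustration, not a normative procedure: the function name, the representation of codec modes as integer mode indices, and the fallback to the lowest negotiated mode when the CMR is below every negotiated mode are assumptions made for the example. The sketch does not model mode-change-period or mode-change-neighbor restrictions.

```python
NO_REQUEST = 15  # CMR value meaning "no codec mode request present"


def select_send_mode(mode_set, last_cmr=None):
    """Pick the codec mode to encode with (hypothetical helper).

    mode_set: negotiated codec mode indices (higher index = higher rate).
    last_cmr: most recently received CMR value, or None if none received.
    """
    modes = sorted(set(mode_set))
    if not modes:
        raise ValueError("mode_set must not be empty")
    if last_cmr is None or last_cmr == NO_REQUEST:
        # No restriction: use the highest negotiated mode.
        return modes[-1]
    # Never exceed the requested mode; lower negotiated modes are allowed.
    allowed = [m for m in modes if m <= last_cmr]
    # Assumption: if the CMR is below all negotiated modes, use the lowest one.
    return allowed[-1] if allowed else modes[0]
```

For example, with a negotiated mode-set of modes 0, 2, 4 and 7, a received CMR of 3 limits the sender to mode 2, while a CMR of 15 leaves mode 7 in use.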
The MTSI client shall accept Codec Mode Requests signalled with the CMR bits in the AMR and/or AMR-WB payload format in every incoming RTP packet.
For EVS, the CMR related procedures in subclause A.188.8.131.52 of TS 26.445 apply.
If the MTSI client supports RTCP-APP packets, it shall also accept CMR in every incoming RTCP-APP packet.
The MTSI client shall follow each received CMR as soon as possible.
The MTSI client should send one speech frame encapsulated in each RTP packet unless the session setup or adaptation request defines that the other MTSI client wants to receive another encapsulation variant.
The MTSI client should request to receive one speech frame encapsulated in each RTP packet but shall accept any number of frames per RTP packet up to the maximum limit of 12 speech frames per RTP packet.
For application-layer redundancy, see clause 9.2.
Use default operation as defined in clause 184.108.40.206.2, except that the MTSI client in terminal:
should send 0, 1, 2, 3 or 4 non-redundant speech frames encapsulated in each RTP packet unless the session setup or adaptation request defines that the other PS end-point wants to receive another encapsulation variant;
should request receiving 1 to 4 speech frames in each RTP packet but shall accept any number of frames per RTP packet up to the maximum limit of 12 speech frames per RTP packet;
may use application layer redundancy, in which case the MTSI client in terminal may encapsulate up to 12 speech frames in each RTP packet, with a maximum of four non-redundant speech frames.
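The frame-count limits in the list above can be expressed as a simple check. This is a sketch only: the function name and the split of a packet into "non-redundant" and "redundant" frame counts are assumptions made for the example, and the check covers only the numeric limits stated above, not any negotiated encapsulation parameters.

```python
MAX_FRAMES_PER_PACKET = 12   # upper limit on speech frames per RTP packet
MAX_NON_REDUNDANT_FRAMES = 4  # upper limit on non-redundant frames when sending


def within_send_limits(non_redundant, redundant=0):
    """Return True if a packet layout respects the limits above (hypothetical helper)."""
    if non_redundant < 0 or redundant < 0:
        return False
    if non_redundant > MAX_NON_REDUNDANT_FRAMES:
        return False
    return non_redundant + redundant <= MAX_FRAMES_PER_PACKET
```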
To avoid congestion on the link and to improve inter-working with CS GERAN when AMR or AMR-WB is used and when more than one codec mode is allowed in the session, the MTSI client in terminal should limit the initial codec mode (ICM) to one of the lowest codec modes for an Initial Waiting Time from the beginning of the RTP stream, or until it receives one of the following:
a frame-block with rate control information; or:
an RTCP message with rate control information; or:
reception quality feedback information, e.g. PLR or jitter in RTCP Sender Reports or Receiver Reports, indicating that the currently used codec mode is too high for the current operating condition.
The value for the Initial Waiting Time is 600 ms when ECN is not used and 500 ms when ECN is used, unless configured differently by the MTSI Media Adaptation Management as described in clause 17.
The rate control information can either be: a CMR with a value other than '15' in the RTP payload; or a CMR with a value other than '15' in an RTCP-APP message (see clause 10.2.1).
If no rate control information is received within the Initial Waiting Time, then the sending MTSI client in terminal should gradually increase the codec mode from the ICM towards the highest codec mode allowed in the session. While not detecting poor transmission performance or not receiving rate control information, the sending MTSI client in terminal should use step-wise up-switch to avoid introducing congestion during the upwards adaptation. The step-wise up-switch should be performed by switching to the next higher codec mode in the allowed mode set and then waiting for an Initial Up-switch Waiting Time before each subsequent up-switch until the first down-switch occurs.
The value for the Initial Up-switch Waiting Time is 600 ms when ECN is not used and 500 ms when ECN is used, unless configured differently by the MTSI Media Adaptation Management as described in clause 17.
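The step-wise up-switch described above can be illustrated with a small schedule calculation. It is a sketch under stated assumptions: the function name and parameters are hypothetical, codec modes are represented as integer mode indices, and the schedule covers only the case where no rate control information and no indication of poor transmission performance is received. In a real client any such event would interrupt the schedule.

```python
def upswitch_schedule(mode_set, icm, ecn_used=False,
                      initial_wait_ms=None, upswitch_wait_ms=None):
    """Return [(time_ms, mode), ...] for an uninterrupted step-wise up-switch.

    Defaults follow the values above: 600 ms without ECN, 500 ms with ECN.
    Explicit values model a configuration by Media Adaptation Management.
    """
    default_ms = 500 if ecn_used else 600
    if initial_wait_ms is None:
        initial_wait_ms = default_ms
    if upswitch_wait_ms is None:
        upswitch_wait_ms = default_ms

    modes = sorted(set(mode_set))
    start = modes.index(icm)  # raises ValueError if icm is not in the mode-set

    schedule = [(0, icm)]     # the stream starts at the ICM
    t = initial_wait_ms       # first up-switch after the Initial Waiting Time
    for mode in modes[start + 1:]:
        schedule.append((t, mode))
        t += upswitch_wait_ms  # wait before each subsequent up-switch
    return schedule
```

With a mode-set of modes 0, 2, 4 and 7, an ICM of mode 2 and no ECN, the sender would move to mode 4 at 600 ms and to mode 7 at 1200 ms.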
The following rules can be used for determining the ICM:
If 1 codec mode is included in the mode-set then this should be the ICM.
If 2 or 3 codec modes are included in the mode-set then the ICM should be the codec mode with the lowest rate.
If 4 or more codec modes are included in the mode-set then the ICM should be the codec mode with the 2nd lowest rate.
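The three rules above map directly onto a small selection function. The sketch below assumes that codec modes are given as integer mode indices where a higher index means a higher rate; the function name is hypothetical.

```python
def initial_codec_mode(mode_set):
    """Select the ICM from the negotiated mode-set using the rules above."""
    modes = sorted(set(mode_set))
    if not modes:
        raise ValueError("mode_set must not be empty")
    if len(modes) <= 3:
        # 1 mode: that mode. 2 or 3 modes: the lowest-rate mode.
        return modes[0]
    # 4 or more modes: the second-lowest-rate mode.
    return modes[1]
```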
When the EVS AMR-WB IO mode is used from the start of the session, the Initial Codec Mode (ICM) should be selected as defined in clause 220.127.116.11.6 for AMR-WB.
When EVS Primary mode is used from the start of the session, the following principles apply for the selection of the Initial Codec Mode bit-rate (ICMbr):
If GBR is known and if GBR is less than MBR, the ICMbr should be aligned with the GBR or should be lower than GBR.
When EVS Primary mode is used from the start of the session, the Initial Codec Mode audio bandwidth (ICMab) should be the highest audio bandwidth negotiated for the Initial Codec Mode bit-rate (ICMbr).
An MTSI client may support dual-mono operation for EVS.
The packetization of dual-mono for EVS is described in . When the EVS Primary mode is used for dual-mono encoding, the Header-full format must be used for all RTP packets.
When offering dual-mono for an RTP payload type number, the number of channels is set to 2, see SDP example in Annex A.14.
An MTSI client should follow general strategies for error-resilient coding (segmentation) and packetization as specified by each codec and RTP payload format specification. Further guidelines on how the video media data should be packetized during a session are provided in this clause.
Coded pictures should be encoded into individual segments:
For H.264 (AVC), a slice corresponds to such a segment.
For H.265 (HEVC), a slice segment corresponds to such a segment.
Each individual segment should be encapsulated in one RTP packet. Each RTP packet should be smaller than the Maximum Transfer Unit (MTU) size.
Real-time text is intended for human conversation applications. Text shall not be transferred with higher rate than 30 characters per second (as defined for cps in Section 6 of RFC 4103). A text-capable MTSI client shall be able to receive text with cps set up to 30.
RTCP SR shall be used for media synchronization by setting the NTP and RTP timestamps according to RFC 3550. To enable quick media synchronization when a new media component is added, or an MTSI session is initiated, the RTP sender should send RTCP Sender Reports for all newly started media components as early as possible.
The media synchronization requirements for real-time text are relaxed. A synchronization error between text and other media of a maximum of 3 seconds is accepted. Since this is longer than the maximum accepted latency, no specific method needs to be applied to ensure that the requirement is met.
Once the ECN negotiation has been completed as defined in , then only ECT(0) shall be used when marking packets with ECT. When ECN is used for an RTP stream, the sending MTSI client shall mark every packet with ECT until the end of the session or until the session is re-negotiated to no longer use ECN. The leap-of-faith method is used for the ECN initiation.
Handling of ECN Congestion Experience (ECN-CE) marked packets is described in clause 10.
An MTSI client in terminal using variable bitrate encoding shall ensure for speech, and should ensure for video, that the transmitted bandwidth does not exceed the negotiated bandwidth (b=AS) nor the QoS bandwidth parameters (MBR, GBR) for the bearer, if defined and known. This can, in general, be done in two ways: either by controlling the bit-rate used for encoding each media frame or by controlling when the generated packets are transmitted.
The method to control the encoding bitrate and/or the transmission of packets is left for the discretion of the implementation. However, the average bandwidth of transmitted packets shall be calculated over a sliding window that is no longer than T seconds. Different media types may use different window lengths. The default window length T is 2 seconds if nothing else is specified below.
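One way to compute the average bandwidth over a sliding window is sketched below. The control method itself is implementation-specific, so this is only an illustration: the class and method names are hypothetical, and the example assumes that each recorded packet size already includes whatever header overhead the bandwidth limit is defined against.

```python
from collections import deque


class SlidingWindowRate:
    """Average transmitted bandwidth over the last window_s seconds."""

    def __init__(self, window_s=2.0):  # default window length T = 2 seconds
        self.window_s = window_s
        self._packets = deque()  # (timestamp_s, size_bytes), oldest first
        self._bytes = 0

    def _expire(self, now_s):
        # Drop packets that have fallen out of the window.
        while self._packets and self._packets[0][0] <= now_s - self.window_s:
            _, size = self._packets.popleft()
            self._bytes -= size

    def add(self, timestamp_s, size_bytes):
        """Record a transmitted packet."""
        self._packets.append((timestamp_s, size_bytes))
        self._bytes += size_bytes
        self._expire(timestamp_s)

    def kbps(self, now_s):
        """Average bandwidth in kbps over the window ending at now_s."""
        self._expire(now_s)
        return self._bytes * 8 / self.window_s / 1000.0
```

A sender could compare the value returned by `kbps` with the maximum allowed bandwidth before transmitting or encoding the next frame.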
The bitrate control ensures that the average bandwidth of transmitted media packets does not exceed the maximum allowed bandwidth in the sending direction.
The maximum allowed bandwidth for the sending direction is the smaller of:
The b=AS bandwidth,
The MBR for the bearer, after compensating for the RTCP bandwidth, if known.
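The selection of the maximum allowed bandwidth can be sketched as follows. The function and parameter names are hypothetical, and the example assumes that "compensating for the RTCP bandwidth" means subtracting the RTCP bandwidth from the MBR, with all values in kbps.

```python
def max_send_bandwidth_kbps(b_as_kbps, mbr_kbps=None, rtcp_kbps=0.0):
    """Return the smaller of b=AS and the RTCP-compensated MBR.

    mbr_kbps is None when the MBR for the bearer is not known, in which
    case only the b=AS bandwidth applies.
    """
    if mbr_kbps is None:
        return b_as_kbps
    # Assumption: compensation is a plain subtraction of the RTCP bandwidth.
    return min(b_as_kbps, mbr_kbps - rtcp_kbps)
```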
Video encoders use variable bitrate encoding to ensure a sufficiently low average bit-rate, which allows temporarily encoding a few frames at high bit-rate to match the video content with high motion activity or complex scenes, or to send an Intra frame, e.g., when a Full Intra Request (FIR) is received. The bitrate control should ensure that the average bandwidth of transmitted media packets does not exceed the maximum allowed bandwidth in the sending direction.
The bit-rate control requires managing the following properties:
The size of the large frame to transmit;
The proportion of neighboring frames that can be reduced in size;
The length of averaging window.
When any two of the properties are known, it is possible to determine the third.
TR 26.924 Annex A describes a method for determining the length of the averaging window from the other two properties. The same method can also be used to determine the portion of the bit-rate for frames neighboring the large frame, which needs to be reduced depending on the size of the frame and the length of the averaging window.
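The relationship between the three properties can be illustrated with a deliberately simplified budget model. This is not the method of TR 26.924 Annex A; it is an assumption made for illustration only. The model takes a window of N frames with a target average of A bits per frame, one large frame of S bits, and the remaining N - 1 frames each reduced by the same fraction r, so that S + (N - 1)(1 - r)A = N * A. All function names are hypothetical.

```python
import math


def neighbor_reduction(large_bits, avg_bits, window_frames):
    """Fraction r by which each neighboring frame must shrink (simplified model)."""
    if window_frames < 2:
        raise ValueError("window must contain at least two frames")
    return (large_bits - avg_bits) / ((window_frames - 1) * avg_bits)


def window_frames_needed(large_bits, avg_bits, reduction):
    """Smallest window length N (in frames) that absorbs the large frame."""
    if reduction <= 0:
        raise ValueError("reduction must be positive")
    return 1 + math.ceil((large_bits - avg_bits) / (reduction * avg_bits))
```

Under this model, a 100 000-bit frame with a 20 000-bit average and neighbors reduced by 25% needs a window of 17 frames; conversely, a 17-frame window requires a 25% reduction of the neighboring frames.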
The recommended procedure consists of reducing the encoding bit-rate of the frames neighboring the large frame in a balanced fashion, taking into account the spatio-temporal tradeoff of video quality when the encoding parameters such as quantization step or frame rate are adjusted. This may even require dropping some frames before they are encoded.
When a FIR is received, the MTSI client in terminal may delay the generation and transmission of an Intra frame by up to a half of the averaging window.
When speech is operated in a Source-Controlled Variable Bit Rate mode (e.g., EVS 5.9VBR), both the GBR and MBR need to be set at least as high as the highest rate used by the codec in this mode (e.g., 8.0 kbps for EVS 5.9VBR). These rates may be set higher for a session if, in addition to the VBR mode, a higher rate codec mode is also negotiated for the session. Therefore, packet policing on the MBR or GBR will not prevent the transport of VBR media packets regardless of the averaging window used.