Internet Engineering Task Force (IETF) J. Lazzaro
Request for Comments: 6295 J. Wawrzynek
Obsoletes: 4695 UC Berkeley
Category: Standards Track June 2011
RTP Payload Format for MIDI
This memo describes a Real-time Transport Protocol (RTP) payload
format for the MIDI (Musical Instrument Digital Interface) command
language. The format encodes all commands that may legally appear on
a MIDI 1.0 DIN cable. The format is suitable for interactive
applications (such as network musical performance) and content-
delivery applications (such as file streaming). The format may be
used over unicast and multicast UDP and TCP, and it defines tools for
graceful recovery from packet loss. Stream behavior, including the
MIDI rendering method, may be customized during session setup. The
format also serves as a mode for the mpeg4-generic format, to support
the MPEG 4 Audio Object Types for General MIDI, Downloadable Sounds
Level 2, and Structured Audio. This document obsoletes RFC 4695.
Status of This Memo
This is an Internet Standards Track document.
This document is a product of the Internet Engineering Task Force
(IETF). It represents the consensus of the IETF community. It has
received public review and has been approved for publication by the
Internet Engineering Steering Group (IESG). Further information on
Internet Standards is available in Section 2 of RFC 5741.
Information about the current status of this document, any errata,
and how to provide feedback on it may be obtained at
Copyright (c) 2011 IETF Trust and the persons identified as the
document authors. All rights reserved.
This document is subject to BCP 78 and the IETF Trust's Legal
Provisions Relating to IETF Documents
(http://trustee.ietf.org/license-info) in effect on the date of
publication of this document. Please review these documents
carefully, as they describe your rights and restrictions with respect
to this document. Code Components extracted from this document must
include Simplified BSD License text as described in Section 4.e of
the Trust Legal Provisions and are provided without warranty as
described in the Simplified BSD License.
Table of Contents
   1. Introduction
      1.1. Terminology
      1.2. Bitfield Conventions
   2. Packet Format
      2.1. RTP Header
      2.2. MIDI Payload
   3. MIDI Command Section
      3.1. Timestamps
      3.2. Command Coding
   4. The Recovery Journal System
   5. Recovery Journal Format
   6. Session Description Protocol
      6.1. Session Descriptions for Native Streams
      6.2. Session Descriptions for mpeg4-generic Streams
      6.3. Parameters
   7. Extensibility
   8. Congestion Control
   9. Security Considerations
   10. Acknowledgements
   11. IANA Considerations
      11.1. rtp-midi Media Type Registration
         11.1.1. Repository Request for audio/rtp-midi
      11.2. mpeg4-generic Media Type Registration
         11.2.1. Repository Request for Mode rtp-midi for mpeg4-generic
      11.3. asc Media Type Registration
   12. Changes from RFC 4695
   Appendix A. The Recovery Journal Channel Chapters
      A.1. Recovery Journal Definitions
      A.2. Chapter P: MIDI Program Change
      A.3. Chapter C: MIDI Control Change
         A.3.1. Log Inclusion Rules
         A.3.2. Controller Log Format
         A.3.3. Log List Coding Rules
         A.3.4. The Parameter System
      A.4. Chapter M: MIDI Parameter System
         A.4.1. Log Inclusion Rules
         A.4.2. Log Coding Rules
            A.4.2.1. The Value Tool
            A.4.2.2. The Count Tool
      A.5. Chapter W: MIDI Pitch Wheel
         C.7.1. MIDI Content-Streaming Applications
         C.7.2. MIDI Network Musical Performance Applications
   Appendix D. Parameter Syntax Definitions
   Appendix E. A MIDI Overview for Networking Specialists
      E.1. Commands Types
      E.2. Running Status
      E.3. Command Timing
      E.4. AudioSpecificConfig Templates for MMA Renderers
   Normative References
   Informative References
1. Introduction
This document obsoletes [RFC4695].
The Internet Engineering Task Force (IETF) has developed a set of
focused tools for multimedia networking ([RFC3550] [RFC4566]
[RFC3261] [RFC2326]). These tools can be combined in different ways
to support a variety of real-time applications over Internet Protocol
(IP) networks.
For example, a telephony application might use the Session Initiation
Protocol (SIP, [RFC3261]) to set up a phone call. Call setup would
include negotiations to agree on a common audio codec [RFC3264].
Negotiations would use the Session Description Protocol (SDP,
[RFC4566]) to describe candidate codecs.
After a call is set up, audio data would flow between the parties
using the Real-time Transport Protocol (RTP, [RFC3550]) under any applicable
profile (for example, the Audio/Visual Profile (AVP, [RFC3551])).
The tools used in this telephony example (SIP, SDP, and RTP) might be
combined in a different way to support a content-streaming
application, perhaps in conjunction with other tools, such as the
Real Time Streaming Protocol (RTSP, [RFC2326]).
The MIDI (Musical Instrument Digital Interface) command language
[MIDI] is widely used in musical applications that are analogous to
the examples described above. On stage and in the recording studio,
MIDI is used for the interactive remote control of musical
instruments, an application similar in spirit to telephony. On web
pages, Standard MIDI Files (SMFs, [MIDI]) rendered using the General
MIDI standard [MIDI] provide a low-bandwidth substitute for audio
streaming.
[RFC4695] was motivated by a simple premise: if MIDI performances
could be sent as RTP streams that are managed by IETF session tools,
a hybridization of the MIDI and IETF application domains might occur.
For example, interoperable MIDI networking might foster network music
performance applications, in which a group of musicians located at
different physical locations interact over a network to perform as
they would if they were located in the same room [NMP]. As a second
example, the streaming community might begin to use MIDI for low-
bitrate audio coding, perhaps in conjunction with normative sound-
synthesis methods [MPEGSA].
Five years after [RFC4695], these applications have not yet reached
the mainstream. However, experiments in academia and industry
continue. This memo, which obsoletes [RFC4695] and fixes minor
errata (see Section 12), has been written in service of these
experiments.
To enable MIDI applications to use RTP, this memo defines an RTP
payload format and its media type. Sections 2-5 and Appendices A and
B define the RTP payload format. Section 6 and Appendices C and D
define the media types identifying the payload format, the parameters
needed for configuration, and the utilization of the parameters in
Appendix C also includes interoperability guidelines for the example
applications described above: network musical performance using SIP
(Appendix C.7.2) and content streaming using RTSP (Appendix C.7.1).
Another potential application area for RTP MIDI is MIDI networking
for professional audio equipment and electronic musical instruments.
We do not offer interoperability guidelines for this application in
this memo. However, RTP MIDI has been designed with stage and studio
applications in mind, and we expect that efforts to define a stage
and studio framework will rely on RTP MIDI for MIDI transport
Some applications may require MIDI media delivery at a certain
service quality level (latency, jitter, packet loss, etc.). RTP
itself does not provide service guarantees. However, applications
may use lower-layer network protocols to configure the quality of the
transport services that RTP uses. These protocols may act to reserve
network resources for RTP flows [RFC2205] or may simply direct RTP
traffic onto a dedicated "media network" in a local installation.
Note that RTP and the MIDI payload format do provide tools that
applications may use to achieve the best possible real-time
performance at a given service level.
This memo normatively defines the syntax and semantics of the MIDI
payload format. However, this memo does not define algorithms for
sending and receiving packets. An ancillary document [RFC4696]
provides informative guidance on algorithms. Supplemental
information may be found in related conference publications [NMP]
Throughout this memo, the phrase "native stream" refers to a stream
that uses the rtp-midi media type. The phrase "mpeg4-generic stream"
refers to a stream that uses the mpeg4-generic media type (in mode
rtp-midi) to operate in an MPEG 4 environment [RFC3640]. Section 6
describes this distinction in detail.
1.1. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in BCP 14, RFC 2119
1.2. Bitfield Conventions
Several bitfield coding idioms are used in this document. As most of
these idioms only appear in Appendices A and B, we define them in
However, a few of these idioms also appear in the main text of this
document. For convenience, we describe them below:
o R flag bit. R flag bits are reserved for future use. Senders
MUST set R bits to 0. Receivers MUST ignore R bit values.
o LENGTH field. All fields named LENGTH (as distinct from LEN) code
the number of octets in the structure that contains it, including
the header it resides in and all hierarchical levels below it. If
a structure contains a LENGTH field, a receiver MUST use the
LENGTH field value to advance past the structure during parsing,
rather than use knowledge about the internal format of the
structure.
2. Packet Format
In this section, we introduce the format of RTP MIDI packets. The
description includes some background information on RTP for the
benefit of MIDI implementors new to IETF tools. Implementors should
consult [RFC3550] for an authoritative description of RTP.
This memo assumes that the reader is familiar with MIDI syntax and
semantics. Appendix E provides a MIDI overview, at a level of detail
sufficient to understand most of this memo. Implementors should
consult [MIDI] for an authoritative description of MIDI.
The MIDI payload format maps a MIDI command stream (16 voice channels
+ systems) onto an RTP stream. An RTP media stream is a sequence of
logical packets that share a common format. Each packet consists of
two parts: the RTP header and the MIDI payload. Figure 1 shows this
format (vertical space delineates the header and payload).
We describe RTP packets as "logical" packets to highlight the fact
that RTP itself is not a network-layer protocol. Instead, RTP
packets are mapped onto network protocols (such as unicast UDP,
multicast UDP, or TCP) by an application [ALF]. The interleaved mode
of the Real Time Streaming Protocol (RTSP, [RFC2326]) is an example
of an RTP mapping to TCP transport, as is [RFC4571].
2.1. RTP Header
[RFC3550] provides a complete description of the RTP header fields.
In this section, we clarify the role of a few RTP header fields for
MIDI applications. All fields are coded in network byte order
(big-endian).
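    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | V |P|X|  CC   |M|     PT      |        Sequence number        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             SSRC                              |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     MIDI command section ...                  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       Journal section ...                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 1 -- Packet Format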
The behavior of the 1-bit M field depends on the media type of the
stream. For native streams, the M bit MUST be set to 1 if the MIDI
command section has a non-zero LEN field and MUST be set to 0
otherwise. For mpeg4-generic streams, the M bit MUST be set to 1 for
all packets (to conform to [RFC3640]).
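As an illustration of the M bit rule, the following minimal sketch
packs a fixed 12-octet RTP header in network byte order. It assumes a
header with no padding, no extension, and no CSRC entries, and the
version value 2 defined in [RFC3550]. The function name and its
parameters are hypothetical and are not defined by this memo.

```python
import struct


def pack_rtp_header(pt, seq, timestamp, ssrc, midi_len, native=True):
    """Pack a 12-octet RTP header (assumes P=0, X=0, CC=0).

    midi_len is the LEN value of the MIDI command section.
    native selects the rtp-midi rule; False selects mpeg4-generic.
    """
    # Native streams set M only when LEN is non-zero;
    # mpeg4-generic streams set M on every packet.
    marker = 1 if (not native or midi_len > 0) else 0
    byte0 = 2 << 6                        # V=2, P=0, X=0, CC=0
    byte1 = (marker << 7) | (pt & 0x7F)   # M bit, 7-bit payload type
    return struct.pack("!BBHII", byte0, byte1,
                       seq & 0xFFFF,
                       timestamp & 0xFFFFFFFF,
                       ssrc & 0xFFFFFFFF)
```

The "!" prefix in the format string selects big-endian packing, which
matches the network byte order used for all header fields.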
In an RTP MIDI stream, the 16-bit sequence number field is
initialized to a randomly chosen value and is incremented by one
(modulo 2^16) for each packet sent in the stream. A related
quantity, the 32-bit extended packet sequence number, may be computed
by tracking rollovers of the 16-bit sequence number. Note that
different receivers of the same stream may compute different extended
packet sequence numbers, depending on when the receiver joined the
session.
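Rollover tracking might be sketched as follows. This is a simplified
illustration, not an algorithm defined by this memo: it assumes
packets arrive in order (or nearly so) and does not handle late
packets that straddle a rollover. The class and method names are
hypothetical.

```python
class ExtendedSeq:
    """Derive a 32-bit extended sequence number from 16-bit values."""

    def __init__(self):
        self.cycles = 0     # rollovers observed since this receiver joined
        self.last = None    # most recent 16-bit sequence number

    def update(self, seq):
        # A large backwards jump is interpreted as a rollover.
        if (self.last is not None and seq < self.last
                and self.last - seq > 0x8000):
            self.cycles += 1
        self.last = seq
        return ((self.cycles << 16) | seq) & 0xFFFFFFFF
```

Because the rollover count starts at zero when the receiver first sees
the stream, two receivers that join at different times can derive
different extended values for the same packet.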
The 32-bit timestamp field sets the base timestamp value for the
packet. The payload codes MIDI command timing relative to this
value. The timestamp units are set by the clock rate parameter. For
example, if the clock rate has a value of 44100 Hz, two packets whose
base timestamp values differ by 2 seconds have RTP timestamp fields
that differ by 88200.
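The arithmetic in this example can be sketched as below. The helper
names are hypothetical; the second helper shows the modulo 2^32
subtraction needed when the 32-bit timestamp field wraps.

```python
def seconds_to_ticks(seconds, clock_rate):
    """Convert a duration in seconds to RTP timestamp units."""
    return round(seconds * clock_rate)


def timestamp_diff(later, earlier):
    """Difference between two RTP timestamps, modulo 2^32."""
    return (later - earlier) & 0xFFFFFFFF
```

For instance, seconds_to_ticks(2, 44100) yields 88200, the value given
in the text.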
Note that the clock rate parameter is not encoded within each RTP
MIDI packet. A receiver of an RTP MIDI stream becomes aware of the
clock rate as part of the session setup process. For example, if a
session management tool uses the Session Description Protocol (SDP,
[RFC4566]) to describe a media session, the clock rate parameter is
set using the rtpmap attribute. We show examples of session setup in
For RTP MIDI streams destined to be rendered into audio, the clock
rate SHOULD be an audio sample rate of 32 KHz or higher. This
recommendation is due to the sensitivity of human musical perception
to small timing errors in musical note sequences and due to the
timbral changes that occur when two near-simultaneous MIDI NoteOns
are rendered with a different timing than that desired by the content
author due to clock rate quantization. RTP MIDI streams that are not
destined for audio rendering (such as MIDI streams that control stage
lighting) MAY use a lower clock rate but SHOULD use a clock rate high
enough to avoid timing artifacts in the application.
For RTP MIDI streams destined to be rendered into audio, the clock
rate SHOULD be chosen from rates in common use in professional audio
applications or in consumer audio distribution. At the time of this
writing, these rates include 32 KHz, 44.1 KHz, 48 KHz, 64 KHz, 88.2
KHz, 96 KHz, 176.4 KHz, and 192 KHz. If the RTP MIDI session is a
part of a synchronized media session that includes another (non-MIDI)
RTP audio stream with a clock rate of 32 KHz or higher, the RTP MIDI
stream SHOULD use a clock rate that matches the clock rate of the
other audio stream. However, if the RTP MIDI stream is destined to
be rendered into audio, the RTP MIDI stream SHOULD NOT use a clock
rate lower than 32 KHz, even if this second stream has a clock rate
lower than 32 KHz.
Timestamps of consecutive packets do not necessarily increment at a
fixed rate because RTP MIDI packets are not necessarily sent at a
fixed rate. The degree of packet transmission regularity reflects
the underlying application dynamics. Interactive applications may
vary the packet-sending rate to track the gestural rate of a human
performer, whereas content-streaming applications may send packets at
a fixed rate.
Therefore, the timestamps for two sequential RTP packets may be
identical, or the second packet may have a timestamp arbitrarily
larger than the first packet (modulo 2^32). Section 3 places
additional restrictions on the RTP timestamps for two sequential RTP
packets, as does the guardtime parameter (Appendix C.4.2).
We use the term "media time" to denote the temporal duration of the
media coded by an RTP packet. The media time coded by a packet is
computed by subtracting the RTP timestamp from the last command
timestamp in the MIDI command section (modulo 2^32). If the MIDI
list of the MIDI command section of a packet is empty, the media time
coded by the packet is 0 ms. Appendix C.4.1 discusses media time
issues in detail.
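A minimal sketch of this computation follows. It takes media time to
be the span from the RTP timestamp to the last command timestamp,
which is our reading of the definition, and it uses None to represent
an empty MIDI list. The function name and parameters are
hypothetical.

```python
def media_time_ms(rtp_timestamp, last_command_timestamp, clock_rate):
    """Media time coded by a packet, in milliseconds."""
    if last_command_timestamp is None:    # empty MIDI list
        return 0.0
    # Subtract modulo 2^32 so that timestamp wraparound is handled.
    ticks = (last_command_timestamp - rtp_timestamp) & 0xFFFFFFFF
    return 1000.0 * ticks / clock_rate
```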
We now define RTP session semantics, in the context of sessions
specified using the Session Description Protocol [RFC4566]. A
session description media line ("m=") specifies an RTP session. An
RTP session has an independent space of 2^32 synchronization sources.
Synchronization source identifiers are coded in the SSRC header field
of RTP session packets. The payload types that may appear in the PT
header field of RTP session packets are listed at the end of the
Several RTP MIDI streams may appear in an RTP session. Each stream
is distinguished by a unique SSRC value and has a unique sequence
number and RTP timestamp space. Multiple streams in the RTP session
may be sent by a single party. Multiple parties may send streams in
the RTP session. An RTP MIDI stream encodes data for a single MIDI
command name space (16 voice channels + systems).
Streams in an RTP session may use different payload types or they may
use the same payload type. However, each party may send, at most,
one RTP MIDI stream for each payload type mapped to an RTP MIDI
payload format in an RTP session. Recall that dynamic binding of
payload type numbers in [RFC4566] lets a party map many payload type
numbers to the RTP MIDI payload format; thus, a party may send many
RTP MIDI streams in a single RTP session. Pairs of streams (unicast
or multicast) that communicate between two parties in an RTP session
and that share a payload type have the same association as a MIDI
cable pair that cross-connects two devices in a MIDI 1.0 DIN network.
The RTP session architecture described above is efficient in its use
of network ports, as one RTP session (using a port pair per party)
supports the transport of many MIDI name spaces (16 MIDI channels +
systems). We define tools for grouping and labelling MIDI name
spaces across streams and sessions in Appendix C.5 of this memo.
The RTP header timestamps for each stream in an RTP session have
separately and randomly chosen initialization values. Receivers use
the timing fields encoded in the RTP Control Protocol (RTCP,
[RFC3550]) sender reports to synchronize the streams sent by a party.
The SSRC values for each stream in an RTP session are also separately
and randomly chosen, as described in [RFC3550]. Receivers use the
CNAME field encoded in RTCP sender reports to verify that streams
were sent by the same party and to detect SSRC collisions, as
described in [RFC3550].
In some applications, a receiver renders MIDI commands into audio (or
into control actions, such as the rewind of a tape deck or the
dimming of stage lights). In other applications, a receiver presents
a MIDI stream to software programs via an Application Programming
Interface (API). Appendix C.6 defines session configuration tools to
specify what receivers should do with a MIDI command stream.
If a multimedia session uses different RTP MIDI streams to send
different classes of media, the streams MUST be sent over different
RTP sessions. For example, if a multimedia session uses one MIDI
stream for audio and a second MIDI stream to control a lighting
system, the audio and lighting streams MUST be sent over different
RTP sessions, each with its own media line.
Session description tools defined in Appendix C.5 let a sending party
split a single MIDI name space (16 voice channels + systems) over
several RTP MIDI streams. Split transport of a MIDI command stream
is a delicate task because correct command stream reconstruction by a
receiver depends on exact timing synchronization across the streams.
To support split name spaces, we define the following requirements:
o A party MUST NOT send several RTP MIDI streams that share a MIDI
name space in the same RTP session. Instead, each stream MUST be
sent from a different RTP session.
o If several RTP MIDI streams sent by a party share a MIDI name
space, all streams MUST use the same SSRC value and MUST use the
same randomly chosen RTP timestamp initialization value.
These rules let a receiver identify streams that share a MIDI name
space (by matching SSRC values) and also let a receiver accurately
reconstruct the source MIDI command stream (by using RTP timestamps
to interleave commands from the two streams). Care MUST be taken by
senders to ensure that SSRC changes due to collisions are reflected
in both streams. Receivers MUST regularly examine the RTCP CNAME
fields associated with the linked streams to ensure that the assumed
link is legitimate and not the result of an SSRC collision by another
Except for the special cases described above, a party may send many
RTP MIDI streams in the same session. However, it is sometimes
advantageous for two RTP MIDI streams to be sent over different RTP
sessions. For example, two streams may need different values for RTP
session-level attributes (such as the sendonly and recvonly
attributes). As a second example, two RTP sessions may be needed to
send two unicast streams in a multimedia session that originate on
different computers (with different IP addresses). Two RTP sessions
are needed in this case because transport addresses are specified on
the RTP-session or multimedia-session level, not on a payload type
level.
On a final note, in some uses of MIDI, parties send bidirectional
traffic to conduct transactions (such as file exchange). These
commands were designed to work over MIDI 1.0 DIN cable networks and
may be configured in a multicast topology, which uses pure "party-
line" signalling. Thus, if a multimedia session ensures a multicast
connection between all parties, bidirectional MIDI commands will work
without additional support from the RTP MIDI payload format.
2.2. MIDI Payload
The payload (Figure 1) MUST begin with the MIDI command section. The
MIDI command section codes a (possibly empty) list of timestamped
MIDI commands and provides the essential service of the payload
format.
The payload MAY also contain a journal section. The journal section
provides resiliency by coding the recent history of the stream. A
flag in the MIDI command section codes the presence of a journal
section in the payload.
Section 3 defines the MIDI command section. Sections 4 and 5 and
Appendices A and B define the recovery journal, the default format
for the journal section. Here, we describe how these payload
sections operate in a stream in an RTP session.
The journalling method for a stream is set at the start of a session
and MUST NOT be changed thereafter. A stream may be set to use the
recovery journal, to use an alternative journal format (none are
defined in this memo), or not to use a journal.
The default journalling method of a stream is inferred from its
transport type. Streams that use unreliable transport (such as UDP)
default to using the recovery journal. Streams that use reliable
transport (such as TCP) default to not using a journal. Appendix
C.2.1 defines session configuration tools for overriding these
defaults. For all types of transport, a sender MUST transmit an RTP
packet stream with consecutive sequence numbers (modulo 2^16).
If a stream uses the recovery journal, every payload in the stream
MUST include a journal section. If a stream does not use
journalling, a journal section MUST NOT appear in a stream payload.
If a stream uses an alternative journal format, the specification for
the journal format defines an inclusion policy.
If a stream is sent over UDP transport, the Maximum Transmission Unit
(MTU) of the underlying network limits the practical size of the
payload section (for example, an Ethernet MTU is 1500 octets) for
applications where predictable and minimal packet transmission
latency is critical. A sender SHOULD NOT create RTP MIDI UDP packets
whose sizes exceed the MTU of the underlying network. Instead, the
sender SHOULD take steps to keep the maximum packet size under the
MTU.
These steps may take many forms. The default closed-loop recovery
journal sending policy (defined in Appendix C.2.2.2) uses RTP Control
Protocol (RTCP, [RFC3550]) feedback to manage the RTP MIDI packet
size. In addition, Section 3.2 and Appendix B.5.2 provide specific
tools for managing the size of packets that code MIDI System
Exclusive (0xF0) commands. Appendix C.5 defines session
configuration tools that may be used to split a dense MIDI name space
into several UDP streams (each sent in a different RTP session, per
Section 2.1) so that the payload fits comfortably into an MTU.
Another option is to use TCP. Section 4.3 of [RFC4696] provides non-
normative advice for packet size management.
3. MIDI Command Section
Figure 2 shows the format of the MIDI command section.
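    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |B|J|Z|P|LEN... |  MIDI list ...                                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Figure 2 -- MIDI Command Section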
The MIDI command section begins with a variable-length header.
The header field LEN codes the number of octets in the MIDI list that
follow the header. If the header flag B is 0, the header is one
octet long, and LEN is a 4-bit field, supporting a maximum MIDI list
length of 15 octets.
If B is 1, the header is two octets long, and LEN is a 12-bit field,
supporting a maximum MIDI list length of 4095 octets. LEN is coded
in network byte order (big-endian): the 4 bits of LEN that appear in
the first header octet code the most significant 4 bits of the 12-bit
LEN value.
A LEN value of 0 is legal, and it codes an empty MIDI list.
If the J header bit is set to 1, a journal section MUST appear after
the MIDI command section in the payload. If the J header bit is set
to 0, the payload MUST NOT contain a journal section.
We define the semantics of the P header bit in Section 3.2.
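The header layout described above might be parsed as in the following
sketch. The function name and the returned dictionary keys are
hypothetical; the bit positions follow Figure 2, with B as the most
significant bit of the first octet.

```python
def parse_command_section_header(payload):
    """Parse the 1- or 2-octet MIDI command section header."""
    if len(payload) < 1:
        raise ValueError("empty payload")
    first = payload[0]
    b = (first >> 7) & 1
    length = first & 0x0F            # 4-bit LEN, or top 4 bits of 12-bit LEN
    header_len = 1
    if b:
        if len(payload) < 2:
            raise ValueError("truncated two-octet header")
        length = (length << 8) | payload[1]   # big-endian 12-bit LEN
        header_len = 2
    return {
        "B": b,
        "J": (first >> 6) & 1,       # journal section follows if 1
        "Z": (first >> 5) & 1,       # Delta Time 0 present if 1
        "P": (first >> 4) & 1,       # phantom status octet flag
        "LEN": length,               # octets in the MIDI list
        "header_len": header_len,
    }
```

The MIDI list occupies payload[header_len : header_len + LEN]; if J is
1, the journal section begins immediately after it.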
If the LEN header field is nonzero, the MIDI list has the structure
shown in Figure 3.
| Delta Time 0 (1-4 octets long, or 0 octets if Z = 0) |
| MIDI Command 0 (1 or more octets long) |
| Delta Time 1 (1-4 octets long) |
| MIDI Command 1 (1 or more octets long) |
| ... |
| Delta Time N (1-4 octets long) |
| MIDI Command N (0 or more octets long) |
Figure 3 -- MIDI List Structure
If the header flag Z is 1, the MIDI list begins with a complete MIDI
command (coded in the MIDI Command 0 field in Figure 3) preceded by a
delta time (coded in the Delta Time 0 field). If Z is 0, the Delta
Time 0 field is not present in the MIDI list, and the command coded
in the MIDI Command 0 field has an implicit delta time of 0.
The MIDI list structure may also optionally encode a list of N
additional complete MIDI commands, each coded in a MIDI Command K
field. Each additional command MUST be preceded by a Delta Time K
field, which codes the command's delta time. We discuss exceptions
to the "command fields code complete MIDI commands" rule in Section
The final MIDI command field (i.e., the MIDI Command N field, shown
in Figure 3) in the MIDI list MAY be empty. Moreover, a MIDI list
MAY consist of a single delta time (encoded in the Delta Time 0
field) without an associated command (which would have been encoded
in the MIDI Command 0 field). These rules enable MIDI coding
features that are explained in Section 3.1. We delay the
explanations because an understanding of RTP MIDI timestamps is
necessary to describe the features.
3.1. Timestamps
In this section, we describe how RTP MIDI encodes a timestamp for
each MIDI list command. Command timestamps have the same units as
RTP packet header timestamps (described in Section 2.1 and
[RFC3550]). Recall that RTP timestamps have units of seconds, whose
scaling is set during session configuration (see Section 6.1 and
As shown in Figure 3, the MIDI list encodes time using a compact
delta time format. The RTP MIDI delta time syntax is a modified form
of the MIDI File delta time syntax [MIDI]. RTP MIDI delta times use
1-4 octet fields to encode 32-bit unsigned integers. Figure 4 shows
the encoded and decoded forms of delta times. Note that delta time
values may be legally encoded in multiple formats; for example, there
are four legal ways to encode the zero delta time (0x00, 0x8000,
0x808000, 0x80808000).
RTP MIDI uses delta times to encode a timestamp for each MIDI
command. The timestamp for MIDI Command K is the summation (modulo
2^32) of the RTP timestamp and decoded delta times 0 through K. This
cumulative coding technique, borrowed from MIDI File delta time
coding, is efficient because it reduces the number of multi-octet
delta times.
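The delta time coding of Figure 4 and the cumulative timestamp rule
can be sketched as follows. The function names are hypothetical. The
encoder emits the shortest legal form; the decoder accepts any of the
legal forms, including the longer encodings of small values.

```python
def decode_delta_time(buf, pos=0):
    """Decode a 1-4 octet delta time; return (value, next_position)."""
    value = 0
    for i in range(4):
        octet = buf[pos + i]
        value = (value << 7) | (octet & 0x7F)   # 7 payload bits per octet
        if not octet & 0x80:                    # high bit clear: last octet
            return value, pos + i + 1
    raise ValueError("delta time longer than four octets")


def encode_delta_time(value):
    """Encode a delta time using the fewest octets."""
    if not 0 <= value < (1 << 28):
        raise ValueError("delta time does not fit in four octets")
    out = [value & 0x7F]                        # final octet, high bit clear
    value >>= 7
    while value:
        out.append(0x80 | (value & 0x7F))       # continuation octets
        value >>= 7
    return bytes(reversed(out))


def command_timestamps(rtp_timestamp, deltas):
    """Timestamp of command K = RTP timestamp + deltas 0..K (mod 2^32)."""
    t = rtp_timestamp
    stamps = []
    for d in deltas:
        t = (t + d) & 0xFFFFFFFF
        stamps.append(t)
    return stamps
```

Note that four octets carry 28 payload bits, so the largest delta time
that can be coded is 2^28 - 1 even though the decoded form is a 32-bit
unsigned integer.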
All command timestamps in a packet MUST be less than or equal to the
RTP timestamp of the next packet in the stream (modulo 2^32).
This restriction ensures that a particular RTP MIDI packet in a
stream is uniquely responsible for encoding time, starting at the
moment after the RTP timestamp encoded in the RTP packet header and
ending at the moment before the final command timestamp encoded in
the MIDI list. The "moment before" and "moment after" qualifiers
acknowledge the "less than or equal" semantics (as opposed to
"strictly less than") in the sentence above this paragraph.
Note that it is possible to "pad" the end of an RTP MIDI packet with
time that is guaranteed to be void of MIDI commands, by setting the
"Delta Time N" field of the MIDI list to the end of the void time and
by omitting its corresponding "MIDI Command N" field (a syntactic
construction the preamble of Section 3 expressly made legal).
In addition, it is possible to code an RTP MIDI packet to express
that a period of time in the stream is void of MIDI commands. The
RTP timestamp in the header would code the start of the void time.
The MIDI list of this packet would consist of a "Delta Time 0" field
that coded the end of the void time. No other fields would be
present in the MIDI list (a syntactic construction the preamble of
Section 3 also expressly made legal).
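As an illustration, a MIDI command section that codes only a period of
void time might be built as in the sketch below. The Z bit is set so
that the Delta Time 0 field is present, and LEN counts only the delta
time octets, so the one-octet header (B = 0) always suffices. The
function names are hypothetical, and the sketch assumes the void
period is shorter than 2^28 timestamp units.

```python
def encode_delta_time(value):
    """Encode a delta time in 1-4 octets (shortest form)."""
    if not 0 <= value < (1 << 28):
        raise ValueError("delta time does not fit in four octets")
    out = [value & 0x7F]
    value >>= 7
    while value:
        out.append(0x80 | (value & 0x7F))
        value >>= 7
    return bytes(reversed(out))


def void_time_section(rtp_timestamp, void_end_timestamp,
                      journal_follows=False):
    """Command section holding a lone Delta Time 0 and no command."""
    delta = (void_end_timestamp - rtp_timestamp) & 0xFFFFFFFF
    delta_octets = encode_delta_time(delta)
    # Header octet: B=0, J as requested, Z=1, P=0, 4-bit LEN.
    header = (int(journal_follows) << 6) | (1 << 5) | len(delta_octets)
    return bytes([header]) + delta_octets
```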
By default, a command timestamp indicates the execution time for the
command. The difference between two timestamps indicates the time
delay between the execution of the commands. This difference may be
zero, coding simultaneous execution. In this memo, we refer to this
interpretation of timestamps as "comex" (COMmand EXecution)
semantics. We formally define comex semantics in Appendix C.3.
The comex interpretation of timestamps works well for transcoding a
Standard MIDI File (SMF) into an RTP MIDI stream, as SMFs code a
timestamp for each MIDI command stored in the file. To transcode an
SMF that uses metric time markers, use the SMF tempo map (encoded in
the SMF as meta-events) to convert metric SMF timestamp units into
seconds-based RTP timestamp units.
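The conversion might be sketched as follows, assuming an SMF that uses
metrical time, where the file's division value gives ticks per quarter
note and each tempo meta-event gives microseconds per quarter note.
The function name and the tempo_map representation (a sorted list of
(start_tick, microseconds_per_quarter_note) pairs whose first entry
starts at tick 0) are hypothetical.

```python
def smf_ticks_to_rtp(tick, tempo_map, division, clock_rate):
    """Convert an absolute SMF tick to RTP timestamp units."""
    seconds = 0.0
    for i, (start, us_per_quarter) in enumerate(tempo_map):
        if start >= tick:
            break
        # This tempo applies until the next tempo change or the target.
        if i + 1 < len(tempo_map):
            end = min(tempo_map[i + 1][0], tick)
        else:
            end = tick
        seconds += (end - start) * us_per_quarter / (division * 1_000_000)
    return round(seconds * clock_rate)
```

The result is an offset from the start of the file; a sender would add
it (modulo 2^32) to the RTP timestamp that corresponds to tick 0.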
The comex interpretation also works well for MIDI hardware
controllers that are coding raw sensor data directly onto an RTP MIDI
stream. Note that this controller design is preferable to a design
that converts raw sensor data into a MIDI 1.0 cable command stream
and then transcodes the stream onto an RTP MIDI stream.
The comex interpretation of timestamps is usually not the best
timestamp interpretation for transcoding a MIDI source that uses
implicit command timing (such as MIDI 1.0 DIN cables) into an RTP
MIDI stream. Appendix C.3 defines alternatives to comex semantics
and describes session configuration tools for selecting the timestamp
interpretation semantics for a stream.
One-Octet Delta Time:
Encoded form: 0ddddddd
Decoded form: 00000000 00000000 00000000 0ddddddd
Two-Octet Delta Time:
Encoded form: 1ccccccc 0ddddddd
Decoded form: 00000000 00000000 00cccccc cddddddd
Three-Octet Delta Time:
Encoded form: 1bbbbbbb 1ccccccc 0ddddddd
Decoded form: 00000000 000bbbbb bbcccccc cddddddd
Four-Octet Delta Time:
Encoded form: 1aaaaaaa 1bbbbbbb 1ccccccc 0ddddddd
Decoded form: 0000aaaa aaabbbbb bbcccccc cddddddd
Figure 4 -- Decoding Delta Time Formats
3.2. Command Coding
Each non-empty MIDI Command field in the MIDI list codes one of the
MIDI command types that may legally appear on a MIDI 1.0 DIN cable.
Standard MIDI File meta-events do not fit this definition and MUST
NOT appear in the MIDI list. As a rule, each MIDI Command field
codes a complete command, in the binary command format defined in
[MIDI]. In the remainder of this section, we describe exceptions to
this rule.
The first MIDI channel command in the MIDI list MUST include a status
octet. Running status coding, as defined in [MIDI], MAY be used for
all subsequent MIDI channel commands in the list. As in [MIDI],
System Common and System Exclusive messages (0xF0 ... 0xF7) cancel
the running status state, but System Real-Time messages (0xF8 ...
0xFF) do not affect the running status state. All system commands in
the MIDI list MUST include a status octet.
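As an informal illustration of running status decoding, the Python
sketch below expands a sequence of MIDI channel commands into
complete commands. It is deliberately simplified: it assumes the
input holds only channel command octets (the delta times that
interleave commands in a real MIDI list have already been removed,
and system commands are rejected). The function names are
hypothetical.

```python
def data_length(status):
    """Number of data octets that follow a channel command status octet."""
    # Program Change (0xC0-0xCF) and Channel Pressure (0xD0-0xDF) take
    # one data octet; all other channel commands take two.
    return 1 if 0xC0 <= status <= 0xDF else 2


def expand_running_status(octets):
    """Return a list of complete channel commands, one bytes object each."""
    commands = []
    status = None
    i = 0
    while i < len(octets):
        if octets[i] & 0x80:
            status = octets[i]
            if status >= 0xF0:
                raise ValueError("this sketch handles channel commands only")
            i += 1
        elif status is None:
            raise ValueError("first channel command must include a status octet")
        n = data_length(status)
        data = bytes(octets[i:i + n])
        if len(data) < n or any(b & 0x80 for b in data):
            raise ValueError("malformed data octets")
        commands.append(bytes([status]) + data)
        i += n
    return commands
```

Given the octets 0x90 0x3C 0x40 0x3E 0x40, the second NoteOn command
omits its status octet, and the sketch restores it from the running
status state.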
As we note above, the first channel command in the MIDI list MUST
include a status octet. However, the corresponding command in the
original MIDI source data stream might not have a status octet (in
this case, the source would be coding the command using running
status). If the status octet of the first channel command in the
MIDI list does not appear in the source data stream, the P (phantom)
header bit MUST be set to 1. In all other cases, the P bit MUST be
set to 0.
Note that the P bit describes the MIDI source data stream, not the
MIDI list encoding; regardless of the state of the P bit, the MIDI
list MUST include the status octet.
As receivers MUST be able to decode running status, sender
implementors should feel free to use running status to improve
bandwidth efficiency. However, senders SHOULD NOT introduce timing
jitter into an existing MIDI command stream through an inappropriate
use or removal of running status coding. This warning primarily
applies to senders whose RTP MIDI streams may be transcoded onto a
MIDI 1.0 DIN cable [MIDI] by the receiver: both the timestamps and
the command coding (running status or not) must comply with the
physical restrictions of implicit time coding over a slow serial
line.
On a MIDI 1.0 DIN cable [MIDI], a System Real-Time command may be
embedded inside of another "host" MIDI command. This syntactic
construction is not supported in the payload format: a MIDI Command
field in the MIDI list codes exactly one MIDI command (partially or
completely).
To encode an embedded System Real-Time command, senders MUST extract
the command from its host and code it in the MIDI list as a separate
command. The host command and System Real-Time command SHOULD appear
in the same MIDI list. The delta time of the System Real-Time
command SHOULD result in a command timestamp that encodes the System
Real-Time command placement in its original embedded position.
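The extraction step can be sketched in Python as follows. The sketch
assumes the sender holds the raw octets of one host command as they
appeared on the MIDI 1.0 DIN cable, with any System Real-Time octets
(0xF8 ... 0xFF) still embedded. It returns the host command and, for
each extracted command, the number of host octets that preceded it; a
sender could use that position to derive a suitable delta time. The
function name and the return layout are hypothetical.

```python
def extract_real_time(raw):
    """Split embedded System Real-Time octets out of a host command.

    Returns (host, real_time), where host is the host command with
    the embedded octets removed and real_time is a list of
    (host_octets_before, octet) pairs in their original order.
    """
    host = bytearray()
    real_time = []
    for octet in raw:
        if octet >= 0xF8:
            # System Real-Time commands are single status octets.
            real_time.append((len(host), octet))
        else:
            host.append(octet)
    return bytes(host), real_time
```

For a NoteOn command coded on the cable as 0x90 0x3C 0xF8 0x40, the
sketch yields the host command 0x90 0x3C 0x40 and one Timing Clock
(0xF8) command that originally followed the second host octet.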
Two methods are provided for encoding MIDI System Exclusive (SysEx)
commands in the MIDI list. A SysEx command may be encoded in a MIDI
Command field verbatim: a 0xF0 octet, followed by an arbitrary number
of data octets, followed by a 0xF7 octet.
Alternatively, a SysEx command may be encoded as multiple segments.
The command is divided into two or more SysEx command segments; each
segment is encoded in its own MIDI Command field in the MIDI list.
The payload format supports segmentation in order to encode SysEx
commands that encode information in the temporal pattern of data
octets. By encoding these commands as a series of segments, each
data octet may be associated with a distinct delta time.
Segmentation also supports the coding of large SysEx commands across
multiple packets.
To segment a SysEx command, first partition its data octet list into
two or more sublists. The last sublist MAY be empty (i.e., contain
no octets); all other sublists MUST contain at least one data octet.
To complete the segmentation, add the status octets defined in Figure
5 to the head and tail of the first, last, and any "middle" sublists.
Figure 6 shows example segmentations of a SysEx command.
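The segmentation procedure can be sketched in Python as follows,
using the status octets of Figure 5. This is an informal sketch: the
function name and the "cuts" parameter (the data octet indices at
which the command is partitioned) are hypothetical.

```python
def segment_sysex(command, cuts):
    """Split a verbatim SysEx command into two or more segments.

    command: 0xF0, data octets, 0xF7.
    cuts: increasing indices into the data octet list at which to cut.
    """
    if len(command) < 2 or command[0] != 0xF0 or command[-1] != 0xF7:
        raise ValueError("not a verbatim SysEx command")
    data = bytes(command[1:-1])
    bounds = [0] + list(cuts) + [len(data)]
    if len(bounds) < 3 or bounds != sorted(bounds):
        raise ValueError("need at least one cut, in increasing order")
    sublists = [data[a:b] for a, b in zip(bounds, bounds[1:])]
    # Only the last sublist may be empty.
    if any(len(sub) == 0 for sub in sublists[:-1]):
        raise ValueError("only the last sublist may be empty")
    segments = []
    for index, sub in enumerate(sublists):
        head = 0xF0 if index == 0 else 0xF7
        tail = 0xF7 if index == len(sublists) - 1 else 0xF0
        segments.append(bytes([head]) + sub + bytes([tail]))
    return segments
```

Calling segment_sysex on the original SysEx command of Figure 6 with
cuts [4] yields the first two-segment segmentation shown there, and
cuts [2, 4] yields the three-segment segmentation.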
A sender MAY cancel a segmented SysEx command transmission that is in
progress by sending the "cancel" sublist shown in Figure 5. A
"cancel" sublist MAY follow a "first" or "middle" sublist in the
transmission but MUST NOT follow a "last" sublist. The cancel MUST
be empty (thus, 0xF7 0xF4 is the only legal cancel sublist).
The cancellation feature is needed because Appendix C.1 defines
configuration tools that let session parties exclude certain SysEx
commands in the stream. Senders that transcode a MIDI source onto an
RTP MIDI stream under these constraints have the responsibility of
excluding undesired commands from the RTP MIDI stream.
The cancellation feature lets a sender start the transmission of a
command before the MIDI source has sent the entire command. If a
sender determines that the command whose transmission is in progress
should not appear on the RTP stream, it cancels the command. Without
a method for cancelling a SysEx command transmission, senders would
be forced to use a high-latency store-and-forward approach to
transcoding SysEx commands onto RTP MIDI packets, in order to
validate each SysEx command before transmission.
The recommended receiver reaction to a cancellation depends on the
capabilities of the receiver. For example, a sound synthesizer that
is directly parsing RTP MIDI packets and rendering them to audio will
be aware of the fact that SysEx commands may be cancelled in RTP
MIDI. These receivers SHOULD detect a SysEx cancellation in the MIDI
list and act as if they had never received the SysEx command.
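A receiver of this kind might reassemble SysEx commands along the
lines of the Python sketch below. The sketch assumes the MIDI Command
fields have already been parsed out of the MIDI list into an ordered
sequence of bytes objects. It ignores non-SysEx fields and does not
handle packet loss or the 0xF5 substitution used for the "dropped
0xF7" construction. The function name is hypothetical.

```python
def reassemble_sysex(fields):
    """Return the complete SysEx commands coded by a sequence of fields.

    Segmented commands are rebuilt from their sublists; a command
    whose transmission is cancelled (0xF7 0xF4) is discarded.
    """
    complete = []
    pending = None  # data octets of a segmented command in progress
    for field in fields:
        if len(field) < 2:
            continue  # e.g., a System Real-Time command
        head, tail, body = field[0], field[-1], bytes(field[1:-1])
        if head == 0xF0 and tail == 0xF7:
            complete.append(bytes(field))      # verbatim encoding
        elif head == 0xF0 and tail == 0xF0:
            pending = bytearray(body)          # "first" sublist
        elif head == 0xF7 and pending is not None:
            if tail == 0xF4:
                pending = None                 # "cancel": forget the command
            elif tail == 0xF0:
                pending += body                # "middle" sublist
            elif tail == 0xF7:
                complete.append(b"\xf0" + bytes(pending + body) + b"\xf7")
                pending = None                 # "last" sublist
    return complete
```

In this sketch, a cancelled command leaves no trace in the returned
list, which matches the "act as if they had never received the SysEx
command" behavior described above.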
As a second example, a synthesizer may be receiving MIDI data from an
RTP MIDI stream via a MIDI DIN cable (or a software API emulation of
a MIDI DIN cable). In this case, an RTP-MIDI-aware system receives
the RTP MIDI stream and transcodes it onto the MIDI DIN cable (or its
emulation). Upon the receipt of the cancel sublist, the RTP-MIDI-
aware transcoder might have already sent the first part of the SysEx
command on the MIDI DIN cable to the receiver.
Unfortunately, the MIDI DIN cable protocol cannot directly code
"cancel SysEx in progress" semantics. However, MIDI DIN cable
receivers begin SysEx processing after the complete command arrives.
The receiver checks to see if it recognizes the command (coded in the
first few octets) and then checks to see if the command is the
correct length. Thus, in practice, a transcoder can cancel a SysEx
command by sending an 0xF7 to (prematurely) end the SysEx command --
the receiver will detect the incorrect command length and discard the
command.
Appendix C.1 defines configuration tools that may be used to prohibit
SysEx command cancellation.
The relative ordering of SysEx command segments in a MIDI list must
match the relative ordering of the sublists in the original SysEx
command. By default, commands other than System Real-Time MIDI
commands MUST NOT appear between SysEx command segments (Appendix C.1
defines configuration tools to change this default to let other
command types appear between segments). If the command segments of
a SysEx command are placed in the MIDI lists of two or more RTP
packets, the segment ordering rules apply to the concatenation of all
affected MIDI lists.
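+------------------+-------------------+-------------------+
| Sublist Position | Head Status Octet | Tail Status Octet |
+------------------+-------------------+-------------------+
| first            | 0xF0              | 0xF0              |
| middle           | 0xF7              | 0xF0              |
| last             | 0xF7              | 0xF7              |
| cancel           | 0xF7              | 0xF4              |
+------------------+-------------------+-------------------+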
Figure 5 -- Command Segmentation Status Octets
[MIDI] permits 0xF7 octets that are not part of a (0xF0, 0xF7) pair
to appear on a MIDI 1.0 DIN cable. Unpaired 0xF7 octets have no
semantic meaning in MIDI apart from cancelling running status.
Unpaired 0xF7 octets MUST NOT appear in the MIDI list of the MIDI
Command section. We impose this restriction to avoid interference
with the command segmentation coding defined in Figure 5.
SysEx commands carried on a MIDI 1.0 DIN cable may use the "dropped
0xF7" construction [MIDI]. In this coding method, the 0xF7 octet is
dropped from the end of the SysEx command, and the status octet of
the next MIDI command acts both to terminate the SysEx command and
start the next command. To encode this construction in the payload
format, follow these steps:
o Determine the appropriate delta times for the SysEx command and
the command that follows the SysEx command.
o Insert the "dropped" 0xF7 octet at the end of the SysEx command to
form the standard SysEx syntax.
o Code both commands into the MIDI list using the rules above.
o Replace the 0xF7 octet that terminates the verbatim SysEx encoding
or the last segment of the segmented SysEx encoding with a 0xF5
octet. This substitution informs the receiver of the original
"dropped 0xF7" coding.
[MIDI] reserves the undefined System Common commands 0xF4 and 0xF5
and the undefined System Real-Time commands 0xF9 and 0xFD for future
use. By default, undefined commands MUST NOT appear in a MIDI
Command field in the MIDI list, with the exception of the 0xF5 octets
used to code the "dropped 0xF7" construction and the 0xF4 octets used
by SysEx "cancel" sublists.
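The 0xF5 substitution used for the "dropped 0xF7" construction can be
sketched in Python as follows. The sketch covers only the verbatim
encoding and assumes the sender holds the SysEx octets as they
appeared on the MIDI 1.0 DIN cable, that is, without a terminating
0xF7. Delta time selection and the coding of the command that follows
are outside its scope. The function name is hypothetical.

```python
def encode_dropped_f7(din_octets):
    """Code a SysEx command that used the "dropped 0xF7" construction.

    din_octets: 0xF0 followed by data octets, with no terminating 0xF7.
    Returns the MIDI Command field for the verbatim encoding.
    """
    din_octets = bytes(din_octets)
    if not din_octets or din_octets[0] != 0xF0:
        raise ValueError("SysEx command must begin with 0xF0")
    if any(octet & 0x80 for octet in din_octets[1:]):
        raise ValueError("unexpected status octet in SysEx data")
    # Insert the "dropped" 0xF7 to form the standard SysEx syntax.
    standard = din_octets + b"\xf7"
    # Replace the terminating 0xF7 with 0xF5 to record the original coding.
    return standard[:-1] + b"\xf5"
```

The two steps are kept separate in the sketch to mirror the procedure
in the text, although their net effect is to append a single 0xF5
octet.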
During session configuration, a stream may be customized to transport
undefined commands (Appendix C.1). For this case, we now define how
senders encode undefined commands in the MIDI list.
An undefined System Real-Time command MUST be coded using the System
Real-Time rules.
If the undefined System Common commands are put to use in a future
version of [MIDI], the command will begin with an 0xF4 or 0xF5 status
octet, followed by an arbitrary number of data octets (i.e., zero or
more data bytes). To encode these commands, senders MUST terminate
the command with an 0xF7 octet and place the modified command into
the MIDI Command field.
Unfortunately, non-compliant uses of the undefined System Common
commands may appear in MIDI implementations. To model these
commands, we assume that the command begins with an 0xF4 or 0xF5
status octet, followed by zero or more data octets, followed by zero
or more trailing 0xF7 status octets. To encode the command, senders
MUST first remove all trailing 0xF7 status octets from the command.
Then, senders MUST terminate the command with an 0xF7 octet and place
the modified command into the MIDI Command field.
Note that we include the trailing octets in our model as a cautionary
measure: if such commands appeared in a non-compliant use of an
undefined System Common command, an RTP MIDI encoding of the command
that did not remove trailing octets could be mistaken for an encoding
of the "middle" or "last" sublist of a segmented SysEx command
(Figure 5) under certain packet loss conditions.
Original SysEx command:
0xF0 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0xF7
A two-segment segmentation:
0xF0 0x01 0x02 0x03 0x04 0xF0
0xF7 0x05 0x06 0x07 0x08 0xF7
A different two-segment segmentation:
0xF0 0x01 0xF0
0xF7 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0xF7
A three-segment segmentation:
0xF0 0x01 0x02 0xF0
0xF7 0x03 0x04 0xF0
0xF7 0x05 0x06 0x07 0x08 0xF7
The segmentation with the largest number of segments:
0xF0 0x01 0xF0
0xF7 0x02 0xF0
0xF7 0x03 0xF0