<abstract>This specification defines a Jingle application type for negotiating one or more sessions that use the Real-time Transport Protocol (RTP) to exchange media such as voice or video. The application type includes a straightforward mapping to Session Description Protocol (SDP) for interworking with SIP media endpoints.</abstract>
<remark><p>Added name attribute to active element to mirror usage for mute element; clarified meaning of session in the context of this specification; recommended that all sessions established via the same Jingle negotiation should be treated as synchronized.</p></remark>
<remark><p>In accordance with list consensus, generalized to cover all RTP media, not just audio; corrected text regarding payload types sent by responder in order to match SDP approach.</p></remark>
<remark><p>Removed info message for busy since it is now a Jingle-specific error condition defined in XEP-0166; defined info message for active.</p></remark>
<remark><p>Specified Jingle conformance, including the preference for lossy transports over reliable transports and the process of sending and receiving audio content over each transport type.</p></remark>
<remark><p>Renamed to mention RTP as the associated transport; corrected negotiation flow to be consistent with SIP/SDP (each party specifies a list of the payload types it can receive); added profile attribute to content element in order to specify RTP profile in use.</p></remark>
<remark><p>Specified how to include SDP parameters and codec-specific parameters; clarified negotiation process; added Speex examples; removed queued info message.</p></remark>
<remark><p>Defined info message for busy; added info message examples; recommended use of Speex; updated schema and XMPP Registrar considerations.</p></remark>
<p>&xep0166; can be used to initiate and negotiate a wide range of peer-to-peer sessions. One session type of interest is media such as voice or video. This document specifies an application format for negotiating Jingle media sessions, where the media is exchanged over the Real-time Transport Protocol (RTP; see &rfc3550;).</p>
<p>In accordance with Section 8 of <cite>XEP-0166</cite>, this document specifies the following information related to the Jingle RTP application type:</p>
<li><p>The application format negotiation process is defined in the <link url='#negotiation'>Negotiating a Jingle RTP Session</link> section of this document.</p></li>
<li><p>A mapping of Jingle semantics to the Session Description Protocol is provided in the <link url='#sdp'>Mapping to Session Description Protocol</link> section of this document.</p></li>
<li><p>A Jingle RTP session SHOULD use a lossy transport method such as &xep0177; or the "ice-udp" method specified in &xep0176;, but MAY use a reliable transport such as "ice-tcp" if a low-bandwidth codec is employed.</p></li>
<li><p>For lossy transports, outbound content shall be encoded into RTP packets and each packet shall be sent individually over the transport. Each inbound packet received over the transport is an RTP packet.</p></li>
<li><p>For reliable transports, outbound content shall be encoded into RTP packets and the data of each packet shall be sent in succession over the transport. Incoming data received over the transport shall be processed as a stream of RTP packets, where each RTP packet boundary marks the location of the next packet.</p></li>
<p>A Jingle RTP session is described by a content type that contains one application format and one transport method. Each <content/> element defines a single RTP session. A Jingle negotiation MAY result in the establishment of multiple RTP sessions (e.g., one for audio and one for video). An application SHOULD consider all of the RTP sessions that are established via the same Jingle negotiation to be synchronized for purposes of streaming, playback, recording, etc.</p>
<p>The application format consists of one or more encodings contained within a wrapper <description/> element qualified by the 'urn:xmpp:tmp:jingle:apps:rtp' namespace &NSNOTE;. In the language of <cite>RFC 4566</cite> each encoding is a payload-type; therefore, each <payload-type/> element specifies an encoding that can be used for the RTP stream, as illustrated in the following example.</p>
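<p>A minimal sketch of such a description follows. The 'id' and 'name' attributes appear elsewhere in this document; the 'clockrate' attribute name and the particular set of codecs are assumptions chosen for illustration.</p>
<example caption="Illustrative description element (hypothetical values)"><![CDATA[
<description xmlns='urn:xmpp:tmp:jingle:apps:rtp'>
  <!-- dynamic payload IDs (96-127) chosen by the offering party -->
  <payload-type id='96' name='speex' clockrate='16000'/>
  <payload-type id='97' name='speex' clockrate='8000'/>
  <!-- static payload IDs assigned in RFC 3551 -->
  <payload-type id='18' name='G729'/>
  <payload-type id='0' name='PCMU'/>
  <payload-type id='8' name='PCMA'/>
</description>
]]></example>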
<p>The encodings SHOULD be provided in order of preference by placing the most-preferred &PAYLOADTYPE; element as the first child of the &DESCRIPTION; element (etc.).</p>
<p>The allowable attributes of the &PAYLOADTYPE; element are as follows:</p>
<p>In Jingle RTP, the encodings are used in the context of RTP. The most common encodings for the Audio/Video Profile (AVP) of RTP are listed in &rfc3551; (these "static" types are reserved from payload ID 0 through payload ID 95), although other encodings are allowed (these "dynamic" types use payload IDs 96 to 127) in accordance with the dynamic assignment rules described in Section 3 of <cite>RFC 3551</cite>. The payload IDs are represented in the 'id' attribute.</p>
<p>Each <payload-type/> element MAY contain one or more child elements that specify particular parameters related to the payload. For example, as described in &rtpspeex;, the "cng", "mode", and "vbr" parameters may be specified in relation to usage of the Speex <note>See <<link url='http://www.speex.org/'>http://www.speex.org/</link>>.</note> codec. Where such parameters are encoded via the "fmtp" SDP attribute, they shall be represented in Jingle via the following format:</p>
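<p>One possible rendering of that format is sketched here. The 'name' and 'value' attribute names are assumptions made for illustration, as are the placeholder strings.</p>
<example caption="Illustrative parameter format (hypothetical attribute names)"><![CDATA[
<!-- one parameter element per name/value pair carried in the "fmtp" attribute -->
<parameter name='foo' value='bar'/>
]]></example>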
<p>The order of parameter elements MUST be ignored.</p>
<p>Parameter names MUST be treated as case-sensitive. However, parameter names are effectively guaranteed to be unique, since &IANA; maintains a registry of SDP parameters (see <<link url='http://www.iana.org/assignments/sdp-parameters'>http://www.iana.org/assignments/sdp-parameters</link>>).</p>
<section1 topic='Negotiating a Jingle RTP Session' anchor='negotiation'>
<p>In general, the process for negotiating a Jingle RTP session is as follows:</p>
<code><![CDATA[
Initiator             Responder
|                             |
|      session-initiate       |
|---------------------------->|
|             ack             |
|<----------------------------|
|   [transport negotiation]   |
|<--------------------------->|
|       session-accept        |
|<----------------------------|
|             ack             |
|---------------------------->|
|         AUDIO (RTP)         |
|<===========================>|
|                             |
]]></code>
<p>When the initiator sends a session-initiate stanza to the responder, the &DESCRIPTION; element includes all of the payload types that the initiator can send and/or receive for Jingle RTP, each one encapsulated in a separate &PAYLOADTYPE; element (the rules specified in &rfc3264; SHOULD be followed regarding inclusion of payload types).</p>
<p>Upon receiving the session-initiate stanza, the responder determines whether it can proceed with the negotiation. The general Jingle error cases are specified in <cite>XEP-0166</cite> and illustrated in the <link url='#scenarios'>Scenarios</link> section of this document.</p>
<p>After successful transport negotiation (not shown here), the responder accepts the session by sending a session-accept action to the initiator. The session-accept SHOULD include a subset of the payload types sent by the initiator, i.e., a list of the offered payload types that the responder can send and/or receive. The list that the responder sends SHOULD retain the ID numbers specified by the initiator. The order of the &PAYLOADTYPE; elements indicates the responder's preferences, with the most-preferred types first.</p>
<p>In the following example, we imagine that the responder supports Speex at a clockrate of 8000 but not 16000, G729, and PCMU but not PCMA. Therefore the responder returns only two payload types.</p>
<p>The SDP media type for Jingle RTP is "audio" (see Section 8.2.1 of <cite>RFC 4566</cite>) for audio media, "video" (see Section 8.2.1 of <cite>RFC 4566</cite>) for video media, etc.</p>
<p>If the payload type is static (payload-type IDs 0 through 95 inclusive), it MUST be mapped to a media field defined in <cite>RFC 4566</cite>. The generic format for the media field is as follows:</p>
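<p>A sketch of that generic form, using the older SDP field names that the following paragraph also uses (RFC 4566 itself labels the last two fields "proto" and "fmt"):</p>
<example caption="Generic SDP media field (illustrative)"><![CDATA[
m=<media> <port> <transport> <fmt list>
]]></example>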
<p>In the context of Jingle RTP sessions, the <media> is "audio" or "video" or some other media type, the <port> is the preferred port for such communications (which may be determined dynamically), and the <fmt list> is the payload-type ID.</p>
<p>For example, consider the following static payload-type:</p>
<example caption="Jingle format for static payload-type"><![CDATA[
<payload-type id="13" name="CN"/>
]]></example>
<p>That Jingle-formatted information would be mapped to SDP as follows:</p>
<example caption="SDP mapping of static payload-type"><![CDATA[
m=audio 9999 RTP/AVP 13
]]></example>
<p>If the payload type is dynamic (payload-type IDs 96 through 127 inclusive), it SHOULD be mapped to an SDP media field plus an SDP attribute field named "rtpmap".</p>
<p>For example, consider a payload of 16-bit linear-encoded stereo audio sampled at 16 kHz associated with dynamic payload-type 96:</p>
<example caption="Jingle format for dynamic payload-type"><![CDATA[
<payload-type id="96" name="speex" clockrate="16000"/>
]]></example>
<p>That Jingle-formatted information would be mapped to SDP as follows:</p>
<example caption="SDP mapping of dynamic payload-type"><![CDATA[
m=audio 9999 RTP/AVP 96
a=rtpmap:96 speex/16000
]]></example>
<p>As noted, if additional parameters are to be specified, they shall be represented as attributes of the <parameter/> child of the &PAYLOADTYPE; element, as in the following example.</p>
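<p>As a sketch, assuming 'name' and 'value' attributes on the parameter element and a 'clockrate' attribute on the payload type (both assumptions made for illustration), Speex with variable bit rate and comfort noise generation enabled might be expressed as shown here, together with the SDP to which it would map:</p>
<example caption="Illustrative payload-type with parameters"><![CDATA[
<payload-type id="96" name="speex" clockrate="16000">
  <parameter name="vbr" value="on"/>
  <parameter name="cng" value="on"/>
</payload-type>
]]></example>
<example caption="Illustrative SDP mapping of payload-type with parameters"><![CDATA[
m=audio 9999 RTP/AVP 96
a=rtpmap:96 speex/16000
a=fmtp:96 vbr=on;cng=on
]]></example>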
<section1 topic='Negotiation of SRTP' anchor='srtp'>
<p>&rfc3711; defines the Secure Real-time Transport Protocol, and &rfc4568; defines the SDP "crypto" attribute for signalling and negotiating the use of SRTP in the context of offer-answer protocols such as SIP. To enable the use of SRTP and gatewaying to non-XMPP technologies that make use of the "crypto" SDP attribute, we define a corresponding <crypto/> element qualified by the 'urn:xmpp:tmp:jingle:apps:rtp' namespace.</p>
<p>If the initiator wishes to use SRTP, the session-initiate MUST include at least one <crypto/> element and MAY include multiple instances of the element. The <crypto/> element MUST be a child of the <description/> element.</p>
<p>The XML attributes of the <crypto/> element are as follows:</p>
<ul>
<li>crypto-suite -- this maps to the SDP "crypto-suite" parameter and has the same semantics (i.e., it is an identifier that describes the encryption and authentication algorithms).</li>
<li>key-params -- this maps to the SDP "key-params" parameter and has the same semantics (i.e., it provides one or more sets of keying material for the crypto-suite in question).</li>
<li>session-params -- this maps to the SDP "session-params" parameter and has the same semantics (i.e., it provides transport-specific parameters for SRTP negotiation).</li>
<li>tag -- this maps to the SDP "tag" parameter and has the same semantics (i.e., it is a decimal number used as an identifier for a particular crypto element).</li>
</ul>
<p>When the responder receives a session-initiate action containing one or more instances of the <crypto/> element, it MUST either accept one of the <crypto/> elements or reject the offer by sending a session-terminate action with a reason of <invalid-crypto/>.</p>
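<p>The correspondence between the four attributes and the SDP "crypto" attribute can be sketched as follows. The crypto-suite and session parameters shown are ones defined in RFC 4568; the key string is a non-functional placeholder rather than real keying material, and the surrounding payload type is included only for context.</p>
<example caption="Illustrative crypto element (placeholder key)"><![CDATA[
<description xmlns='urn:xmpp:tmp:jingle:apps:rtp'>
  <payload-type id='0' name='PCMU'/>
  <!-- key-params carries a placeholder, not a valid key and salt -->
  <crypto tag='1'
          crypto-suite='AES_CM_128_HMAC_SHA1_80'
          key-params='inline:PLACEHOLDERKEYANDSALT|2^20|1:32'
          session-params='KDR=1 UNENCRYPTED_SRTCP'/>
</description>
]]></example>
<example caption="Illustrative SDP mapping of crypto element"><![CDATA[
a=crypto:1 AES_CM_128_HMAC_SHA1_80 inline:PLACEHOLDERKEYANDSALT|2^20|1:32 KDR=1 UNENCRYPTED_SRTCP
]]></example>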
<p>Informational messages may be sent by either party within the context of Jingle to communicate the status of a Jingle RTP session, device, or principal. The informational message MUST be an IQ-set containing a &JINGLE; element of type "session-info", where the informational message is a payload element qualified by the 'urn:xmpp:tmp:jingle:apps:rtp:info' namespace; the following payload elements are defined: <note>A <trying/> element (equivalent to the SIP 100 Trying response code) is not necessary, since each session-level action is acknowledged via XMPP IQ semantics.</note></p>
<td>The principal or device is again actively participating in the session after having been on hold or on mute. The <active/> element MAY possess a 'name' attribute whose value specifies a particular session that is again active (e.g., activating the video aspect but not the voice aspect of a voice+video chat). If no 'name' attribute is included, the recipient MUST assume that all sessions are active.</td>
<td>The principal is temporarily stopping media output but continues to accept media input. The <mute/> element MAY possess a 'name' attribute whose value specifies a particular session to be muted (e.g., muting the video aspect but not the voice aspect of a voice+video chat). If no 'name' attribute is included, the recipient MUST assume that all sessions are to be muted.</td>
<td>The device is ringing but the principal has not yet interacted with it to answer (this maps to the SIP 180 response code).</td>
</tr>
</table>
<p>Note: Because the informational message is sent in an IQ-set, the receiving party MUST return either an IQ-result or an IQ-error (normally only an IQ-result to acknowledge receipt; no error flows are defined or envisioned at this time).</p>
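<p>As a sketch, the payload elements described above might look as follows when carried inside a session-info action. The 'name' value "video" is a hypothetical session name, and the enclosing IQ and &JINGLE; wrapper are omitted.</p>
<example caption="Illustrative informational payloads (hypothetical session name)"><![CDATA[
<!-- the device is ringing -->
<ringing xmlns='urn:xmpp:tmp:jingle:apps:rtp:info'/>

<!-- mute only the session named "video" -->
<mute xmlns='urn:xmpp:tmp:jingle:apps:rtp:info' name='video'/>

<!-- no 'name' attribute: all sessions are active again -->
<active xmlns='urn:xmpp:tmp:jingle:apps:rtp:info'/>
]]></example>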
</section2>
<section2 topic='Examples' anchor='info-examples'>
<example caption="Responder sends active message"><![CDATA[
]]></example>
<p>If an entity supports Jingle RTP sessions, it MUST advertise that fact by returning a feature of "urn:xmpp:tmp:jingle:apps:rtp" &NSNOTE; in response to &xep0030; information requests.</p>
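<p>A sketch of such an exchange follows. The JIDs and the 'id' value are hypothetical, and a real response would normally list other features as well.</p>
<example caption="Illustrative service discovery information request"><![CDATA[
<iq from='romeo@montague.lit/orchard'
    id='disco1'
    to='juliet@capulet.lit/balcony'
    type='get'>
  <query xmlns='http://jabber.org/protocol/disco#info'/>
</iq>
]]></example>
<example caption="Illustrative service discovery information response"><![CDATA[
<iq from='juliet@capulet.lit/balcony'
    id='disco1'
    to='romeo@montague.lit/orchard'
    type='result'>
  <query xmlns='http://jabber.org/protocol/disco#info'>
    <feature var='urn:xmpp:tmp:jingle:apps:rtp'/>
  </query>
</iq>
]]></example>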
<p>In this scenario, Romeo initiates a voice chat with Juliet using a transport method of ICE-UDP. The parties also exchange informational messages.</p>
<p>Because the parties have chosen the Jingle ICE-UDP Transport Method, the initiator and responder exchange an open-ended number of possible candidate transports, perform connectivity checks, and agree upon a candidate transport as explained in <cite>XEP-0176</cite>. Once ICE negotiation is completed, the responder sends a session-accept action to the initiator.</p>
<p>If the payload types and transport candidate can be successfully used by both parties, then the initiator acknowledges the session-accept action.</p>
<p>The parties now begin to exchange media. In this case they would exchange audio using the Speex codec at a clockrate of 8000 since that is the highest-priority codec for the responder (as determined by the XML order of the &PAYLOADTYPE; children).</p>
<p>The parties may continue the session as long as desired.</p>
<p>Eventually, one of the parties terminates the session.</p>
<p>In this scenario, Romeo initiates a combined audio and video chat with Juliet using a transport method of ICE-UDP. Juliet at first refuses the video portion, then later offers to add video, which Romeo accepts. The parties also exchange various informational messages.</p>
<p>Because the parties have chosen the Jingle ICE-UDP Transport Method, the initiator and responder exchange an open-ended number of possible candidate transports, perform connectivity checks, and agree upon a candidate transport as explained in <cite>XEP-0176</cite>. Once ICE negotiation is completed, the responder sends a session-accept action to the initiator.</p>
<p>As above, if the payload types and transport candidate can be successfully used by both parties, then the initiator acknowledges the session-accept action.</p>
<p>The parties now begin to exchange media. In this case they would exchange audio using the Speex codec at a clockrate of 8000 since that is the highest-priority codec for the responder (as determined by the XML order of the &PAYLOADTYPE; children).</p>
<p>The media session proceeds. Now they would exchange both audio and video, where the audio is exchanged via the Speex codec at a clockrate of 8000 and the video is exchanged using the Theora codec with a height of 720 pixels, a width of 1280 pixels, and so on.</p>
<section2 topic='Jingle Audio via SRTP, Negotiated with ICE-UDP' anchor='scenarios-srtp'>
<p>In this scenario, Romeo initiates a secure voice chat with Juliet using a transport method of ICE-UDP. The parties also exchange informational messages.</p>
<p>Because the parties have chosen the Jingle ICE-UDP Transport Method, the initiator and responder exchange an open-ended number of possible candidate transports, perform connectivity checks, and agree upon a candidate transport as explained in <cite>XEP-0176</cite>. Once ICE negotiation is completed, the responder sends a session-accept action to the initiator.</p>
<p>If the payload types and transport candidate can be successfully used by both parties, then the initiator acknowledges the session-accept action.</p>
<p>The parties now begin to exchange media. In this case they would exchange audio using the Speex codec at a clockrate of 8000 since that is the highest-priority codec for the responder (as determined by the XML order of the &PAYLOADTYPE; children).</p>
<p>The parties may continue the session as long as desired.</p>
<p>Eventually, one of the parties terminates the session.</p>
<p>For the sake of interoperability with a wide variety of free and open-source voice systems as well as deployment of patent-free technologies, support for the Speex codec is RECOMMENDED.</p>
<p>For the sake of interoperability with the public switched telephone network (PSTN) and most VoIP providers, support for the Pulse Code Modulation (PCM) codec defined in &ITU; recommendation G.711 is RECOMMENDED, including both the μ-law ("U-law") version widely deployed in North America and Japan and the A-law version widely deployed in the rest of the world.</p>
<p>If it is necessary to send Dual Tone Multi-Frequency (DTMF) tones in the content of audio exchanges, it is RECOMMENDED to use the XML format specified in &xep0181;. However, an implementation MAY also support native RTP methods, specifically the "audio/telephone-event" and "audio/tone" media types.</p>
</section3>
<section3 topic='When to Listen for Audio' anchor='impl-audio-listen'>
<p>When the Jingle RTP content type is accepted via a session-accept action, both initiator and responder SHOULD start listening for audio as defined by the negotiated transport method and audio application format. For interoperability with telephony systems, after the responder acknowledges the session initiation request, the responder SHOULD send a "ringing" message and both parties SHOULD play any audio received.</p>
<p>In order to secure the data stream, implementations SHOULD use encryption methods appropriate to the transport method and media being exchanged. Such encryption methods are out of scope for this specification.</p>
<p>Upon advancement of this specification, the &REGISTRAR; shall issue permanent namespaces in accordance with the process defined in Section 4 of &xep0053;.</p>
<p>The following namespaces are requested, and are thought to be unique per the XMPP Registrar's requirements:</p>
<p>For each RTP media type that an entity supports, it MUST advertise support for the "urn:xmpp:tmp:jingle:apps:rtp#[media]" feature, where the string "[media]" is replaced by the appropriate media type such as "audio" or "video".</p>
<p>The initial registry submission is as follows.</p>
<code caption='Registry Submission'><![CDATA[
<var>
<name>urn:xmpp:tmp:jingle:apps:rtp#audio</name>
<desc>Signals support for audio sessions via RTP</desc>
<doc>XEP-0167</doc>
</var>
<var>
<name>urn:xmpp:tmp:jingle:apps:rtp#video</name>
<desc>Signals support for video sessions via RTP</desc>
<doc>XEP-0167</doc>
</var>
]]></code>
<p>Thanks to Milton Chen, Diana Cionoiu, Olivier Crête, Tim Julien, Steffen Larsen, Jeff Muller, Mike Ruprecht, and Paul Witty for their feedback.</p>