<abstract>This specification defines a Jingle application type for negotiating one or more sessions that use the Real-time Transport Protocol (RTP) to exchange media such as voice or video. The application type includes a straightforward mapping to Session Description Protocol (SDP) for interworking with SIP media endpoints.</abstract>
<remark><p>Moved codec recommendations to a separate specification; harmonized session flows with XEP-0166; modified flow for combined audio/video scenario to use content-modify with senders attribute set to none for media pause and set to both for media resumption; clarified handling of description-info message.</p></remark>
<remark><p>Added ssrc attribute to description element; clarified handling with streaming transports; in accordance with list consensus, moved zrtp-hash to a separate specification; updated examples to reflect changes to XEP-0176.</p></remark>
<li>Because the modified encryption syntax is not backwards-compatible, incremented protocol version from 0 to 1 and changed namespace from urn:xmpp:jingle:apps:rtp:zero to urn:xmpp:jingle:apps:rtp:1.</li>
<li>Defined handling of early media, including mappings to RFC 3959 and RFC 3960 using the newly-defined 'disposition' attribute for the <content/> element in XEP-0166.</li>
<remark><p>Added name attribute to active element to mirror usage for mute element; clarified meaning of session in the context of this specification; recommended that all sessions established via the same Jingle negotiation should be treated as synchronized.</p></remark>
<remark><p>In accordance with list consensus, generalized to cover all RTP media, not just audio; corrected text regarding payload types sent by responder in order to match SDP approach.</p></remark>
<remark><p>Removed info message for busy since it is now a Jingle-specific error condition defined in XEP-0166; defined info message for active.</p></remark>
<remark><p>Specified Jingle conformance, including the preference for datagram transports over streaming transports and the process of sending and receiving audio content over each transport type.</p></remark>
<remark><p>Renamed to mention RTP as the associated transport; corrected negotiation flow to be consistent with SIP/SDP (each party specifies a list of the payload types it can receive); added profile attribute to content element in order to specify RTP profile in use.</p></remark>
<remark><p>Specified how to include SDP parameters and codec-specific parameters; clarified negotiation process; added Speex examples; removed queued info message.</p></remark>
<remark><p>Defined info message for busy; added info message examples; recommended use of Speex; updated schema and XMPP Registrar considerations.</p></remark>
<p>&xep0166; can be used to initiate and negotiate a wide range of peer-to-peer sessions. One session type of interest is media such as voice or video. This document specifies an application format for negotiating Jingle media sessions, where the media is exchanged over the Realtime Transport Protocol (RTP; see &rfc3550;).</p>
<p>In accordance with Section 10 of <cite>XEP-0166</cite>, this document specifies the following information related to the Jingle RTP application type:</p>
<li><p>The application format negotiation process is defined in the <linkurl='#negotiation'>Negotiating a Jingle RTP Session</link> section of this document.</p></li>
<li><p>A mapping of Jingle semantics to the Session Description Protocol is provided in the <linkurl='#sdp'>Mapping to Session Description Protocol</link> section of this document.</p></li>
<li><p>A Jingle RTP session SHOULD use a datagram transport method (e.g. &xep0177; or the "ice-udp" method specified in &xep0176;), but MAY use a streaming transport if the end-to-end link has minimal latency and the media negotiated is not unduly heavy (e.g., it might be possible to use a streaming transport for audio, but not for video).</p></li>
<li><p>Jingle RTP requires two components: one for RTP itself and one for the Real Time Control Protocol (RTCP). The component numbered "1" MUST be associated with RTP and the component numbered "2" MUST be associated with RTCP. Even if an implementation does not support RTCP, it MUST accept Jingle content types that include component "2" by mirroring the second component in its replies (however, it would simply ignore the RTCP-related data during the RTP session).</p></li>
<li><p>For datagram transports, outbound content shall be encoded into RTP packets and each packet shall be sent individually over the transport. Each inbound packet received over the transport is an RTP packet.</p></li>
<li><p>For streaming transports, outbound content shall be encoded into RTP packets, framed in accordance with &rfc4571;, and sent in succession over the transport. Incoming data received over the transport shall be processed as a stream of RTP packets, where each RTP packet boundary marks the location of the next packet.</p></li>
<p>A Jingle RTP session is described by a content type that contains one application format and one transport method. Each <content/> element defines a single RTP session. A Jingle negotiation MAY result in the establishment of multiple RTP sessions (e.g., one for audio and one for video). An application SHOULD consider all of the RTP sessions that are established via the same Jingle negotiation to be synchronized for purposes of streaming, playback, recording, etc.</p>
<p>The application format consists of one or more encodings contained within a wrapper <description/> element qualified by the 'urn:xmpp:jingle:apps:rtp:1' namespace &VNOTE;. In the language of <cite>RFC 4566</cite> each encoding is a payload-type; therefore, each <payload-type/> element specifies an encoding that can be used for the RTP stream, as illustrated in the following example.</p>
<p>The &DESCRIPTION; element is intended to be a child of a Jingle &CONTENT; element as specified in <cite>XEP-0166</cite>.</p>
<p>The &DESCRIPTION; element MUST possess a 'media' attribute that specifies the media type, such as "audio" or "video", where the media type SHOULD be as registered at &ianamedia;.</p>
<p>The &DESCRIPTION; element MAY possess a 'ssrc' attribute that specifies the 32-bit synchronization source for this media stream, as defined in <cite>RFC 3550</cite>.</p>
<p>After inclusion of one or more &PAYLOADTYPE; child elements, the &DESCRIPTION; element MAY also contain a <bandwidth/> element that specifies the allowable or preferred bandwidth for use by this application type. The 'type' attribute of the <bandwidth/> element SHOULD be a value for the SDP "bwtype" parameter as listed in the &ianasdp;. For RTP sessions, often the <bandwidth/> element will specify the "session bandwidth" as described in Section 6.2 of <cite>RFC 3550</cite>, measured in kilobits per second as described in Section 5.2 of <cite>RFC 4566</cite>.</p>
<p>The encodings SHOULD be provided in order of preference by placing the most-preferred payload type as the first &PAYLOADTYPE; child of the &DESCRIPTION; element and the least-preferred payload type as the last child.</p>
<p>In Jingle RTP, the encodings are used in the context of RTP. The most common encodings for the Audio/Video Profile (AVP) of RTP are listed in &rfc3551; (these "static" types are reserved from payload ID 0 through payload ID 95), although other encodings are allowed (these "dynamic" types use payload IDs 96 to 127) in accordance with the dynamic assignment rules described in Section 3 of <cite>RFC 3551</cite>. The payload IDs are represented in the 'id' attribute.</p>
<p>Each <payload-type/> element MAY contain one or more child elements that specify particular parameters related to the payload. For example, as described in &rtpspeex;, the "cng", "mode", and "vbr" parameters can be specified in relation to usage of the Speex <note>See <<linkurl='http://www.speex.org/'>http://www.speex.org/</link>>.</note> codec. Where such parameters are encoded via the "fmtp" SDP attribute, they shall be represented in Jingle via the following format:</p>
<p>Parameter names MUST be treated as case-sensitive.</p>
<p>Note: Parameter names are effectively guaranteed to be unique, since &IANA; maintains a registry of SDP parameters (see <<linkurl='http://www.iana.org/assignments/sdp-parameters'>http://www.iana.org/assignments/sdp-parameters</link>>).</p>
<p>When the initiator sends a session-initiate message to the responder, the &DESCRIPTION; element includes all of the payload types that the initiator can send and/or receive for Jingle RTP, each one encapsulated in a separate &PAYLOADTYPE; element (the rules specified in &rfc3264; SHOULD be followed regarding inclusion of payload types).</p>
<p>Upon receiving the session-initiate stanza, the responder determines whether it can proceed with the negotiation. The general Jingle error cases are specified in <cite>XEP-0166</cite> and illustrated in the <linkurl='#scenarios'>Scenarios</link> section of this document.</p>
<p>Depending on user preferences or client configuration, a user agent controlled by a human user might need to wait for the user to affirm a desire to proceed with the session before continuing. When the user agent has received such affirmation (or if the user agent can automatically proceed for any reason, e.g. because no human intervention is expected or because a human user has configured the user agent to automatically accept sessions with a given entity), it returns a Jingle session-accept message. The session-accept message SHOULD include a subset of the payload types sent by the initiator, i.e., a list of the offered payload types that the responder can send and/or receive. The list that the responder sends SHOULD retain the ID numbers specified by the initiator. The order of the &PAYLOADTYPE; elements indicates the responder's preferences, with the most-preferred type first.</p>
<p>In the following example, we imagine that the responder supports Speex at a clockrate of 8000 but not 16000, G729, and PCMA but not PMCU. Therefore the responder returns only two payload types (since PMCA was not offered).</p>
<p>Note: If the responder supports none of the payload-types offered by the initiator, the responder SHOULD terminate the session and include a Jingle reason of <failed-application/>.</p>
<p>If the responder accepts the session, the initiator acknowledges the session-accept message:</p>
<p>The initiator and responder would then attempt to establish connectivity for the data channel, Once they do, they would exchange media using any of the codecs that meet the following criteria:</p>
<p>The SDP media type for Jingle RTP is "audio" (see Section 8.2.1 of <cite>RFC 4566</cite>) for audio media, "video" (see Section 8.2.1 of <cite>RFC 4566</cite>) for video media, etc. The media type is reflected in the Jingle 'media' attribute.</p>
<p>The Jingle <bandwidth/> element SHALL be mapped to an SDP b= line; in particular, the value of the 'type' attribute shall be mapped to the SDP <bwtype> parameter and the XML character data of the Jingle <bandwidth/> element shall be mapped to the SDP <bandwidth> parameter.</p>
<p>If the payload type is static (payload-type IDs 0 through 95 inclusive), it MUST be mapped to an m= line as defined in <cite>RFC 4566</cite>. The generic format for this line is as follows:</p>
<p>The SDP <media> parameter is "audio" or "video" or some other media type as specified by the Jingle 'media' attribute, the <port> parameter is the preferred port for such communications (which might be determined dynamically), and the <fmt list> parameter is the payload-type ID.</p>
<p>If the payload type is dynamic (payload-type IDs 96 through 127 inclusive), it SHOULD be mapped to an SDP media field plus an SDP attribute field named "rtpmap".</p>
<p>For example, consider a payload of 16-bit linear-encoded stereo audio sampled at 16KHz associated with dynamic payload-type 96:</p>
<p>As noted, if additional parameters are to be specified, they shall be represented as attributes of the <parameter/> child of the &PAYLOADTYPE; element, as in the following example.</p>
<p>The term "early media" refers to media that is exchanged before a responder has definitively accepted a session request generated by an initiator or before end-to-end connectivity has been established (e.g., the media could be generated by an intermediate call manager or media relay). Early media is typically used to send ringing tones and announcements, using either audio streams or Dual Tone Multi-Frequency (DTMF) events.</p>
<p>In Jingle, the exchange of early media is established through use of the "content-add" action. In order to match the usage specified in &rfc3959; and &rfc3960;, when adding a content definition for early media the value of the &CONTENT; element's 'disposition' attribute MUST be "early-session" for mapping to a SIP Content-Disposition header value of "early-session". This enables endpoints or intermediate gateways to apply the application server model described in <cite>RFC 3960</cite>.</p>
<p>An entity that generates a content-add message for early media SHOULD specify the same codecs for both session media and early media (however, it is possible that the entity that generates the early media does not generate the session media, for example in the case of an intermediate gateway or application server; in this case the entity MUST use one of the codecs advertised by the initiator).</p>
<p>Upon receiving a content-add message specifying the use of early media, the initiator's client SHOULD acknowledge the content-add, complete any required transport negotiation, and then send a content-accept (or content-reject) to the sender. When the responder subsequently sends a session-accept message, the acceptance MUST NOT be construed to include the content definition whose disposition is "early-session".</p>
<p>In handling early media and deciding whether to generate local ringing or to play early media received from the responder or an intermediate gateway, the initiator's client SHOULD proceed as follows:</p>
<li>Once the responder has accepted the session and the session data (as opposed to early session data) has begun to flow, stop local ringing or stop playing early media.</li>
<p>&rfc3711; defines the Secure Real-time Transport Protocol, and &rfc4568; defines the SDP "crypto" attribute for signalling and negotiating the use of SRTP in the context of offer-answer protocols such as SIP. To enable the use of SRTP and gatewaying to non-XMPP technologies that make use of the "crypto" SDP attribute, we define a corresponding <crypto/> element qualified by the 'urn:xmpp:jingle:apps:rtp:1' namespace.</p>
<p>If the initiator wishes to use SRTP, the session-initiate stanza shall include an <encryption/> element, which MUST contain at least one <crypto/> element and MAY include multiple instances of the <crypto/> element. The <encryption/> element MUST be a child of the <description/> element. If the initiator requires the session to be encrypted, the <encryption/> element MUST include a 'required' attribute whose logical value is TRUE and whose lexical value is "true" or "1" &BOOLEANNOTE;, where this attribute defaults to a logical value of FALSE (i.e., a lexical value of "false" or "0").</p>
<p>The <crypto/> element is defined as empty (i.e., not containing any child elements); the XML attributes of the <crypto/> element are as follows:</p>
<li>crypto-suite -- this maps to the SDP "crypto-suite" parameter and has the same semantics (i.e., it is an identifier that describes the encryption and authentication algorithms).</li>
<li>key-params -- this maps to the SDP "key-params" parameter and has the same semantics (i.e., it provides one or more sets of keying material for the crypto-suite in question).</li>
<li>session-params -- this maps to the SDP "session-params" parameter and has the same semantics (i.e., it provides transport-specific parameters for SRTP negotiation).</li>
<li>tag -- this maps to the SDP "tag" parameter and has the same semantics (i.e., it is a decimal number used as an identifier for a particular crypto element).</li>
<p>When the responder receives a session-initiate message containing an <encryption/> element, the responder MUST do one of the following:</p>
<ol>
<li>Attempt to proceed with an encrypted session by including the acceptable credentials (i.e., the relevant <crypto/> element) in its session-accept message.</li>
<li>Attempt to proceed with an unencrypted session by not including any <crypto/> element in its session-accept message (it is up to the initiator to reject this attempt if desired).</li>
<li>Reject the initiator's offer by sending a session-terminate message with a Jingle reason of <security-error/> (typically with an RTP-specific condition of <invalid-crypto/>).</li>
</ol>
<p>Which of these the responder does is a matter of personal security policies or client configuration.</p>
<p>If the responder requires encryption but the initiator did not include an <encryption/> element in its offer, the responder MUST reject the offer by sending a session-terminate message with a Jingle reason of <security-error/> and an RTP-specific condition of <crypto-required/>.</p>
<p>If the initiator requires encryption but the responder does not include an <encryption/> element in its session acceptance, the initiator MUST terminate the session with a Jingle reason of <security-error/> and an RTP-specific condition of <crypto-required/>.</p>
<p>Informational messages can be sent by either party within the context of Jingle to communicate the status of a Jingle RTP session, device, or principal. The informational message MUST be an IQ-set containing a &JINGLE; element of type "session-info", where the informational message is a payload element qualified by the 'urn:xmpp:jingle:apps:rtp:info:1' namespace. The following payload elements are defined. <note>A <trying/> element (equivalent to the SIP 100 Trying response code) is not necessary, since each session-level message is acknowledged via XMPP IQ semantics.</note></p>
<p>Note: Because an informational message is sent in an IQ-set, the receiving party MUST return either an IQ-result or an IQ-error (normally an IQ-result simply to acknowledge receipt).</p>
<p>The <active/> payload indicates that the principal or device is again actively participating in the session after having been on mute or having put the other party on hold. The <active/> element applies to all aspects of the session, and thus does not possess a 'name' attribute.</p>
<p>The <hold/> payload indicates that the principal is temporarily not listening for media from the other party. It is RECOMMENDED for the parties to handle informational <hold/> messages as follows (where the holdee is the party that receives the hold message and the holder is the party that sends the hold message):</p>
<li>The holdee MUST keep accepting media (this ensures that the holder can immediately start sending media again when switching back from hold to active, or can send hold music or other media).</li>
<li>The holder MAY continue to send media (e.g. hold music).</li>
<li>The holder MAY silently drop all media that it receives from the holdee.</li>
<p>The <mute/> payload indicates that the principal is temporarily not sending media to the other party but continuing to accept media from the other party. The <mute/> element MAY possess a 'name' attribute whose value specifies a particular session to be muted (e.g., muting the audio aspect but not the video aspect of a voice+video chat). If no 'name' attribute is included, the recipient MUST assume that all sessions are to be muted.</p>
<p>The <ringing/> payload indicates that the device is ringing but the principal has not yet interacted with it to answer (this maps to the SIP 180 response code).</p>
<p>Before or during an RTP session, either party can share suggested application parameters with the other party by sending a Jingle stanza with an action of "description-info". The stanza shall contain only a &DESCRIPTION; element, which specifies suggested parameters for a given application type (e.g., a change to the height and width for display of a video stream). An example follows.</p>
<p>The description-info message SHOULD include only the modified codecs, not the complete set of codecs (if those codecs have not changed). Their order is NOT meaningful. Furthermore, the data provided is purely advisory; the session SHOULD NOT fail if the receiving party cannot adjust its parameters accordingly.</p>
<p>To advertise its support for Jingle RTP Sessions and specific media types for RTP, when replying to &xep0030; information requests an entity MUST return the following features:</p>
<ul>
<li>URNs for any version of this protocol that the entity supports -- e.g., "urn:xmpp:jingle:apps:rtp:1" for this version and "urn:xmpp:jingle:apps:rtp:0" for the previous version &VNOTE;</li>
<li>URNs for all of the media types that the entity supports -- e.g., "urn:xmpp:jingle:apps:rtp:audio" for RTP audio and "urn:xmpp:jingle:apps:rtp:video" for RTP video <note>Support for the "audio" or "video" media type does not necessarily mean that the application supports all sub-types associated with those media types.</note></li>
<p>In order for an application to determine whether an entity supports this protocol, where possible it SHOULD use the dynamic, presence-based profile of service discovery defined in &xep0115;. However, if an application has not received entity capabilities information from an entity, it SHOULD use explicit service discovery instead.</p>
<p>Note: It might be wondered why the responder does not accept the session and then terminate. That order would be acceptable, too, but here we assume that the responder's client has immediate information about the responder's free/busy status (e.g., because the responder is on the phone) and therefore returns an automated busy signal without requiring user interaction.</p>
<p>In this scenario, Romeo initiates a voice chat with Juliet using a transport method of ICE-UDP. The parties also exchange informational messages.</p>
<p>Once connectivity is established (which might necessitate the exchange of additional candidates via transport-info messages), the parties begin to exchange media. In this case they would use RTP to exchange audio using the Speex codec at a clockrate of 8000 since that is the highest-priority codec for the responder (as determined by the XML order of the &PAYLOADTYPE; children).</p>
<section2topic='Jingle Audio via SRTP, Negotiated with ICE-UDP'anchor='scenarios-srtp'>
<p>In this scenario, Romeo initiates a secure voice chat with Juliet using a transport method of ICE-UDP. The parties also exchange informational messages.</p>
<p>To signal that the initiator wishes to use SRTP, the initiator's client includes keying material via the <encryption/> element (with one set of keying material per <crypto/> element). Here the initiator also signals that encryption is mandatory via the 'required' attribute.</p>
<p>If the keying material is acceptable, the responder's continues with the negotiation. If the keying material is not acceptable, the responder's client terminates the session as described under <linkurl='#srtp'>Negotiation of SRTP</link>.</p>
<p>As soon as possible, the responder's client sends a session-accept message to the initiator. In this case, the session-accept message includes a <crypto/> element to indicate that the responder finds the offered keying material acceptable.</p>
<p>Once connectivity is established (which might necessitate the exchange of additional candidates via transport-info messages), the parties begin to exchange media. In this case they would use RTP to exchange audio using the Speex codec at a clockrate of 8000 since that is the highest-priority codec for the responder (as determined by the XML order of the &PAYLOADTYPE; children).</p>
<section2topic='Jingle Audio via RTP with Early Media'anchor='scenarios-earlymedia'>
<p>In this scenario, Romeo initiates a voice chat with Juliet using a transport method of ICE-UDP. There is a gateway between Romeo and Juliet, and the gateway functions as an application server by returning early media to Romeo (perhaps some late medieval hold music or an old-fashioned IVR interaction). To simplify the flow, we have left out any ringing notifications generated by Juliet.</p>
<p>Now the gateway sends a content-add message to Romeo while waiting for Juliet to pay attention to her telephony interface. It specifies a transport method of Raw UDP because it hosts its own media relay.</p>
<p>Because the gateway (on behalf of the responder) specified a transport method of Raw UDP for the early session data, the initiator then might also send a Raw UDP candidate to the gateway in a transport-info message (see <cite>XEP-0177</cite> for details).</p>
<p>Because Romeo has attempted to send test media to the gateway as described in <cite>XEP-0177</cite>, he has exposed an IP/port to which the gateway can now send early media via the media relay that it hosts.</p>
<p>Once end-to-end connectivity is established (which might necessitate the exchange of additional candidates via transport-info messages), the parties begin to exchange media; as a result, Romeo and the gateway terminate the exchange of early media (this does not necessitate exchange of a content-remove message, since the endpoint and the gateway can simply stop sending media).</p>
<p>In this scenario, Romeo initiates an audio chat with Juliet using a transport method of ICE-UDP. After the chat is active, Romeo adds video. The parties also exchange various informational messages</p>
<p>Once end-to-end connectivity is established (which might necessitate the exchange of additional candidates via transport-info messages), the parties begin to exchange media. In this case they would use RTP to exchange audio using the Speex codec at a clockrate of 8000 since that is the highest-priority codec for the responder (as determined by the XML order of the &PAYLOADTYPE; children).</p>
<p>Romeo, being an amorous young man, requests to add video to the audio chat.</p>
<p>However, Juliet is having a bad hair day, so she sends a content-modify message to Romeo that sets the senders to "initiator" (she doesn't mind receiving video but she doesn't want to send video).</p>
<p>If necessary, the parties then negotiate transport methods and codec parameters for video, and Romeo starts to send video to Juliet, where the video is exchanged using the Theora codec with a height of 600 pixels, a width of 800 pixels, a bandwidth limit of 128,000 kilobits per second, etc.</p>
<p>After some number of minutes, Juliet returns and agrees to start sending video, too.</p>
<p>Other events might occur throughout the life of the session. For example, one of the parties might want to tweak the video parameters using a description-info action.</p>
<examplecaption="Initiator sends changes to application parameters"><![CDATA[
<p>XMPP applications that use Jingle RTP sessions for voice chat MUST support and prefer native RTP methods of communicating DTMF information, in particular the "audio/telephone-event" and "audio/tone" media types. It is NOT RECOMMENDED to use the protocol described in &xep0181; for communicating DTMF information with RTP-aware endpoints.</p>
<section2topic='When to Listen for Audio'anchor='impl-listen'>
<p>When the Jingle RTP content type is accepted via a session-accept action, both initiator and responder SHOULD start listening for audio as defined by the negotiated transport method and audio application format. For interoperability with telephony systems, after the responder acknowledges the session initiation request, the responder SHOULD send a "ringing" message and both parties SHOULD play any audio received. For more detailed suggestions in the context of early media, see under <linkurl='#earlymedia'>Early Media</link>.</p>
<p>In order to secure the data stream, implementations SHOULD use encryption methods appropriate to the RTP data transport. It is RECOMMENDED to use SRTP as defined in the <linkurl='#srtp'>Negotiation of SRTP</link> section of this document. The SRTP keying material SHOULD (1) be tied to a separate, secure connection such as provided by DTLS (&rfc4347;) where the keys are established as described in &dtlssrtp; and/or (2) protected by sending the Jingle signalling over a secure channel that protects the confidentiality and integrity of the SRTP-related signalling data.</p>
<p>While it is also possible to use native RTP methods, such as &zrtp; as described at <<linkurl='http://xmpp.org/extensions/inbox/jingle-zrtp.html'>http://xmpp.org/extensions/inbox/jingle-zrtp.html</link>>, this specification does not actively encourage or discourage the use of such methods.</p>
<p>Upon advancement of this specification from a status of Experimental to a status of Draft, the ®ISTRAR; shall add the foregoing namespaces to the registry located at &NAMESPACES;, as described in Section 4 of &xep0053;.</p>
<p>For each RTP media type that an entity supports, it MUST advertise support for the "urn:xmpp:jingle:apps:rtp:[media]" feature, where the string "[media]" is replaced by the appropriate media type such as "audio" or "video".</p>
<p>Thanks to Milton Chen, Paul Chitescu, Olivier Crête, Tim Julien, Steffen Larsen, Jeff Muller, Mike Ruprecht, Sjoerd Simons, Will Thompson, Justin Uberti, Unnikrishnan Vikrama Panicker, and Paul Witty for their feedback.</p>