xeps/xep-0340.xml

681 líneas
28 KiB
XML

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE xep SYSTEM 'xep.dtd' [
<!ENTITY % ents SYSTEM 'xep.ent'>
%ents;
]>
<?xml-stylesheet type='text/xsl' href='xep.xsl'?>
<xep>
<header>
<title>COnferences with LIghtweight BRIdging (COLIBRI)</title>
<abstract>This specification defines an XMPP extension that allows
real-time communications clients to discover and interact with
conference bridges that provide conference mixing or relaying
capabilities.
</abstract>
&LEGALNOTICE;
<number>0340</number>
<status>Deferred</status>
<type>Standards Track</type>
<sig>Standards</sig>
<approver>Council</approver>
<dependencies>
<spec>XEP-0167</spec>
</dependencies>
<supersedes/>
<supersededby/>
<shortname>colibri</shortname>
<discuss>jingle</discuss>
<author>
<firstname>Emil</firstname>
<surname>Ivov</surname>
<org>jitsi.org</org>
<email>emcho@jitsi.org</email>
<jid>emcho@sip-communicator.org</jid>
</author>
<author>
<firstname>Lyubomir</firstname>
<surname>Marinov</surname>
<org>jitsi.org</org>
<email>lubo@jitsi.org</email>
<jid>lubo@sip-communicator.org</jid>
</author>
&fippo;
<revision>
<version>0.2</version>
<date>2017-09-11</date>
<initials>XEP Editor (jwi)</initials>
<remark>Defer due to lack of activity.</remark>
</revision>
<revision>
<version>0.1</version>
<date>2014-01-08</date>
<initials>psa</initials>
<remark><p>Initial published version approved by the XMPP Council.</p></remark>
</revision>
<revision>
<version>0.0.2</version>
<date>2013-12-04</date>
<initials>ei/ph</initials>
<remark>
<p>Added usecases.</p>
</remark>
</revision>
<revision>
<version>0.0.1</version>
<date>2013-05-30</date>
<initials>ei/lm</initials>
<remark>
<p>First draft.</p>
</remark>
</revision>
</header>
<section1 topic='Introduction' anchor='intro'>
<p>
&xep0298; defines a way for XMPP agents to establish and
participate in tightly coupled conference calls. Such conference
calls would typically involve a number of regular participants
that establish direct one-to-one sessions with a single entity,
often referred to as a focus agent. Focus agents are generally
responsible for making sure that media sent from one call
participant would be distributed to all others so that everyone
would effectively hear or see everyone else. In other words they
often act as media mixers.
</p>
<p>
Depending on the mixing technology used by media mixers, they may
require significant bandwidth, processing resources or both. It is
hence common for mixers to be hosted on dedicated servers that can
provide such resources. They are then made reachable as
rendez-vous points and conference call participants are required
to call in, in order to join a conference call. This requires a
certain amount of pre-call configuration to be completed by the
service maintainers in order to create conference rooms and grant
proper access to the expected participants. The authorization
credentials are then often relayed to the participants in
preparation of the call by other means, such as IM or mail.
</p>
<p>
In certain situations, such pre-call preparations are
inconvenient and it is important for users to be able to establish
ad-hoc conference calls. One way to achieve this is for user
agents themselves to act as focus agents and media mixers.
Everyone else just calls the user at the focus agent, who then can
decide whether to accept or reject the calls as they arrive. This
works particularly well for audio only calls as the amount of
bandwidth and processing resources that they require is generally
within reach for end-user devices.
</p>
<p>
The situation is quite different for video calls. Media decoding
and especially encoding require considerably more resources with
video than they do with audio. Today, encoding a single video flow
with an acceptable quality is often the maximum that can be
expected from an end-user device. The advantages that come with
Moore's law will likely be insufficient to improve this, given the
massive shift toward mobile devices and the ever-increasing user
expectations toward video quality.
</p>
<p>
Therefore, this specification (COLIBRI) aims to provide a means
for user agents to interact with conference mixers. Such
interaction allows user agents to allocate mixing channels,
indicate what conferences they should be attached to, what
integers the various payload types map to, etc. Using COLIBRI
would hence allow any user agent to organize conference calls and
act as a signalling focus by outsourcing the actual media mixing
to a dedicated mixer.
</p>
</section1>
<section1 topic='Terminology' anchor='terms'>
<dl>
<di>
<dt>
Focus or Focus Agent
</dt>
<dd>
The terms apply to XMPP agents that, in terms of signalling,
stand at the center of a tightly-coupled conference call. In
other words, all conference participants establish a &xep0166;
session with and only with that agent. Focus Agents are not
necessarily performing media mixing themselves. In fact, the
very purpose of this specification is to provide them with a
means of handling this elsewhere.
</dd>
</di>
<di>
<dt>
Mixer or Bridge
</dt>
<dd>
Throughout this document the term is used to depict an
entity that is responsible for mixing and delivering to
conference participants all media exchanged in a conference
call. Mixers or bridges can provide their service by either
performing Content Mixing, or RTP translation or both.
</dd>
</di>
<di>
<dt>
Content Mixing
</dt>
<dd>
The term refers to a kind of media processing where the
content of multiple input RTP streams is "mixed" into a single
output stream. In conference calls audio mixing
implementations generally simply add and adjust all source
audio streams to produce their single output stream. Video
content mixing, on the other hand, is often implemented by
creating composite images containing individual frames from
the input streams. Another common implementation consists in
producing an output that is identical to one of the input
streams, often the one belonging to a currently active
speaker.
</dd>
</di>
<di>
<dt>
RTP Translation
</dt>
<dd>
<cite>RFC 3550</cite> defines a translator as "an intermediate
system that forwards RTP packets with their synchronization
source identifier intact." This specification respects that
definition but it also uses "RTP Translation" in opposition
with "Content Mixing". Conference bridges that perform RTP
translation simply redirect each incoming RTP packet to all
participants, often excluding the one where it came from.
Contrary to content mixing, rtp translation generally requires
less processing resources since it does not involve media
manipulation. Bandwidth requirements on the other hand, could
be significantly higher with RTP translation than with content
mixing.
</dd>
</di>
</dl>
</section1>
<section1 topic='Requirements' anchor='reqs'>
<p>
The extension defined herein is designed to meet the following
requirements:
</p>
<ol>
<li>
Provide a means for conference focus agents to interact with
conference mixers in order to configure payload type mappings
and allocate ports or other resources that they could then
advertise in Jingle sessions so that all media would traverse
the bridge.
</li>
<li>Impose no COLIBRI specific requirements on non-focus
participants so that any Jingle supporting client would be able
to participate.
</li>
<li>[TODO] Anything else?</li>
</ol>
</section1>
<section1 topic='How It Works' anchor='howitworks'>
<p>
This section provides a friendly introduction to COLIBRI.
</p>
<p>
In essence, the goal of COLIBRI is to provide focus agents with a
way of using remote mixers as if as they were available locally.
The most important part of that is the possibility to allocate
ports on the mixer interfaces and then use these ports when
establishing Jingle sessions with the various participants.
</p>
<p>
Every participant in the conference call is assigned one port for
RTP data and one for RTCP. An RTP/RTCP port couple is called a
channel. Each participant would use one channel per media type.
That is, a client participating with audio only would get one
channel, while another one that joins with both audio and video
would get two: one for audio and one for video.
</p>
<p>
Channels are used for streams from the bridge to participants and
from participants to the bridge. Typically a channel would contain
one stream from a participant to a bridge, for example their
webcam or desktop, and one or more streams in the opposite
direction (e.g. webcam or desktop streams from other participants
to the one using this channel). This is not a requirement though
and a channel can certainly be used for transportation of multiple
streams in both directions in cases where one bridge is connected
to another.
</p>
<p>
Typically channels would be created by the entity controlling a
conference call. This could either be a conferencing server or a
smart client capable of handling conferences. We would refer to
both of these as the "focus". In either case, the important part
is that the focus terminates all signalling. It is a signalling
endpoint and it is responsible for all aspects of call signalling
including offer/answer.
</p>
<p>
In other words, when setting up a conference, a focus would first
allocate the necessary channels, then directly initiate sessions
(invite) other participants into the call. Only, when sending the
invitations to these participants, the focus would use the
transport information (addresses and ports) that it would have
received from the COLIBRI bridge, rather than its own.
</p>
<!--
discovery
creating a conference call
modifying a conference call
defining payload types
supports transcoding
get supported formats
Mixing vs. RTP translation
RTP/RTCP channels
RAW UDP
ICE UDP
ZRTP
composited / switched
-->
</section1>
<section1 topic='Use cases' anchor='usecases'>
<section2 topic='Creating a conference' anchor='usecases-create'>
<p>
The most important thing about setting up a conference is the
creation of channels for every participant. Conference setup is
not the only chance an organiser gets to declare all
participants but typically when a conference call is setup it
is because there are at least some number of known participants
and there would be no point in delaying channel creation for
them.
</p>
<p>
The following example shows how Romeo creates an audio/video
conference at a bridge, requesting that three participant
channels be created.
</p>
<example caption='Creating a conference'><![CDATA[
SEND: <iq from='romeo@montague.lit/orchard'
id='zid615d9'
to='garden.montague.lit'
type='set'>
<conference xmlns='http://jitsi.org/protocol/colibri'>
<content name='audio'>
<channel initiator='true'>
[optional payload and transport description]
</channel>
<channel initiator='true'/>
...
<channel initiator='true'/>
</content>
<content name='video'>
<channel initiator='true'/>
...
<channel initiator='true'/>
</content>
</conference>
</iq>
]]></example>
<p>
Notice how the 'initiator' channel above is set to true. The
setting determines ICE and DTLS/SRTP behaviour for the bridge.
In this specific case, 'initiator' being set to 'true' Romeo is
requesting that the bridge behave as the initiator of the
session which means that it would try to be the controlling ICE
agent and also assume the 'dtls-actpass' role for DTLS/SRTP
negotiation. A value of 'false' would have meant that the bridge
would behave as the controlled ICE agent and assume the
'dtls-active' role.
</p>
<p>
When sending its result back, the bridge confirms creation of
the requested channels and it also delivers transport
information that would be necessary for participants to
transport media to the bridge. These would most often include
ICE candidates, ufrag and pwd parameters, and DTLS fingerprints.
</p>
<p>
Note that ICE is not mandatory for use and COLIBRI bridges can
just as well perform Hosted NAT Traversal using latching and a
RAW-UDP transport.
</p>
<example caption='Result'><![CDATA[
RECV: <iq type='result' to='romeo@montague.lit/orchard'
from='garden.montague.lit'
id='zid615d9'>
<conference xmlns='http://jitsi.org/protocol/colibri'
id='cafb6f2c8197818e'>
<content name='audio'>
<channel rtp-level-relay-type='mixer' direction='recvonly'
initiator='true' id='c6a142b7cf728fd0' expire='60'>
<source xmlns='urn:xmpp:jingle:apps:rtp:ssma:0' ssrc='3716204482'/>
<transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'
pwd='5d0mj' ufrag='amiq'>
<fingerprint xmlns='urn:xmpp:jingle:apps:dtls:0' hash='sha-1'>
99:...:F6
</fingerprint>
<candidate type='host' protocol='udp' id='ca' ip='10.0.1.1'
component='1' port='5144' foundation='3' generation='0'
priority='2113932031' network='0'/>
<candidate type='host' protocol='udp' id='ga' ip='10.0.1.1'
component='2' port='5145' foundation='3' generation='0'
priority='2113932030' network='0'/>
</transport>
</channel>
...
</content>
<content name='video'>
<channel rtp-level-relay-type='mixer' direction='recvonly'
initiator='true' id='c9726594ccb4ede7' expire='60'>
...
</channel>
...
</content>
</conference>
</iq>
]]></example>
<p>
The above "result" also contains the following elements of
interest for every channel:
</p>
<p>
- an ID that is necessary for any further modification from that
the focus wants to set on a channel. [FIXME: clients should be
able to specify these id-s so as not to rely on ordering to
identify channels and get complexes thinking they are SDP
parsers]
</p>
<p>
- an rtp-level-relay-type attribute with possible values of
'mixer' and 'translator' indicating how the bridge is going to
deliver data on a specific channel [FIXME: this would definitely
need to be specifiable from the client].
</p>
<p>
- mixer channels would also include ssrc-s for that channels in
question. This is particularly necessary when SSRC-s need to be
announced to participants (because people never learned how RTP
works and are afraid from anything that wasn't explicitly
announced with an Offer/Answer exchange). Generally such
announcements would be possible by simply propagating SSRCs that
other participants announce. In a mixed flow however the SSRC
would belong to the mixer (or COLIBRI bridge) so it would need
to be known in advance.
attribute
</p>
<p>
- the initiator value is echoed
</p>
<p>
- expire describes how many seconds the bridge will keep the
channel open without media activity
</p>
</section2>
<section2 topic='Updating payload information for a channel' anchor='usecases-update-payload'>
<p>
Channel updates can happen for various reasons. The following
examples illustrate two of them:
</p>
<p>
- specifying payload types. While payload types in RTP are
sometimes static (e.g. for older codecs such as G.711), this is
not always the case for more recent types, which need to be
assigned dynamically during session establishment. The tricky
part here is that dynamic means dynamic so every participant in
a conference call may end up expecting different payload types.
As a result, a COLIBRI bridge SHOULD know about everyone's
expectations, which is why channels are updated with payload
types. Note that if a bridge does see unknown payload types it
MUST still relay them to other participants as they might have
used some other mechanism to make sure they know what they
mean.
</p>
<p>
- DTLS/SRTP fingerprints.
</p>
<example caption='Focus updates payload information for a channel'>
<![CDATA[
SEND: <iq to='garden.montague.lit' from='romeo@montague.lit/orchard'
type='set' id='74s'>
<conference xmlns='http://jitsi.org/protocol/colibri'
id='cafb6f2c8197818e'>
<content name='audio'>
<channel id='c6a142b7cf728fd0' initiator='true'>
<payload-type id='111' name='opus' clockrate='48000' channels='2'/>
<payload-type id='0' name='PCMU' clockrate='8000' channels='1'/>
<payload-type id='8' name='PCMA' clockrate='8000' channels='1'/>
<transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'
ufrag='WP+qAQZGnDhhM+87' pwd='0Uxdzy9gTryxAkmAx2LD1TYR'>
<fingerprint hash='sha-256' xmlns='urn:xmpp:jingle:apps:dtls:0'
setup='active'>
08:...:C7
</fingerprint>
</transport>
</channel>
</content>
<content name='video'>
<channel id='c9726594ccb4ede7' initiator='true'>
<payload-type id='100' name='VP8' clockrate='90000' channels='1'/>
<payload-type id='116' name='red' clockrate='90000' channels='1'/>
<payload-type id='117' name='ulpfec' clockrate='90000' channels='1'/>
<transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'
ufrag='hpx55NAN46sNYwF+' pwd='hkg/YRpjXZx4qDG3KbzB3qr1'>
<fingerprint hash='sha-256' xmlns='urn:xmpp:jingle:apps:dtls:0'
setup='active'>
08:...:C7
</fingerprint>
</transport>
</channel>
</content>
</conference>
</iq>
]]></example>
<p>
Note that while the result in this case is essentially an
acknowledgement, it still carries a full representation of the
bridge.
</p>
<example caption='Result'><![CDATA[
RECV: <iq to='romeo@montague.lit/orchard' from='garden.montague.lit'
type='result' id='74s'>
<conference xmlns='http://jitsi.org/protocol/colibri'
id='cafb6f2c8197818e'>
<content name='audio'>
<channel rtp-level-relay-type='mixer' direction='recvonly'
initiator='true' id='c6a142b7cf728fd0' expire='60'>
<transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'
pwd='5d0mj' ufrag='amiq'>
<fingerprint xmlns='urn:xmpp:jingle:apps:dtls:0' hash='sha-1'>
99:...:F6
</fingerprint>
<candidate type='host' protocol='udp' id='ca' ip='10.0.1.1' component='1'
port='5144' foundation='3' generation='0' priority='2113932031'
network='0'/>
<candidate type='host' protocol='udp' id='ga' ip='10.0.1.1' component='2'
port='5145' foundation='3' generation='0' priority='2113932030'
network='0'/>
</transport>
</channel>
...
</content>
<content name='video'>
<channel rtp-level-relay-type='translator' initiator='true'
id='c9726594ccb4ede7' expire='60'>
...
</channel>
...
</content>
</conference>
</iq>
]]></example>
</section2>
<section2 topic='Updating transport information for a channel' anchor='usecases-update-transport'>
<example caption='Result'><![CDATA[
SEND: <iq to='garden.montague.lit' from='romeo@montague.lit/orchard'
type='set' id='1581'>
<conference xmlns='http://jitsi.org/protocol/colibri' id='cafb6f2c8197818e'>
<content name='audio'>
<channel id='c6a142b7cf728fd0' initiator='true'>
<transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'>
<candidate foundation='2241210590' component='1' protocol='udp'
priority='2113937151' ip='192.168.2.101' port='61141' type='host'
generation='0' network='1' id='1234'/>
</transport>
</channel>
</content>
</conference>
</iq>
]]></example>
<!--
<p>FIXME: Describe trickle ice?</p>
<p>FIXME: ufrag/pwd missing? required?</p>
-->
<example caption='Result'><![CDATA[
RECV: <iq type='result' to='romeo@montague.lit/orchard' from='garden.montague.lit' id='1581'>
<conference xmlns='http://jitsi.org/protocol/colibri' id='cafb6f2c8197818e'>
<content name='audio'>
<channel rtp-level-relay-type='translator' initiator='true'
id='c6a142b7cf728fd0' expire='60'>
<transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'
pwd='5d0mja27fgl83r9tsl1b9gkk4f' ufrag='amiqp'>
<fingerprint xmlns='urn:xmpp:jingle:apps:dtls:0' hash='sha-1'>
99:..:F6
</fingerprint>
<candidate type='host' protocol='udp' id='1' ip='10.0.1.1' component='1'
port='5144' foundation='3' generation='0' priority='2113932031' network='0'/>
<candidate type='host' protocol='udp' id='2' ip='10.0.1.1' component='2'
port='5145' foundation='3' generation='0' priority='2113932030' network='0'/>
</transport>
</channel>
</content>
</conference>
</iq>
]]></example>
<p>Essentially that information is the transport description from
the bridge. <!-- Do we need that? FIXME -->
</p>
</section2>
<section2 topic='Adding a new channel' anchor='usecases-addchannel'>
<p>
ICE candidates are another reason why a focus might want to
update a channel. Earlier examples indicated how conference
setup could be completed without providing any transport
information whatsoever. Whenever that is the case, such
information would need to be provided through channel
modification.
</p>
<example caption='Adding a new channel'><![CDATA[
SEND: <iq to='garden.montague.lit' from='romeo@montague.lit/orchard' type='get' id='247'>
<conference xmlns='http://jitsi.org/protocol/colibri' id='cafb6f2c8197818e'>
<content creator='initiator' name='audio'>
<channel initiator='true'/>
</content>
<content creator='initiator' name='video'>
<channel initiator='true'/>
</content>
</conference>
</iq>
]]></example>
<example caption='Result'><![CDATA[
RECV: <iq type='result' to='romeo@montague.lit/orchard' from='garden.montague.lit' id='247'>
<conference xmlns='http://jitsi.org/protocol/colibri' id='49a91b4f6694bc6a'>
<content name='audio'>
<channel rtp-level-relay-type='mixer' direction='recvonly' initiator='true'
id='e97d7f794fbed74b' expire='60'>
<source xmlns='urn:xmpp:jingle:apps:rtp:ssma:0' ssrc='2579640556'/>
<transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'
pwd='75e88spurhqbih628ord5a3l9b' ufrag='6c7a6'>
<fingerprint xmlns='urn:xmpp:jingle:apps:dtls:0' hash='sha-1'>
A9:...:2F
</fingerprint>
<candidate type='host' protocol='udp' id='1' ip='10.0.1.1' component='1'
port='5168' foundation='3' generation='0' priority='2113932031'
network='0'/>
<candidate type='host' protocol='udp' id='2' ip='10.0.1.1' component='2'
port='5169' foundation='3' generation='0' priority='2113932030'
network='0'/>
</transport>
</channel>
</content>
<content name='video'>
<channel rtp-level-relay-type='translator' initiator='true'
id='b4557f274216f99a' expire='60'>
<transport xmlns='urn:xmpp:jingle:transports:ice-udp:1'
pwd='1oriuuhjfq6884ln9d3g1rjq13' ufrag='3gh3o'>
<fingerprint xmlns='urn:xmpp:jingle:apps:dtls:0' hash='sha-1'>
D7:...:C2
</fingerprint>
<candidate type='host' protocol='udp' id='1' ip='10.0.1.1' component='1'
port='5170' foundation='3' generation='0' priority='2113932031'
network='0'/>
<candidate type='host' protocol='udp' id='2' ip='10.0.1.1' component='2'
port='5171' foundation='3' generation='0' priority='2113932030'
network='0'/>
</transport>
</channel>
</content>
</conference>
</iq>
]]></example>
</section2>
</section1>
<!--
<section1 topic='Use with Jingle' anchor='use-with-jingle'>
<p>TODO: a non-normative section showing the flow of jingle sessions and bridge interaction</p>
</section1>
-->
<section1 topic='Determining Support' anchor='support'>
<p>If an entity supports COLIBRI, it SHOULD advertise that fact by
returning
a feature of "http://jitsi.org/protocol/colibri" in response to
a &xep0030;
information request.
</p>
<example caption="Service Discovery Information Request"><![CDATA[
<iq from='kingclaudius@shakespeare.lit/castle'
id='ku6e51v3'
to='belfry.shakespeare.lit'
type='get'>
<query xmlns='http://jabber.org/protocol/disco#info'/>
</iq>
]]></example>
<example caption="Service Discovery Information Response"><![CDATA[
<iq from='belfry.shakespeare.lit'
id='ku6e51v3'
to='kingclaudius@shakespeare.lit/castle'
type='result'>
<query xmlns='http://jabber.org/protocol/disco#info'>
<feature var='http://jitsi.org/protocol/colibri'/>
</query>
</iq>
]]></example>
<p>In order for an application to determine whether an entity
supports this
protocol, where possible it SHOULD use the dynamic, presence-based
profile
of service discovery defined in &xep0115;. However, if an
application has
not received entity capabilities information from an entity, it
SHOULD use
explicit service discovery instead.
</p>
</section1>
<section1 topic='Security Considerations' anchor='security'>
<p>PENDING</p>
</section1>
<section1 topic='Open Issues' anchor='issues'>
<p>PENDING</p>
</section1>
<section1 topic='XML Schemas' anchor='schema'>
<p>PENDING</p>
</section1>
<section1 topic='Acknowledgements' anchor='acks'>
<p>Jitsi's participation in this specification is funded by the
NLnet
Foundation.
</p>
</section1>
</xep>