xeps/xep-0426.xml

<?xml version='1.0' encoding='UTF-8'?>
<!DOCTYPE xep SYSTEM 'xep.dtd' [
  <!ENTITY % ents SYSTEM 'xep.ent'>
  %ents;
]>
<?xml-stylesheet type='text/xsl' href='xep.xsl'?>
<xep>
<header>
  <title>Character counting in message bodies</title>
  <abstract>
    This document describes how to correctly count characters in message bodies.
    This is required when referencing a position in the body.
  </abstract>
  &LEGALNOTICE;
  <number>0426</number>
  <status>Experimental</status>
  <type>Informational</type>
  <sig>Standards</sig>
  <approver>Council</approver>
  <dependencies/>
  <supersedes/>
  <supersededby/>
  <shortname>charcount</shortname>
  &larma;
  <revision>
    <version>0.3.0</version>
    <date>2022-12-27</date>
    <initials>lmw</initials>
    <remark>Added section about subsequences.</remark>
  </revision>
  <revision>
    <version>0.2.0</version>
    <date>2020-01-02</date>
    <initials>mw</initials>
    <remark>Include feedback/clarifications from list.</remark>
  </revision>
  <revision>
    <version>0.1.0</version>
    <date>2019-12-26</date>
    <initials>mb</initials>
    <remark>Promote to Experimental as per Council decision.</remark>
  </revision>
  <revision>
    <version>0.0.1</version>
    <date>2019-12-15</date>
    <initials>mw</initials>
    <remark>Initial attempt to finalize the discussions.</remark>
  </revision>
</header>

<section1 topic='Introduction' anchor='intro'>
  <p>
    Various use-cases require the possibility to reference a part of the message
    body or a specific position in it. This was realized by providing offsets
    from the beginning of the message (when referencing a region, those offsets
    would define begin and end of a region). XEPs doing so include &xep0301;,
    &xep0372; (and thereof derived &xep0385;) and &xep0394;.
  </p>
  <p>
    For these use-cases, it is highly relevant to decide how to count "characters"
    in a message body. While it at first sounds trivial, there are various ways
    of doing so in modern font systems. The purpose of this XEP is to define how
    characters shall be counted for the purpose of the aforementioned XEPs and
    any future XEP relying on a similar feature.
  </p>
</section1>

<section1 topic='Character counting' anchor='counting'>
  <p>
    When counting characters in a body, they shall be counted by their
    <strong>number of Unicode code points</strong>. Message bodies must be used
    as strings of the XML characters (as defined in §2.2 of &w3xml;). This means
    that, i.e. no Unicode normalization may be performed before determining
    offsets when receiving or after determining offsets when sending.
    Any kind of further body processing shall be performed after counting (e.g.
    <tt>/me·</tt><note>The middle dot is used to represent a space character
    and is not meant to be taken verbatim.</note> as described in &xep0245; is
    always counted as 4 characters without considering the sending user's name).
    All references (as defined in §4.1 of &w3xml;) must be counted by their
    referenced character(s) and not the reference characters (e.g. the encoded
    <tt>&amp;amp;</tt> is counted as one decoded character <tt>&amp;</tt>).
  </p>
  <table caption='Example strings and their counted length'>
    <tr>
      <th>String</th>
      <th>Grapheme cluster</th>
      <th>UTF-8 bytes</th>
      <th>UTF-16 units (2 bytes)</th>
      <th>Code points</th>
    </tr>
    <tr>
      <td>Hello, world!</td>
      <td>13</td>
      <td>13</td>
      <td>13</td>
      <td>13</td>
    </tr>
    <tr>
      <td>You &amp; Me</td>
      <td>8</td>
      <td>8</td>
      <td>8</td>
      <td>8</td>
    </tr>
    <tr>
      <td>こんにちは世界</td>
      <td>7</td>
      <td>21</td>
      <td>7</td>
      <td>7</td>
    </tr>
    <tr>
      <td>🧛🏾 👨‍👨‍👦‍👦 🇺🇳</td>
      <td>
        5
        <note>
          There are spaces between the emojis. You may also perceive this as
          more than 5 glyphs if your font or display engine does not support
          the required Unicode version.
        </note>
      </td>
      <td>43</td>
      <td>21</td>
      <td>13</td>
    </tr>
  </table>
  <section2 topic='Illegal offsets' anchor='illegal-offsets'>
    <p>
      As grapheme clusters may consist of multiple code points, a code point
      offset might be illegal if it points inside a grapheme cluster.
    </p>
    <p>
      However, receiving entities SHOULD NOT consider illegal offsets invalid,
      as different Unicode versions may have different understanding of what a
      grapheme cluster is. Instead, receiving entities may choose one of the
      following behaviors:
    </p>
    <ul>
      <li>
        Split the grapheme cluster into multiple graphemes. In most cases, this
        is closest to the intended behavior. Many font display engines will do
        this automatically as needed.
      </li>
      <li>
        When the offset defines the end of a region, include the full grapheme
        cluster in the region. Otherwise, take the offset as if it pointed to
        the beginning of the grapheme cluster.
      </li>
    </ul>
  </section2>
  <section2 topic='Developer notes' anchor='developer-notes'>
    <p>
      Some programming languages include a string type that operates directly on
      Unicode code points. If these types are used, offset numbers can be used
      as-is in string operations. Popular examples of such programming languages
      are Python and Haskell.
    </p>
    <p>
      Other programming languages include a string type that operates on UTF-16
      units. As can be seen in the table above, those match the number of code
      points in many cases and thus are sometimes confused to be the same.
      Popular examples of such programming languages are C#, Java and
      JavaScript.
    </p>
    <p>
      C/C++ includes a wide character and string type. Those behave differently
      across platforms and as such should be used with care.
    </p>
  </section2>
  <section2 topic='Rationale' anchor='rationale'>
    <p>
      The most obvious way of counting characters is to count them how humans
      would. This sounds easy when only having western scripts in mind but becomes
      more complicated in other scripts and most importantly is not well-defined
      across Unicode versions. New unicode versions regularly added new
      possibilities to build grapheme clusters, including from existing code
      points. To be forward compatible, counting grapheme clusters, graphemes,
      glyphs or similar is thus not an option.
      This leaves basically the two options of using the number of code units of
      the encoded string or the number of code points.
    </p>
    <p>
      The main advantage of using the code units would be that those are native to
      many programming languages, easing the task for developers.
      However programming languages do not share a common encoding for their
      string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the
      internal encoding from the developer and only presents it in code points),
      so there is no best pick here.
      If one was to choose an encoding, the best choice would be UTF-8, the native
      encoding of XMPP. However this makes counting bytes a more complex task for
      programming languages that use a different encoding like UTF-16, as strings
      would need to be transcoded first.
    </p>
    <p>
      Counting code points has the advantage that offset counts cannot point
      inside a code point. This could happen when using code units of any encoding
      that may use more than one unit to represent a code point (such as UTF-8 and
      UTF-16).
      If an offset count points inside a code point, that would be an invalid
      offset, raising more uncertainty of the correct behavior in such cases. Most
      notably the opportunity of splitting (as it exists for grapheme cluster) is
      not an option in that case, because splitting a code point would not create
      any usable output.
      Counting code points is widely supported in programming languages and can
      easily be implemented for encoded strings when not.
      The &w3xml; standard also defines a character as a unicode code point, thus
      counting code points is equivalent to counting XML characters.
    </p>
  </section2>
</section1>

<section1 topic='Subsequences' anchor='subsequence'>
  <p>
    When referencing a subsequence of the characters of a message body, the
    begin and end of the subsequence should be provided by two numbers, denoting
    the number of characters (counted as described above) before the begin of the
    subsequence or before the end of the subsequence, respectively. In other
    words, the begin is the index of the first character in the subsequence and
    the end is the index following the last character in the subsequence. That 
    means, if a subsequence covers the full body, its begin should be given as 
    0 and its end should be given as the number of characters in the body.
  </p>

  <section2 topic='Developer notes' anchor='subsequence-developer-notes'>
    <p>
      Subsequence indexing in various programming languages match the convention
      described here. When using Python, the subsequence created by
      <tt>body[begin:end]</tt> matches all requirements of this document.
    </p>
    <p>
      Some programming languages define subsequences by offset and length. In
      this case, begin matchs the offset while end-begin matches the length.
    </p>
  </section2>
  <section2 topic='Rationale' anchor='subsequence-rationale'>
    <p>
      The convention for subsequences was choosen because it has three main
      advantages: It matches subsequence indexing in various programming
      languages, end minus begin of a subsequence equal the length of the
      subsequence and the end of the first of two adjacent subsequence matches the
      begin of the second one.
    </p>
  </section2>
</section1>

<section1 topic='Glossary' anchor='glossary'>
  <p>
    Unicode terminology used across this document, can be looked up in the
    Unicode glossary at <link url='https://www.unicode.org/glossary/'>https://www.unicode.org/glossary/</link>.
  </p>
</section1>

<section1 topic='IANA Considerations' anchor='iana'>
  <p>This document requires no interaction with &IANA;. </p>
</section1>

<section1 topic='XMPP Registrar Considerations' anchor='registrar'>
  <p>This document requires no interaction with &REGISTRAR;.</p>
</section1>

<section1 topic='Acknowledgements' anchor='acknowledgements'>
  <p>
    The author would like to thank Guus der Kinderen, Ralph Meijer, Jonas
    Schäfer, Lance Stout and others that provided feedback.
  </p>
</section1>

</xep>
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`<?xml version='1.0' encoding='UTF-8'?>`
			`<!DOCTYPE xep SYSTEM 'xep.dtd' [`
			`<!ENTITY % ents SYSTEM 'xep.ent'>`
			`%ents;`
			`]>`
			`<?xml-stylesheet type='text/xsl' href='xep.xsl'?>`
			`<xep>`
			`<header>`
			`<title>Character counting in message bodies</title>`
			`<abstract>`
			`This document describes how to correctly count characters in message bodies.`
			`This is required when referencing a position in the body.`
			`</abstract>`
			`&LEGALNOTICE;`
XEP-0426: Promote charcount ProtoXEP to informational As per Council vote. https://mail.jabber.org/pipermail/standards/2019-December/036757.html + Ge0rG on-list: https://mail.jabber.org/pipermail/standards/2019-December/036758.html Signed-off-by: Maxime “pep” Buquet <pep@bouah.net> 2019-12-26 16:16:23 -05:00			`<number>0426</number>`
XEP-0426: Character Counting 0.3.0 Added section about subsequences. 2022-12-27 16:12:17 -05:00			`<status>Experimental</status>`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`<type>Informational</type>`
			`<sig>Standards</sig>`
XEP-0426: Character Counting 0.3.0 Added section about subsequences. 2022-12-27 16:12:17 -05:00			`<approver>Council</approver>`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`<dependencies/>`
			`<supersedes/>`
			`<supersededby/>`
			`<shortname>charcount</shortname>`
XEP-0426: Character Counting 0.3.0 Added section about subsequences. 2022-12-27 16:12:17 -05:00			`&larma;`
			`<revision>`
			`<version>0.3.0</version>`
			`<date>2022-12-27</date>`
			`<initials>lmw</initials>`
			`<remark>Added section about subsequences.</remark>`
			`</revision>`
charcount 0.2: Include feedback/clarifications from list 2020-01-02 08:05:07 -05:00			`<revision>`
			`<version>0.2.0</version>`
			`<date>2020-01-02</date>`
			`<initials>mw</initials>`
			`<remark>Include feedback/clarifications from list.</remark>`
			`</revision>`
XEP-0426: Promote charcount ProtoXEP to informational As per Council vote. https://mail.jabber.org/pipermail/standards/2019-December/036757.html + Ge0rG on-list: https://mail.jabber.org/pipermail/standards/2019-December/036758.html Signed-off-by: Maxime “pep” Buquet <pep@bouah.net> 2019-12-26 16:16:23 -05:00			`<revision>`
XEP-0426: Change version to 0.x Signed-off-by: Maxime “pep” Buquet <pep@bouah.net> 2019-12-26 18:42:20 -05:00			`<version>0.1.0</version>`
XEP-0426: Promote charcount ProtoXEP to informational As per Council vote. https://mail.jabber.org/pipermail/standards/2019-December/036757.html + Ge0rG on-list: https://mail.jabber.org/pipermail/standards/2019-December/036758.html Signed-off-by: Maxime “pep” Buquet <pep@bouah.net> 2019-12-26 16:16:23 -05:00			`<date>2019-12-26</date>`
			`<initials>mb</initials>`
			`<remark>Promote to Experimental as per Council decision.</remark>`
			`</revision>`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`<revision>`
			`<version>0.0.1</version>`
			`<date>2019-12-15</date>`
			`<initials>mw</initials>`
			`<remark>Initial attempt to finalize the discussions.</remark>`
			`</revision>`
			`</header>`

			`<section1 topic='Introduction' anchor='intro'>`
			`<p>`
			`Various use-cases require the possibility to reference a part of the message`
charcount 0.2: Include feedback/clarifications from list 2020-01-02 08:05:07 -05:00			`body or a specific position in it. This was realized by providing offsets`
			`from the beginning of the message (when referencing a region, those offsets`
			`would define begin and end of a region). XEPs doing so include &xep0301;,`
			`&xep0372; (and thereof derived &xep0385;) and &xep0394;.`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`</p>`
			`<p>`
charcount 0.2: Include feedback/clarifications from list 2020-01-02 08:05:07 -05:00			`For these use-cases, it is highly relevant to decide how to count "characters"`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`in a message body. While it at first sounds trivial, there are various ways`
			`of doing so in modern font systems. The purpose of this XEP is to define how`
			`characters shall be counted for the purpose of the aforementioned XEPs and`
			`any future XEP relying on a similar feature.`
			`</p>`
			`</section1>`

			`<section1 topic='Character counting' anchor='counting'>`
			`<p>`
			`When counting characters in a body, they shall be counted by their`
charcount 0.2: Include feedback/clarifications from list 2020-01-02 08:05:07 -05:00			`<strong>number of Unicode code points</strong>. Message bodies must be used`
			`as strings of the XML characters (as defined in §2.2 of &w3xml;). This means`
			`that, i.e. no Unicode normalization may be performed before determining`
			`offsets when receiving or after determining offsets when sending.`
			`Any kind of further body processing shall be performed after counting (e.g.`
			`<tt>/me·</tt><note>The middle dot is used to represent a space character`
			`and is not meant to be taken verbatim.</note> as described in &xep0245; is`
			`always counted as 4 characters without considering the sending user's name).`
			`All references (as defined in §4.1 of &w3xml;) must be counted by their`
			`referenced character(s) and not the reference characters (e.g. the encoded`
			`<tt>&amp;</tt> is counted as one decoded character <tt>&</tt>).`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`</p>`
			`<table caption='Example strings and their counted length'>`
			`<tr>`
			`<th>String</th>`
			`<th>Grapheme cluster</th>`
			`<th>UTF-8 bytes</th>`
			`<th>UTF-16 units (2 bytes)</th>`
			`<th>Code points</th>`
			`</tr>`
			`<tr>`
			`<td>Hello, world!</td>`
			`<td>13</td>`
			`<td>13</td>`
			`<td>13</td>`
			`<td>13</td>`
			`</tr>`
charcount 0.2: Include feedback/clarifications from list 2020-01-02 08:05:07 -05:00			`<tr>`
			`<td>You & Me</td>`
			`<td>8</td>`
			`<td>8</td>`
			`<td>8</td>`
			`<td>8</td>`
			`</tr>`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`<tr>`
			`<td>こんにちは世界</td>`
			`<td>7</td>`
			`<td>21</td>`
			`<td>7</td>`
			`<td>7</td>`
			`</tr>`
			`<tr>`
			`<td>🧛🏾 👨‍👨‍👦‍👦 🇺🇳</td>`
			`<td>`
			`5`
			`<note>`
			`There are spaces between the emojis. You may also perceive this as`
			`more than 5 glyphs if your font or display engine does not support`
			`the required Unicode version.`
			`</note>`
			`</td>`
			`<td>43</td>`
			`<td>21</td>`
			`<td>13</td>`
			`</tr>`
			`</table>`
			`<section2 topic='Illegal offsets' anchor='illegal-offsets'>`
			`<p>`
			`As grapheme clusters may consist of multiple code points, a code point`
			`offset might be illegal if it points inside a grapheme cluster.`
			`</p>`
			`<p>`
			`However, receiving entities SHOULD NOT consider illegal offsets invalid,`
			`as different Unicode versions may have different understanding of what a`
			`grapheme cluster is. Instead, receiving entities may choose one of the`
			`following behaviors:`
			`</p>`
			`<ul>`
			`<li>`
			`Split the grapheme cluster into multiple graphemes. In most cases, this`
			`is closest to the intended behavior. Many font display engines will do`
			`this automatically as needed.`
			`</li>`
			`<li>`
			`When the offset defines the end of a region, include the full grapheme`
			`cluster in the region. Otherwise, take the offset as if it pointed to`
			`the beginning of the grapheme cluster.`
			`</li>`
			`</ul>`
			`</section2>`
			`<section2 topic='Developer notes' anchor='developer-notes'>`
			`<p>`
			`Some programming languages include a string type that operates directly on`
			`Unicode code points. If these types are used, offset numbers can be used`
			`as-is in string operations. Popular examples of such programming languages`
			`are Python and Haskell.`
			`</p>`
			`<p>`
			`Other programming languages include a string type that operates on UTF-16`
			`units. As can be seen in the table above, those match the number of code`
			`points in many cases and thus are sometimes confused to be the same.`
			`Popular examples of such programming languages are C#, Java and`
			`JavaScript.`
			`</p>`
			`<p>`
			`C/C++ includes a wide character and string type. Those behave differently`
			`across platforms and as such should be used with care.`
			`</p>`
			`</section2>`
XEP-0426: Character Counting 0.3.0 Added section about subsequences. 2022-12-27 16:12:17 -05:00			`<section2 topic='Rationale' anchor='rationale'>`
			`<p>`
			`The most obvious way of counting characters is to count them how humans`
			`would. This sounds easy when only having western scripts in mind but becomes`
			`more complicated in other scripts and most importantly is not well-defined`
			`across Unicode versions. New unicode versions regularly added new`
			`possibilities to build grapheme clusters, including from existing code`
			`points. To be forward compatible, counting grapheme clusters, graphemes,`
			`glyphs or similar is thus not an option.`
			`This leaves basically the two options of using the number of code units of`
			`the encoded string or the number of code points.`
			`</p>`
			`<p>`
			`The main advantage of using the code units would be that those are native to`
			`many programming languages, easing the task for developers.`
			`However programming languages do not share a common encoding for their`
			`string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the`
			`internal encoding from the developer and only presents it in code points),`
			`so there is no best pick here.`
			`If one was to choose an encoding, the best choice would be UTF-8, the native`
			`encoding of XMPP. However this makes counting bytes a more complex task for`
			`programming languages that use a different encoding like UTF-16, as strings`
			`would need to be transcoded first.`
			`</p>`
			`<p>`
			`Counting code points has the advantage that offset counts cannot point`
			`inside a code point. This could happen when using code units of any encoding`
			`that may use more than one unit to represent a code point (such as UTF-8 and`
			`UTF-16).`
			`If an offset count points inside a code point, that would be an invalid`
			`offset, raising more uncertainty of the correct behavior in such cases. Most`
			`notably the opportunity of splitting (as it exists for grapheme cluster) is`
			`not an option in that case, because splitting a code point would not create`
			`any usable output.`
			`Counting code points is widely supported in programming languages and can`
			`easily be implemented for encoded strings when not.`
			`The &w3xml; standard also defines a character as a unicode code point, thus`
			`counting code points is equivalent to counting XML characters.`
			`</p>`
			`</section2>`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`</section1>`

XEP-0426: Character Counting 0.3.0 Added section about subsequences. 2022-12-27 16:12:17 -05:00			`<section1 topic='Subsequences' anchor='subsequence'>`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`<p>`
XEP-0426: Character Counting 0.3.0 Added section about subsequences. 2022-12-27 16:12:17 -05:00			`When referencing a subsequence of the characters of a message body, the`
			`begin and end of the subsequence should be provided by two numbers, denoting`
			`the number of characters (counted as described above) before the begin of the`
			`subsequence or before the end of the subsequence, respectively. In other`
			`words, the begin is the index of the first character in the subsequence and`
			`the end is the index following the last character in the subsequence. That`
			`means, if a subsequence covers the full body, its begin should be given as`
			`0 and its end should be given as the number of characters in the body.`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`</p>`
XEP-0426: Character Counting 0.3.0 Added section about subsequences. 2022-12-27 16:12:17 -05:00
			`<section2 topic='Developer notes' anchor='subsequence-developer-notes'>`
			`<p>`
			`Subsequence indexing in various programming languages match the convention`
			`described here. When using Python, the subsequence created by`
			`<tt>body[begin:end]</tt> matches all requirements of this document.`
			`</p>`
			`<p>`
			`Some programming languages define subsequences by offset and length. In`
			`this case, begin matchs the offset while end-begin matches the length.`
			`</p>`
			`</section2>`
			`<section2 topic='Rationale' anchor='subsequence-rationale'>`
			`<p>`
			`The convention for subsequences was choosen because it has three main`
			`advantages: It matches subsequence indexing in various programming`
			`languages, end minus begin of a subsequence equal the length of the`
			`subsequence and the end of the first of two adjacent subsequence matches the`
			`begin of the second one.`
			`</p>`
			`</section2>`
Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`</section1>`

			`<section1 topic='Glossary' anchor='glossary'>`
			`<p>`
			`Unicode terminology used across this document, can be looked up in the`
			`Unicode glossary at <link url='https://www.unicode.org/glossary/'>https://www.unicode.org/glossary/</link>.`
			`</p>`
			`</section1>`

			`<section1 topic='IANA Considerations' anchor='iana'>`
			`<p>This document requires no interaction with &IANA;. </p>`
			`</section1>`

			`<section1 topic='XMPP Registrar Considerations' anchor='registrar'>`
			`<p>This document requires no interaction with &REGISTRAR;.</p>`
			`</section1>`

charcount 0.2: Include feedback/clarifications from list 2020-01-02 08:05:07 -05:00			`<section1 topic='Acknowledgements' anchor='acknowledgements'>`
			`<p>`
			`The author would like to thank Guus der Kinderen, Ralph Meijer, Jonas`
			`Schäfer, Lance Stout and others that provided feedback.`
			`</p>`
			`</section1>`

Add ProtoXEP: Character counting in message bodies 2019-12-15 13:43:03 -05:00			`</xep>`