mirror of
https://github.com/moparisthebest/xeps
synced 2024-11-22 09:12:19 -05:00
7a54054335
Added section about subsequences.
269 lines
11 KiB
XML
269 lines
11 KiB
XML
<?xml version='1.0' encoding='UTF-8'?>
|
|
<!DOCTYPE xep SYSTEM 'xep.dtd' [
|
|
<!ENTITY % ents SYSTEM 'xep.ent'>
|
|
%ents;
|
|
]>
|
|
<?xml-stylesheet type='text/xsl' href='xep.xsl'?>
|
|
<xep>
|
|
<header>
|
|
<title>Character counting in message bodies</title>
|
|
<abstract>
|
|
This document describes how to correctly count characters in message bodies.
|
|
This is required when referencing a position in the body.
|
|
</abstract>
|
|
&LEGALNOTICE;
|
|
<number>0426</number>
|
|
<status>Experimental</status>
|
|
<type>Informational</type>
|
|
<sig>Standards</sig>
|
|
<approver>Council</approver>
|
|
<dependencies/>
|
|
<supersedes/>
|
|
<supersededby/>
|
|
<shortname>charcount</shortname>
|
|
&larma;
|
|
<revision>
|
|
<version>0.3.0</version>
|
|
<date>2022-12-27</date>
|
|
<initials>lmw</initials>
|
|
<remark>Added section about subsequences.</remark>
|
|
</revision>
|
|
<revision>
|
|
<version>0.2.0</version>
|
|
<date>2020-01-02</date>
|
|
<initials>mw</initials>
|
|
<remark>Include feedback/clarifications from list.</remark>
|
|
</revision>
|
|
<revision>
|
|
<version>0.1.0</version>
|
|
<date>2019-12-26</date>
|
|
<initials>mb</initials>
|
|
<remark>Promote to Experimental as per Council decision.</remark>
|
|
</revision>
|
|
<revision>
|
|
<version>0.0.1</version>
|
|
<date>2019-12-15</date>
|
|
<initials>mw</initials>
|
|
<remark>Initial attempt to finalize the discussions.</remark>
|
|
</revision>
|
|
</header>
|
|
|
|
<section1 topic='Introduction' anchor='intro'>
|
|
<p>
|
|
Various use-cases require the possibility to reference a part of the message
|
|
body or a specific position in it. This was realized by providing offsets
|
|
from the beginning of the message (when referencing a region, those offsets
|
|
would define begin and end of a region). XEPs doing so include &xep0301;,
|
|
&xep0372; (and thereof derived &xep0385;) and &xep0394;.
|
|
</p>
|
|
<p>
|
|
For these use-cases, it is highly relevant to decide how to count "characters"
|
|
in a message body. While it at first sounds trivial, there are various ways
|
|
of doing so in modern font systems. The purpose of this XEP is to define how
|
|
characters shall be counted for the purpose of the aforementioned XEPs and
|
|
any future XEP relying on a similar feature.
|
|
</p>
|
|
</section1>
|
|
|
|
<section1 topic='Character counting' anchor='counting'>
|
|
<p>
|
|
When counting characters in a body, they shall be counted by their
|
|
<strong>number of Unicode code points</strong>. Message bodies must be used
|
|
as strings of the XML characters (as defined in §2.2 of &w3xml;). This means
|
|
that, i.e. no Unicode normalization may be performed before determining
|
|
offsets when receiving or after determining offsets when sending.
|
|
Any kind of further body processing shall be performed after counting (e.g.
|
|
<tt>/me·</tt><note>The middle dot is used to represent a space character
|
|
and is not meant to be taken verbatim.</note> as described in &xep0245; is
|
|
always counted as 4 characters without considering the sending user's name).
|
|
All references (as defined in §4.1 of &w3xml;) must be counted by their
|
|
referenced character(s) and not the reference characters (e.g. the encoded
|
|
<tt>&amp;</tt> is counted as one decoded character <tt>&</tt>).
|
|
</p>
|
|
<table caption='Example strings and their counted length'>
|
|
<tr>
|
|
<th>String</th>
|
|
<th>Grapheme cluster</th>
|
|
<th>UTF-8 bytes</th>
|
|
<th>UTF-16 units (2 bytes)</th>
|
|
<th>Code points</th>
|
|
</tr>
|
|
<tr>
|
|
<td>Hello, world!</td>
|
|
<td>13</td>
|
|
<td>13</td>
|
|
<td>13</td>
|
|
<td>13</td>
|
|
</tr>
|
|
<tr>
|
|
<td>You & Me</td>
|
|
<td>8</td>
|
|
<td>8</td>
|
|
<td>8</td>
|
|
<td>8</td>
|
|
</tr>
|
|
<tr>
|
|
<td>こんにちは世界</td>
|
|
<td>7</td>
|
|
<td>21</td>
|
|
<td>7</td>
|
|
<td>7</td>
|
|
</tr>
|
|
<tr>
|
|
<td>🧛🏾 👨👨👦👦 🇺🇳</td>
|
|
<td>
|
|
5
|
|
<note>
|
|
There are spaces between the emojis. You may also perceive this as
|
|
more than 5 glyphs if your font or display engine does not support
|
|
the required Unicode version.
|
|
</note>
|
|
</td>
|
|
<td>43</td>
|
|
<td>21</td>
|
|
<td>13</td>
|
|
</tr>
|
|
</table>
|
|
<section2 topic='Illegal offsets' anchor='illegal-offsets'>
|
|
<p>
|
|
As grapheme clusters may consist of multiple code points, a code point
|
|
offset might be illegal if it points inside a grapheme cluster.
|
|
</p>
|
|
<p>
|
|
However, receiving entities SHOULD NOT consider illegal offsets invalid,
|
|
as different Unicode versions may have different understanding of what a
|
|
grapheme cluster is. Instead, receiving entities may choose one of the
|
|
following behaviors:
|
|
</p>
|
|
<ul>
|
|
<li>
|
|
Split the grapheme cluster into multiple graphemes. In most cases, this
|
|
is closest to the intended behavior. Many font display engines will do
|
|
this automatically as needed.
|
|
</li>
|
|
<li>
|
|
When the offset defines the end of a region, include the full grapheme
|
|
cluster in the region. Otherwise, take the offset as if it pointed to
|
|
the beginning of the grapheme cluster.
|
|
</li>
|
|
</ul>
|
|
</section2>
|
|
<section2 topic='Developer notes' anchor='developer-notes'>
|
|
<p>
|
|
Some programming languages include a string type that operates directly on
|
|
Unicode code points. If these types are used, offset numbers can be used
|
|
as-is in string operations. Popular examples of such programming languages
|
|
are Python and Haskell.
|
|
</p>
|
|
<p>
|
|
Other programming languages include a string type that operates on UTF-16
|
|
units. As can be seen in the table above, those match the number of code
|
|
points in many cases and thus are sometimes confused to be the same.
|
|
Popular examples of such programming languages are C#, Java and
|
|
JavaScript.
|
|
</p>
|
|
<p>
|
|
C/C++ includes a wide character and string type. Those behave differently
|
|
across platforms and as such should be used with care.
|
|
</p>
|
|
</section2>
|
|
<section2 topic='Rationale' anchor='rationale'>
|
|
<p>
|
|
The most obvious way of counting characters is to count them how humans
|
|
would. This sounds easy when only having western scripts in mind but becomes
|
|
more complicated in other scripts and most importantly is not well-defined
|
|
across Unicode versions. New unicode versions regularly added new
|
|
possibilities to build grapheme clusters, including from existing code
|
|
points. To be forward compatible, counting grapheme clusters, graphemes,
|
|
glyphs or similar is thus not an option.
|
|
This leaves basically the two options of using the number of code units of
|
|
the encoded string or the number of code points.
|
|
</p>
|
|
<p>
|
|
The main advantage of using the code units would be that those are native to
|
|
many programming languages, easing the task for developers.
|
|
However programming languages do not share a common encoding for their
|
|
string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the
|
|
internal encoding from the developer and only presents it in code points),
|
|
so there is no best pick here.
|
|
If one was to choose an encoding, the best choice would be UTF-8, the native
|
|
encoding of XMPP. However this makes counting bytes a more complex task for
|
|
programming languages that use a different encoding like UTF-16, as strings
|
|
would need to be transcoded first.
|
|
</p>
|
|
<p>
|
|
Counting code points has the advantage that offset counts cannot point
|
|
inside a code point. This could happen when using code units of any encoding
|
|
that may use more than one unit to represent a code point (such as UTF-8 and
|
|
UTF-16).
|
|
If an offset count points inside a code point, that would be an invalid
|
|
offset, raising more uncertainty of the correct behavior in such cases. Most
|
|
notably the opportunity of splitting (as it exists for grapheme cluster) is
|
|
not an option in that case, because splitting a code point would not create
|
|
any usable output.
|
|
Counting code points is widely supported in programming languages and can
|
|
easily be implemented for encoded strings when not.
|
|
The &w3xml; standard also defines a character as a unicode code point, thus
|
|
counting code points is equivalent to counting XML characters.
|
|
</p>
|
|
</section2>
|
|
</section1>
|
|
|
|
<section1 topic='Subsequences' anchor='subsequence'>
|
|
<p>
|
|
When referencing a subsequence of the characters of a message body, the
|
|
begin and end of the subsequence should be provided by two numbers, denoting
|
|
the number of characters (counted as described above) before the begin of the
|
|
subsequence or before the end of the subsequence, respectively. In other
|
|
words, the begin is the index of the first character in the subsequence and
|
|
the end is the index following the last character in the subsequence. That
|
|
means, if a subsequence covers the full body, its begin should be given as
|
|
0 and its end should be given as the number of characters in the body.
|
|
</p>
|
|
|
|
<section2 topic='Developer notes' anchor='subsequence-developer-notes'>
|
|
<p>
|
|
Subsequence indexing in various programming languages match the convention
|
|
described here. When using Python, the subsequence created by
|
|
<tt>body[begin:end]</tt> matches all requirements of this document.
|
|
</p>
|
|
<p>
|
|
Some programming languages define subsequences by offset and length. In
|
|
this case, begin matchs the offset while end-begin matches the length.
|
|
</p>
|
|
</section2>
|
|
<section2 topic='Rationale' anchor='subsequence-rationale'>
|
|
<p>
|
|
The convention for subsequences was choosen because it has three main
|
|
advantages: It matches subsequence indexing in various programming
|
|
languages, end minus begin of a subsequence equal the length of the
|
|
subsequence and the end of the first of two adjacent subsequence matches the
|
|
begin of the second one.
|
|
</p>
|
|
</section2>
|
|
</section1>
|
|
|
|
<section1 topic='Glossary' anchor='glossary'>
|
|
<p>
|
|
Unicode terminology used across this document, can be looked up in the
|
|
Unicode glossary at <link url='https://www.unicode.org/glossary/'>https://www.unicode.org/glossary/</link>.
|
|
</p>
|
|
</section1>
|
|
|
|
<section1 topic='IANA Considerations' anchor='iana'>
|
|
<p>This document requires no interaction with &IANA;. </p>
|
|
</section1>
|
|
|
|
<section1 topic='XMPP Registrar Considerations' anchor='registrar'>
|
|
<p>This document requires no interaction with ®ISTRAR;.</p>
|
|
</section1>
|
|
|
|
<section1 topic='Acknowledgements' anchor='acknowledgements'>
|
|
<p>
|
|
The author would like to thank Guus der Kinderen, Ralph Meijer, Jonas
|
|
Schäfer, Lance Stout and others that provided feedback.
|
|
</p>
|
|
</section1>
|
|
|
|
</xep>
|