Merge branch 'feature/xep-0426'

Jonas Schäfer 2020-01-02 17:54:52 +01:00
commit c3eba8e927
1 changed file with 38 additions and 6 deletions


@ -26,6 +26,12 @@
<email>xsf@larma.de</email>
<jid>jabber@larma.de</jid>
</author>
<revision>
<version>0.2.0</version>
<date>2020-01-02</date>
<initials>mw</initials>
<remark>Include feedback/clarifications from list.</remark>
</revision>
<revision>
<version>0.1.0</version>
<date>2019-12-26</date>
@ -43,12 +49,13 @@
<section1 topic='Introduction' anchor='intro'>
<p>
Various use-cases require the possibility to reference a part of the message
body. This was realized by providing the offset of the beginning and end of
the referenced region as offset from the beginning of the message. XEPs
doing so include &xep0372; (and thereof derived &xep0385;) and &xep0394;.
body or a specific position in it. This was realized by providing offsets
from the beginning of the message (when referencing a region, those offsets
would define begin and end of a region). XEPs doing so include &xep0301;,
&xep0372; (and &xep0385;, which is derived from it) and &xep0394;.
</p>
<p>
For this method, it is highly relevant to decide how to count "characters"
For these use-cases, it is highly relevant to decide how to count "characters"
in a message body. While it at first sounds trivial, there are various ways
of doing so in modern font systems. The purpose of this XEP is to define how
characters shall be counted for the purpose of the aforementioned XEPs and
@ -59,8 +66,17 @@
<section1 topic='Character counting' anchor='counting'>
<p>
When counting characters in a body, they shall be counted by their
<strong>number of Unicode code points</strong>. Message bodies must not be
normalized when counting code points.
<strong>number of Unicode code points</strong>. Message bodies must be used
as strings of the XML characters (as defined in §2.2 of &w3xml;). This means
that, for example, no Unicode normalization may be performed before determining
offsets when receiving or after determining offsets when sending.
Any kind of further body processing shall be performed after counting (e.g.
<tt>/me·</tt><note>The middle dot is used to represent a space character
and is not meant to be taken verbatim.</note> as described in &xep0245; is
always counted as 4 characters without considering the sending user's name).
All references (as defined in §4.1 of &w3xml;) must be counted by their
referenced character(s) and not the reference characters (e.g. the encoded
<tt>&amp;amp;</tt> is counted as one decoded character <tt>&amp;</tt>).
</p>
<table caption='Example strings and their counted length'>
<tr>
@ -77,6 +93,13 @@
<td>13</td>
<td>13</td>
</tr>
<tr>
<td>You &amp; Me</td>
<td>8</td>
<td>8</td>
<td>8</td>
<td>8</td>
</tr>
<tr>
<td>こんにちは世界</td>
<td>7</td>
@ -180,6 +203,8 @@
any usable output.
Counting code points is widely supported in programming languages and can
easily be implemented for encoded strings when not.
The &w3xml; standard also defines a character as a Unicode code point, thus
counting code points is equivalent to counting XML characters.
</p>
</section1>
@ -198,4 +223,11 @@
<p>This document requires no interaction with &REGISTRAR;.</p>
</section1>
<section1 topic='Acknowledgements' anchor='acknowledgements'>
<p>
The author would like to thank Guus der Kinderen, Ralph Meijer, Jonas
Schäfer, Lance Stout and others who provided feedback.
</p>
</section1>
</xep>
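
The counting rule added under 'Character counting' can be illustrated with a minimal Python sketch. It is not part of the XEP. The helper names `count_code_points` and `count_code_points_utf8` are hypothetical, and it assumes the body text is read through the standard library's `xml.etree.ElementTree` parser, which resolves references such as `&amp;` before the text is counted.

```python
import unicodedata
import xml.etree.ElementTree as ET


def count_code_points(body: str) -> int:
    """Count Unicode code points; a Python str is a sequence of code points."""
    return len(body)


def count_code_points_utf8(encoded: bytes) -> int:
    """Count code points in well-formed UTF-8 by skipping continuation bytes (10xxxxxx)."""
    return sum(1 for byte in encoded if byte & 0xC0 != 0x80)


# The XML parser resolves references, so "&amp;" is counted as the single character "&".
stanza = ET.fromstring("<message><body>You &amp; Me</body></message>")
body = stanza.findtext("body")
print(count_code_points(body))  # 8

# Counting happens before any further body processing: "/me " is 4 characters.
print(count_code_points("/me "))  # 4

# No normalization before counting: "e" plus a combining acute accent stays 2 code points.
decomposed = "e\u0301"
print(count_code_points(decomposed))  # 2
# Normalizing first would change the count and shift every later offset.
print(count_code_points(unicodedata.normalize("NFC", decomposed)))  # 1

# Counting directly on the encoded form gives the same result as counting on the string.
print(count_code_points_utf8("こんにちは世界".encode("utf-8")))  # 7
```

The second helper reflects the remark that counting code points is easy to implement on encoded strings: in UTF-8, every code point contributes exactly one byte that is not a continuation byte.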