Merge branch 'feature/xep-0426'

2025-02-11 21:00:12 -05:00 · 2020-01-02 17:54:52 +01:00 · c3eba8e927
commit c3eba8e927
parent b837b325d7 60581844da
1 changed file with 38 additions and 6 deletions
--- a/xep-0426.xml
+++ b/xep-0426.xml
@@ -26,6 +26,12 @@
    <email>xsf@larma.de</email>
    <jid>jabber@larma.de</jid>
  </author>
+  <revision>
+    <version>0.2.0</version>
+    <date>2020-01-02</date>
+    <initials>mw</initials>
+    <remark>Include feedback/clarifications from list.</remark>
+  </revision>
  <revision>
    <version>0.1.0</version>
    <date>2019-12-26</date>
@@ -43,12 +49,13 @@
 <section1 topic='Introduction' anchor='intro'>
  <p>
    Various use-cases require the possibility to reference a part of the message
-    body. This was realized by providing the offset of the beginning and end of
-    the referenced region as offset from the beginning of the message. XEPs
-    doing so include &xep0372; (and thereof derived &xep0385;) and &xep0394;.
+    body or a specific position in it. This was realized by providing offsets
+    from the beginning of the message (when referencing a region, those offsets
+    define the beginning and end of that region). XEPs doing so include
+    &xep0301;, &xep0372; (and &xep0385;, which is derived from it) and &xep0394;.
  </p>
  <p>
-    For this method, it is highly relevant to decide how to count "characters"
+    For these use-cases, it is highly relevant to decide how to count "characters"
    in a message body. While it at first sounds trivial, there are various ways
    of doing so in modern font systems. The purpose of this XEP is to define how
    characters shall be counted for the purpose of the aforementioned XEPs and
@@ -59,8 +66,17 @@
 <section1 topic='Character counting' anchor='counting'>
  <p>
    When counting characters in a body, they shall be counted by their
-    <strong>number of Unicode code points</strong>. Message bodies must not be
-    normalized when counting code points.
+    <strong>number of Unicode code points</strong>. Message bodies must be
+    treated as strings of XML characters (as defined in §2.2 of &w3xml;). In
+    particular, no Unicode normalization may be performed before determining
+    offsets when receiving or after determining offsets when sending.
+    Any kind of further body processing shall be performed after counting (e.g.
+    <tt>/me·</tt><note>The middle dot is used to represent a space character
+    and is not meant to be taken verbatim.</note> as described in &xep0245; is
+    always counted as 4 characters without considering the sending user's name).
+    All references (as defined in §4.1 of &w3xml;) must be counted by the
+    character(s) they refer to and not by the characters of the reference itself
+    (e.g. the encoded <tt>&amp;amp;</tt> is counted as the single character <tt>&amp;</tt>).
  </p>
  <table caption='Example strings and their counted length'>
    <tr>
@@ -77,6 +93,13 @@
      <td>13</td>
      <td>13</td>
    </tr>
+    <tr>
+      <td>You &amp; Me</td>
+      <td>8</td>
+      <td>8</td>
+      <td>8</td>
+      <td>8</td>
+    </tr>
    <tr>
      <td>こんにちは世界</td>
      <td>7</td>
@@ -180,6 +203,8 @@
    any usable output.
    Counting code points is widely supported in programming languages and can
    easily be implemented for encoded strings when not.
+    The &w3xml; standard also defines a character as a Unicode code point, so
+    counting code points is equivalent to counting XML characters.
  </p>
 </section1>

@@ -198,4 +223,11 @@
  <p>This document requires no interaction with &REGISTRAR;.</p>
 </section1>

+<section1 topic='Acknowledgements' anchor='acknowledgements'>
+  <p>
+    The author would like to thank Guus der Kinderen, Ralph Meijer, Jonas
+    Schäfer, Lance Stout and others who provided feedback.
+  </p>
+</section1>
+
 </xep>
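
The counting rule added in this revision (count Unicode code points of the decoded XML characters, with no normalization and no further body processing) can be illustrated with a minimal Python sketch. The helper names `counted_length` and `body_from_stanza` are hypothetical and not part of the XEP, and the stanza is simplified: it omits the `jabber:client` namespace that a real client would have to handle.

```python
import unicodedata
from xml.etree import ElementTree


def counted_length(body: str) -> int:
    """Count an already-decoded message body in Unicode code points.

    A Python str is a sequence of code points, so len() gives the count
    directly. The body is deliberately NOT normalized first.
    """
    return len(body)


def body_from_stanza(xml_text: str) -> str:
    """Return the decoded <body/> text of a simplified, namespace-free message.

    Hypothetical helper for illustration only. The XML parser resolves
    references such as &amp; before the text is counted.
    """
    message = ElementTree.fromstring(xml_text)
    body = message.find("body")
    if body is None or body.text is None:
        return ""
    return body.text


# The reference &amp; is counted as the single character it refers to.
stanza = "<message><body>You &amp; Me</body></message>"
print(counted_length(body_from_stanza(stanza)))  # 8

# "/me " is counted as 4 characters, regardless of the sender's name.
print(counted_length("/me "))  # 4

# Normalization changes the count, which is why it must not be applied:
decomposed = "e\u0301"  # "e" followed by a combining acute accent
print(counted_length(decomposed))  # 2
print(counted_length(unicodedata.normalize("NFC", decomposed)))  # 1
```

In languages whose native string length counts UTF-16 code units or bytes instead of code points, the same rule needs an explicit code point iteration rather than the built-in length.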