Merge branch 'feature/xep-0426'

2024-11-24 02:02:16 -05:00 · 2020-01-02 17:54:52 +01:00 · c3eba8e927
commit c3eba8e927
parent b837b325d7 60581844da
1 changed files with 38 additions and 6 deletions
--- a/xep-0426.xml
+++ b/xep-0426.xml
@@ -26,6 +26,12 @@
    <email>xsf@larma.de</email>
    <jid>jabber@larma.de</jid>
  </author>
  <revision>
    <version>0.2.0</version>
    <date>2020-01-02</date>
    <initials>mw</initials>
    <remark>Include feedback/clarifications from list.</remark>
  </revision>
  <revision>
    <version>0.1.0</version>
    <date>2019-12-26</date>
@@ -43,12 +49,13 @@
 <section1 topic='Introduction' anchor='intro'>
  <p>
    Various use-cases require the possibility to reference a part of the message
-    body. This was realized by providing the offset of the beginning and end of
+    body or a specific position in it. This was realized by providing offsets
-    the referenced region as offset from the beginning of the message. XEPs
+    from the beginning of the message (when referencing a region, those offsets
-    doing so include &xep0372; (and thereof derived &xep0385;) and &xep0394;.
+    would define the beginning and end of a region). XEPs doing so include &xep0301;,
    &xep0372; (and thereof derived &xep0385;) and &xep0394;.
  </p>
  <p>
-    For this method, it is highly relevant to decide how to count "characters"
+    For these use-cases, it is highly relevant to decide how to count "characters"
    in a message body. While it at first sounds trivial, there are various ways
    of doing so in modern font systems. The purpose of this XEP is to define how
    characters shall be counted for the purpose of the aforementioned XEPs and
@@ -59,8 +66,17 @@
 <section1 topic='Character counting' anchor='counting'>
  <p>
    When counting characters in a body, they shall be counted by their
-    <strong>number of Unicode code points</strong>. Message bodies must not be
+    <strong>number of Unicode code points</strong>. Message bodies must be used
-    normalized when counting code points.
+    as strings of the XML characters (as defined in §2.2 of &w3xml;). This means
    that, e.g., no Unicode normalization may be performed before determining
    offsets when receiving or after determining offsets when sending.
    Any kind of further body processing shall be performed after counting (e.g.
    <tt>/me·</tt><note>The middle dot is used to represent a space character
    and is not meant to be taken verbatim.</note> as described in &xep0245; is
    always counted as 4 characters without considering the sending user's name).
    All references (as defined in §4.1 of &w3xml;) must be counted by their
    referenced character(s) and not the reference characters (e.g. the encoded
    <tt>&amp;amp;</tt> is counted as one decoded character <tt>&amp;</tt>).
  </p>
  <table caption='Example strings and their counted length'>
    <tr>
@@ -77,6 +93,13 @@
      <td>13</td>
      <td>13</td>
    </tr>
    <tr>
      <td>You &amp; Me</td>
      <td>8</td>
      <td>8</td>
      <td>8</td>
      <td>8</td>
    </tr>
    <tr>
      <td>こんにちは世界</td>
      <td>7</td>
@@ -180,6 +203,8 @@
    any usable output.
    Counting code points is widely supported in programming languages and can
    easily be implemented for encoded strings when not.
    The &w3xml; standard also defines a character as a Unicode code point, thus
    counting code points is equivalent to counting XML characters.
  </p>
 </section1>
@@ -198,4 +223,11 @@
  <p>This document requires no interaction with &REGISTRAR;.</p>
 </section1>
 <section1 topic='Acknowledgements' anchor='acknowledgements'>
  <p>
    The author would like to thank Guus der Kinderen, Ralph Meijer, Jonas
    Schäfer, Lance Stout and others who provided feedback.
  </p>
 </section1>
 </xep>
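
The counting rule introduced in the 'Character counting' hunk can be illustrated with a minimal Python sketch. It is not part of the commit or of the XEP. The helper names `body_from_xml` and `count_code_points` are hypothetical, and the sketch assumes the body arrives as a small, trusted XML fragment. Python's `str` is already a sequence of code points, so `len()` gives the count directly once XML references have been decoded.

```python
import unicodedata
import xml.etree.ElementTree as ET


def body_from_xml(fragment: str) -> str:
    """Parse a <body/> fragment so references such as &amp; are decoded.

    Hypothetical helper: a real client would take the text from its
    stanza parser rather than re-parse a string.
    """
    return ET.fromstring(fragment).text or ""


def count_code_points(body: str) -> int:
    """Count Unicode code points without normalizing the body first."""
    return len(body)


# References are counted by the character they reference, not by their
# encoded form: "&amp;" contributes one code point.
you_and_me = body_from_xml("<body>You &amp; Me</body>")
print(count_code_points(you_and_me))  # 8

# Further body processing happens after counting: "/me " is 4 characters.
print(count_code_points("/me "))  # 4

# Normalization would change offsets, which is why it is not allowed
# before counting: "e" + COMBINING ACUTE ACCENT is two code points,
# while its NFC form is one.
decomposed = "e\u0301"
print(count_code_points(decomposed))  # 2
print(count_code_points(unicodedata.normalize("NFC", decomposed)))  # 1
```

In languages whose native strings are UTF-16 or UTF-8 code units, the equivalent count has to iterate over code points instead of using the raw string length.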