Merge pull request #1260 from mar-v-in/xep-0426

XEP-0426: Character Counting 0.3.0
2024-11-24 18:22:24 -05:00 · 2023-01-25 16:58:37 +00:00 · 2023-01-25 16:58:37 +00:00 · 117b74c7f8
commit 117b74c7f8
parent 657e36474f 7a54054335
1 changed files with 79 additions and 44 deletions
--- a/xep-0426.xml
+++ b/xep-0426.xml
@@ -13,19 +13,21 @@
  </abstract>
  &LEGALNOTICE;
  <number>0426</number>
-  <status>Deferred</status>
+  <status>Experimental</status>
  <type>Informational</type>
  <sig>Standards</sig>
  <approver>Council</approver>
  <dependencies/>
  <supersedes/>
  <supersededby/>
  <shortname>charcount</shortname>
-  <author>
+  &larma;
-    <firstname>Marvin</firstname>
+  <revision>
-    <surname>Wissfeld</surname>
+    <version>0.3.0</version>
-    <email>xsf@larma.de</email>
+    <date>2022-12-27</date>
-    <jid>jabber@larma.de</jid>
+    <initials>lmw</initials>
-  </author>
+    <remark>Added section about subsequences.</remark>
  </revision>
  <revision>
    <version>0.2.0</version>
    <date>2020-01-02</date>
@@ -165,47 +167,80 @@
      across platforms and as such should be used with care.
    </p>
  </section2>
  <section2 topic='Rationale' anchor='rationale'>
    <p>
      The most obvious way of counting characters is to count them how humans
      would. This sounds easy when only having western scripts in mind but becomes
      more complicated in other scripts and most importantly is not well-defined
      across Unicode versions. New unicode versions regularly added new
      possibilities to build grapheme clusters, including from existing code
      points. To be forward compatible, counting grapheme clusters, graphemes,
      glyphs or similar is thus not an option.
      This leaves basically the two options of using the number of code units of
      the encoded string or the number of code points.
    </p>
    <p>
      The main advantage of using the code units would be that those are native to
      many programming languages, easing the task for developers.
      However programming languages do not share a common encoding for their
      string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the
      internal encoding from the developer and only presents it in code points),
      so there is no best pick here.
      If one was to choose an encoding, the best choice would be UTF-8, the native
      encoding of XMPP. However this makes counting bytes a more complex task for
      programming languages that use a different encoding like UTF-16, as strings
      would need to be transcoded first.
    </p>
    <p>
      Counting code points has the advantage that offset counts cannot point
      inside a code point. This could happen when using code units of any encoding
      that may use more than one unit to represent a code point (such as UTF-8 and
      UTF-16).
      If an offset count points inside a code point, that would be an invalid
      offset, raising more uncertainty of the correct behavior in such cases. Most
      notably the opportunity of splitting (as it exists for grapheme cluster) is
      not an option in that case, because splitting a code point would not create
      any usable output.
      Counting code points is widely supported in programming languages and can
      easily be implemented for encoded strings when not.
      The &w3xml; standard also defines a character as a unicode code point, thus
      counting code points is equivalent to counting XML characters.
    </p>
  </section2>
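Review note: the difference between code units and code points described in this rationale can be illustrated with a minimal Python sketch. The function names are hypothetical and are not part of the XEP. The sketch assumes the body is already available as a Python `str` (which is indexed by code point) or as well-formed UTF-8 bytes.

```python
def count_code_points(body: str) -> int:
    # Python 3 strings are sequences of code points, so len() is the count.
    return len(body)


def count_utf8_units(body: str) -> int:
    # UTF-8 code units are bytes.
    return len(body.encode("utf-8"))


def count_utf16_units(body: str) -> int:
    # UTF-16 code units are 16 bits wide; encode without a BOM and halve.
    return len(body.encode("utf-16-le")) // 2


def count_code_points_utf8(data: bytes) -> int:
    # For well-formed UTF-8, every code point has exactly one byte that is
    # not a continuation byte (continuation bytes look like 0b10xxxxxx).
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


sample = "a\u00e9\U0001F600"  # 'a', 'é', and an emoji outside the BMP
print(count_code_points(sample), count_utf8_units(sample), count_utf16_units(sample))
```

The last helper shows how code points might be counted directly on an encoded string without transcoding it first, which is the case the rationale refers to when a language offers no built-in code point count.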
 </section1>
-<section1 topic='Rationale' anchor='rationale'>
+<section1 topic='Subsequences' anchor='subsequence'>
  <p>
-    The most obvious way of counting characters is to count them how humans
+    When referencing a subsequence of the characters of a message body, the
-    would. This sounds easy when only having western scripts in mind but becomes
+    begin and end of the subsequence should be provided by two numbers, denoting
-    more complicated in other scripts and most importantly is not well-defined
+    the number of characters (counted as described above) before the begin of the
-    across Unicode versions. New unicode versions regularly added new
+    subsequence or before the end of the subsequence, respectively. In other
-    possibilities to build grapheme clusters, including from existing code
+    words, the begin is the index of the first character in the subsequence and
-    points. To be forward compatible, counting grapheme clusters, graphemes,
+    the end is the index following the last character in the subsequence. That 
-    glyphs or similar is thus not an option.
+    means, if a subsequence covers the full body, its begin should be given as 
-    This leaves basically the two options of using the number of code units of
+    0 and its end should be given as the number of characters in the body.
    the encoded string or the number of code points.
  </p>
  <p>
    The main advantage of using the code units would be that those are native to
    many programming languages, easing the task for developers.
    However programming languages do not share a common encoding for their
    string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the
    internal encoding from the developer and only presents it in code points),
    so there is no best pick here.
    If one was to choose an encoding, the best choice would be UTF-8, the native
    encoding of XMPP. However this makes counting bytes a more complex task for
    programming languages that use a different encoding like UTF-16, as strings
    would need to be transcoded first.
  </p>
  <p>
    Counting code points has the advantage that offset counts cannot point
    inside a code point. This could happen when using code units of any encoding
    that may use more than one unit to represent a code point (such as UTF-8 and
    UTF-16).
    If an offset count points inside a code point, that would be an invalid
    offset, raising more uncertainty of the correct behavior in such cases. Most
    notably the opportunity of splitting (as it exists for grapheme cluster) is
    not an option in that case, because splitting a code point would not create
    any usable output.
    Counting code points is widely supported in programming languages and can
    easily be implemented for encoded strings when not.
    The &w3xml; standard also defines a character as a unicode code point, thus
    counting code points is equivalent to counting XML characters.
  </p>
  <section2 topic='Developer notes' anchor='subsequence-developer-notes'>
    <p>
      Subsequence indexing in various programming languages matches the convention
      described here. When using Python, the subsequence created by
      <tt>body[begin:end]</tt> matches all requirements of this document.
    </p>
    <p>
      Some programming languages define subsequences by offset and length. In
      this case, begin matches the offset while end-begin matches the length.
    </p>
  </section2>
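Review note: the developer notes above can be sketched in Python as follows. All function names are hypothetical. The bounds check and the UTF-16 conversion are assumptions added for illustration and are not requirements stated in the XEP. The conversion helper is meant for handing an index to a platform whose strings are indexed by UTF-16 code units.

```python
def subsequence(body: str, begin: int, end: int) -> str:
    # begin is the index of the first character, end is the index following
    # the last character; both are counted in code points.
    if not 0 <= begin <= end <= len(body):
        raise ValueError("begin/end outside of body")
    return body[begin:end]


def to_offset_length(begin: int, end: int) -> tuple[int, int]:
    # For APIs that take offset and length instead of begin and end.
    return begin, end - begin


def to_utf16_offset(body: str, index: int) -> int:
    # Code points above U+FFFF occupy two UTF-16 code units, all others one.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in body[:index])


body = "I \U0001F600 XMPP"
print(subsequence(body, 4, 8))        # the last four characters
print(to_offset_length(4, 8))         # offset 4, length 4
print(to_utf16_offset(body, 4))       # one unit more than the code point index
```

A subsequence covering the full body is `subsequence(body, 0, len(body))`, matching the convention that begin is 0 and end is the number of characters in the body.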
  <section2 topic='Rationale' anchor='subsequence-rationale'>
    <p>
      The convention for subsequences was chosen because it has three main
      advantages: it matches subsequence indexing in various programming
      languages, end minus begin of a subsequence equals the length of the
      subsequence, and the end of the first of two adjacent subsequences matches
      the begin of the second one.
    </p>
  </section2>
 </section1>
 <section1 topic='Glossary' anchor='glossary'>