XEP-0426: Character Counting 0.3.0

Added section about subsequences.
2024-11-21 08:45:04 -05:00 · 2022-12-27 22:12:17 +01:00 · 7a54054335
commit 7a54054335
parent d79c8fafb6
1 changed files with 79 additions and 44 deletions
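
Note: the subsequence convention added by this change (begin is the index of the first character, end is the index following the last one, both counted in code points) can be sketched in Python as below. This is a minimal sketch, assuming the message body is held as a Python 3 str, which is indexed by code point. The helper names count_chars, subsequence and to_offset_length and the sample body are hypothetical and not part of the XEP.

```python
def count_chars(body: str) -> int:
    """Count characters as Unicode code points.

    A Python 3 str is indexed by code point, so len() already matches.
    """
    return len(body)


def subsequence(body: str, begin: int, end: int) -> str:
    """Return the characters from index begin up to, but not including, end."""
    if not 0 <= begin <= end <= count_chars(body):
        raise ValueError("begin/end out of range for this body")
    return body[begin:end]


def to_offset_length(begin: int, end: int) -> tuple[int, int]:
    """Convert (begin, end) into (offset, length) for offset/length APIs."""
    return begin, end - begin


# Illustrative body (an assumption, not taken from the XEP): U+1F496 is a
# single code point, but two UTF-16 code units and four UTF-8 bytes.
body = "I \U0001F496 XMPP"
```

With this sample body, subsequence(body, 4, 8) yields "XMPP", and the full body is subsequence(body, 0, 8). Code unit counts differ: the same body has 9 UTF-16 code units and 11 UTF-8 bytes, so offsets taken from those encodings would not line up.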
--- a/xep-0426.xml
+++ b/xep-0426.xml
@@ -13,19 +13,21 @@
  </abstract>
  &LEGALNOTICE;
  <number>0426</number>
-  <status>Deferred</status>
+  <status>Experimental</status>
  <type>Informational</type>
  <sig>Standards</sig>
+  <approver>Council</approver>
  <dependencies/>
  <supersedes/>
  <supersededby/>
  <shortname>charcount</shortname>
-  <author>
-    <firstname>Marvin</firstname>
-    <surname>Wissfeld</surname>
-    <email>xsf@larma.de</email>
-    <jid>jabber@larma.de</jid>
-  </author>
+  &larma;
+  <revision>
+    <version>0.3.0</version>
+    <date>2022-12-27</date>
+    <initials>lmw</initials>
+    <remark>Added section about subsequences.</remark>
+  </revision>
  <revision>
    <version>0.2.0</version>
    <date>2020-01-02</date>
@@ -165,47 +167,80 @@
      across platforms and as such should be used with care.
    </p>
  </section2>
+  <section2 topic='Rationale' anchor='rationale'>
+    <p>
+      The most obvious way of counting characters is to count them the way humans
+      would. This sounds easy when only having western scripts in mind but becomes
+      more complicated in other scripts and, most importantly, is not well-defined
+      across Unicode versions. New Unicode versions have regularly added new
+      possibilities to build grapheme clusters, including from existing code
+      points. To be forward compatible, counting grapheme clusters, graphemes,
+      glyphs or similar is thus not an option.
+      This leaves basically the two options of using the number of code units of
+      the encoded string or the number of code points.
+    </p>
+    <p>
+      The main advantage of using the code units would be that those are native to
+      many programming languages, easing the task for developers.
+      However, programming languages do not share a common encoding for their
+      string type (C/C++ commonly use UTF-8, C#/Java use UTF-16, Python 3 hides the
+      internal encoding from the developer and only presents it in code points),
+      so there is no best pick here.
+      If one were to choose an encoding, the best choice would be UTF-8, the native
+      encoding of XMPP. However, this makes counting bytes a more complex task for
+      programming languages that use a different encoding like UTF-16, as strings
+      would need to be transcoded first.
+    </p>
+    <p>
+      Counting code points has the advantage that offset counts cannot point
+      inside a code point. This could happen when using code units of any encoding
+      that may use more than one unit to represent a code point (such as UTF-8 and
+      UTF-16).
+      If an offset count points inside a code point, that would be an invalid
+      offset, raising more uncertainty about the correct behavior in such cases. Most
+      notably, the opportunity of splitting (as it exists for grapheme clusters) is
+      not an option in that case, because splitting a code point would not create
+      any usable output.
+      Counting code points is widely supported in programming languages and can
+      easily be implemented for encoded strings when not.
+      The &w3xml; standard also defines a character as a Unicode code point, thus
+      counting code points is equivalent to counting XML characters.
+    </p>
+  </section2>
 </section1>

-<section1 topic='Rationale' anchor='rationale'>
+<section1 topic='Subsequences' anchor='subsequence'>
  <p>
-    The most obvious way of counting characters is to count them how humans
-    would. This sounds easy when only having western scripts in mind but becomes
-    more complicated in other scripts and most importantly is not well-defined
-    across Unicode versions. New unicode versions regularly added new
-    possibilities to build grapheme clusters, including from existing code
-    points. To be forward compatible, counting grapheme clusters, graphemes,
-    glyphs or similar is thus not an option.
-    This leaves basically the two options of using the number of code units of
-    the encoded string or the number of code points.
-  </p>
-  <p>
-    The main advantage of using the code units would be that those are native to
-    many programming languages, easing the task for developers.
-    However programming languages do not share a common encoding for their
-    string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the
-    internal encoding from the developer and only presents it in code points),
-    so there is no best pick here.
-    If one was to choose an encoding, the best choice would be UTF-8, the native
-    encoding of XMPP. However this makes counting bytes a more complex task for
-    programming languages that use a different encoding like UTF-16, as strings
-    would need to be transcoded first.
-  </p>
-  <p>
-    Counting code points has the advantage that offset counts cannot point
-    inside a code point. This could happen when using code units of any encoding
-    that may use more than one unit to represent a code point (such as UTF-8 and
-    UTF-16).
-    If an offset count points inside a code point, that would be an invalid
-    offset, raising more uncertainty of the correct behavior in such cases. Most
-    notably the opportunity of splitting (as it exists for grapheme cluster) is
-    not an option in that case, because splitting a code point would not create
-    any usable output.
-    Counting code points is widely supported in programming languages and can
-    easily be implemented for encoded strings when not.
-    The &w3xml; standard also defines a character as a unicode code point, thus
-    counting code points is equivalent to counting XML characters.
+    When referencing a subsequence of the characters of a message body, the
+    begin and end of the subsequence should be provided by two numbers, denoting
+    the number of characters (counted as described above) before the begin of the
+    subsequence or before the end of the subsequence, respectively. In other
+    words, the begin is the index of the first character in the subsequence and
+    the end is the index following the last character in the subsequence. That 
+    means, if a subsequence covers the full body, its begin should be given as 
+    0 and its end should be given as the number of characters in the body.
  </p>
+
+  <section2 topic='Developer notes' anchor='subsequence-developer-notes'>
+    <p>
+      Subsequence indexing in various programming languages matches the convention
+      described here. When using Python, the subsequence created by
+      <tt>body[begin:end]</tt> matches all requirements of this document.
+    </p>
+    <p>
+      Some programming languages define subsequences by offset and length. In
+      this case, begin matches the offset while end-begin matches the length.
+    </p>
+  </section2>
+  <section2 topic='Rationale' anchor='subsequence-rationale'>
+    <p>
+      The convention for subsequences was chosen because it has three main
+      advantages: it matches subsequence indexing in various programming
+      languages, end minus begin of a subsequence equals the length of the
+      subsequence, and the end of the first of two adjacent subsequences matches the
+      begin of the second one.
+    </p>
+  </section2>
 </section1>

 <section1 topic='Glossary' anchor='glossary'>