XEP-0426: Character Counting 0.3.0

Added section about subsequences.
2024-11-21 08:45:04 -05:00 · 2022-12-27 22:12:17 +01:00 · 2022-12-27 22:12:17 +01:00 · 7a54054335
commit 7a54054335
parent d79c8fafb6
1 changed files with 79 additions and 44 deletions
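The begin/end convention introduced by the new "Subsequences" section can be illustrated with a minimal Python sketch. It relies on Python `str` values being indexed by Unicode code points, which is the counting unit the document's rationale refers to. The helper names `subsequence`, `to_offset_length`, and `from_offset_length` are hypothetical and chosen only for this illustration; they are not part of the XEP.

```python
def subsequence(body: str, begin: int, end: int) -> str:
    """Return the characters from index `begin` up to, but not including, `end`.

    Both indices count Unicode code points, which is how Python indexes `str`.
    """
    if not 0 <= begin <= end <= len(body):
        raise ValueError("begin and end must satisfy 0 <= begin <= end <= len(body)")
    return body[begin:end]


def to_offset_length(begin: int, end: int) -> tuple[int, int]:
    """Convert (begin, end) to the (offset, length) form used by some languages."""
    return begin, end - begin


def from_offset_length(offset: int, length: int) -> tuple[int, int]:
    """Convert (offset, length) back to (begin, end)."""
    return offset, offset + length


# Hypothetical message body containing one character outside the BMP.
body = "Hi \U0001F600 there"

# A subsequence covering the full body has begin 0 and end len(body).
full = subsequence(body, 0, len(body))

# Two adjacent subsequences: the end of the first is the begin of the second.
first = subsequence(body, 0, 3)
second = subsequence(body, 3, len(body))
```

Because the end of one subsequence equals the begin of the next, `first + second` reproduces `body` without any index arithmetic, and `to_offset_length(3, 4)` gives the offset/length pair for the single emoji character.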
--- a/xep-0426.xml
+++ b/xep-0426.xml
@@ -13,19 +13,21 @@
  </abstract>
  &LEGALNOTICE;
  <number>0426</number>
-  <status>Deferred</status>
+  <status>Experimental</status>
  <type>Informational</type>
  <sig>Standards</sig>
+  <approver>Council</approver>
  <dependencies/>
  <supersedes/>
  <supersededby/>
  <shortname>charcount</shortname>
-  <author>
-    <firstname>Marvin</firstname>
-    <surname>Wissfeld</surname>
-    <email>xsf@larma.de</email>
-    <jid>jabber@larma.de</jid>
-  </author>
+  &larma;
+  <revision>
+    <version>0.3.0</version>
+    <date>2022-12-27</date>
+    <initials>lmw</initials>
+    <remark>Added section about subsequences.</remark>
+  </revision>
  <revision>
    <version>0.2.0</version>
    <date>2020-01-02</date>
@@ -165,9 +167,7 @@
      across platforms and as such should be used with care.
    </p>
  </section2>
-</section1>
-
-<section1 topic='Rationale' anchor='rationale'>
+  <section2 topic='Rationale' anchor='rationale'>
    <p>
      The most obvious way of counting characters is to count them how humans
      would. This sounds easy when only having western scripts in mind but becomes
@@ -206,6 +206,41 @@
      The &w3xml; standard also defines a character as a unicode code point, thus
      counting code points is equivalent to counting XML characters.
    </p>
+  </section2>
+</section1>
+
+<section1 topic='Subsequences' anchor='subsequence'>
+  <p>
+    When referencing a subsequence of the characters of a message body, the
+    begin and end of the subsequence should be provided by two numbers, denoting
+    the number of characters (counted as described above) before the begin of the
+    subsequence or before the end of the subsequence, respectively. In other
+    words, the begin is the index of the first character in the subsequence and
+    the end is the index following the last character in the subsequence. That 
+    means, if a subsequence covers the full body, its begin should be given as 
+    0 and its end should be given as the number of characters in the body.
+  </p>
+
+  <section2 topic='Developer notes' anchor='subsequence-developer-notes'>
+    <p>
+      Subsequence indexing in various programming languages matches the convention
+      described here. When using Python, the subsequence created by
+      <tt>body[begin:end]</tt> matches all requirements of this document.
+    </p>
+    <p>
+      Some programming languages define subsequences by offset and length. In
+      this case, begin matches the offset while end - begin matches the length.
+    </p>
+  </section2>
+  <section2 topic='Rationale' anchor='subsequence-rationale'>
+    <p>
+      The convention for subsequences was chosen because it has three main
+      advantages: it matches subsequence indexing in various programming
+      languages, end minus begin of a subsequence equals the length of the
+      subsequence, and the end of the first of two adjacent subsequences matches
+      the begin of the second one.
+    </p>
+  </section2>
 </section1>

 <section1 topic='Glossary' anchor='glossary'>