Merge pull request #1260 from mar-v-in/xep-0426

XEP-0426: Character Counting 0.3.0
2024-11-24 18:22:24 -05:00 · 2023-01-25 16:58:37 +00:00 · 2023-01-25 16:58:37 +00:00 · 117b74c7f8
commit 117b74c7f8
parent 657e36474f 7a54054335
1 changed files with 79 additions and 44 deletions
--- a/xep-0426.xml
+++ b/xep-0426.xml
@ -13,19 +13,21 @@
  </abstract>
  &LEGALNOTICE;
  <number>0426</number>
-  <status>Deferred</status>
+  <status>Experimental</status>
  <type>Informational</type>
  <sig>Standards</sig>
  <approver>Council</approver>
  <dependencies/>
  <supersedes/>
  <supersededby/>
  <shortname>charcount</shortname>
-  <author>
+  &larma;
-    <firstname>Marvin</firstname>
+  <revision>
-    <surname>Wissfeld</surname>
+    <version>0.3.0</version>
-    <email>xsf@larma.de</email>
+    <date>2022-12-27</date>
-    <jid>jabber@larma.de</jid>
+    <initials>lmw</initials>
-  </author>
+    <remark>Added section about subsequences.</remark>
  </revision>
  <revision>
    <version>0.2.0</version>
    <date>2020-01-02</date>
@ -165,9 +167,7 @@
      across platforms and as such should be used with care.
    </p>
  </section2>
-</section1>
+  <section2 topic='Rationale' anchor='rationale'>
 <section1 topic='Rationale' anchor='rationale'>
    <p>
      The most obvious way of counting characters is to count them how humans
      would. This sounds easy when only having western scripts in mind but becomes
@ -206,6 +206,41 @@
      The &w3xml; standard also defines a character as a unicode code point, thus
      counting code points is equivalent to counting XML characters.
    </p>
  </section2>
 </section1>
 <section1 topic='Subsequences' anchor='subsequence'>
  <p>
    When referencing a subsequence of the characters of a message body, the
    begin and end of the subsequence should be provided by two numbers, denoting
    the number of characters (counted as described above) before the begin of the
    subsequence or before the end of the subsequence, respectively. In other
    words, the begin is the index of the first character in the subsequence and
    the end is the index following the last character in the subsequence. That
    is, if a subsequence covers the full body, its begin should be given as
    0 and its end should be given as the number of characters in the body.
  </p>
  <section2 topic='Developer notes' anchor='subsequence-developer-notes'>
    <p>
      Subsequence indexing in various programming languages matches the convention
      described here. When using Python, the subsequence created by
      <tt>body[begin:end]</tt> matches all requirements of this document.
    </p>
    <p>
      Some programming languages define subsequences by offset and length. In
      this case, begin matches the offset while end-begin matches the length.
    </p>
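    <p>
      As an illustration only, the two conventions above could be sketched in
      Python as follows. The helper names <tt>subsequence</tt>,
      <tt>to_offset_length</tt> and <tt>from_offset_length</tt> are hypothetical
      and not defined by this document; the sketch relies on Python strings being
      indexed by Unicode code points.
    </p>

```python
def subsequence(body: str, begin: int, end: int) -> str:
    """Return the characters from index begin up to, but not including, end.

    Hypothetical helper: Python's str is indexed by code points, so a plain
    slice already follows the begin/end convention.
    """
    if not 0 <= begin <= end <= len(body):
        raise ValueError("expected 0 <= begin <= end <= len(body)")
    return body[begin:end]


def to_offset_length(begin: int, end: int) -> tuple[int, int]:
    """Convert (begin, end) to (offset, length) for offset/length APIs."""
    return begin, end - begin


def from_offset_length(offset: int, length: int) -> tuple[int, int]:
    """Convert (offset, length) back to (begin, end)."""
    return offset, offset + length


# Example body: "Hello, wörld! " followed by a waving-hand emoji (one code point).
body = "Hello, w\u00f6rld! \U0001F44B"

full = subsequence(body, 0, len(body))   # covers the full body
word = subsequence(body, 7, 12)          # the five characters of "wörld"
first = subsequence(body, 0, 7)          # adjacent subsequences share index 7
second = subsequence(body, 7, len(body))
```

    <p>
      In this sketch, the end of <tt>first</tt> equals the begin of
      <tt>second</tt>, so concatenating the two yields the full body again.
    </p>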
  </section2>
  <section2 topic='Rationale' anchor='subsequence-rationale'>
    <p>
      The convention for subsequences was chosen because it has three main
      advantages: it matches subsequence indexing in various programming
      languages, end minus begin of a subsequence equals the length of the
      subsequence, and the end of the first of two adjacent subsequences matches
      the begin of the second one.
    </p>
  </section2>
 </section1>
 <section1 topic='Glossary' anchor='glossary'>