From 7a5405433536812476ae9a3a3d73ada45c443ebe Mon Sep 17 00:00:00 2001 From: Marvin W Date: Tue, 27 Dec 2022 22:12:17 +0100 Subject: [PATCH] XEP-0426: Character Counting 0.3.0 Added section about subsequences. --- xep-0426.xml | 123 +++++++++++++++++++++++++++++++++------------------ 1 file changed, 79 insertions(+), 44 deletions(-) diff --git a/xep-0426.xml b/xep-0426.xml index 2e2c0ecf..b9876acf 100644 --- a/xep-0426.xml +++ b/xep-0426.xml @@ -13,19 +13,21 @@ &LEGALNOTICE; 0426 - Deferred + Experimental Informational Standards + Council charcount - - Marvin - Wissfeld - xsf@larma.de - jabber@larma.de - + &larma; + + 0.3.0 + 2022-12-27 + lmw + Added section about subsequences. + 0.2.0 2020-01-02 @@ -165,47 +167,80 @@ across platforms and as such should be used with care.

+ +

+ The most obvious way of counting characters is to count them how humans + would. This sounds easy when only having western scripts in mind but becomes + more complicated in other scripts and most importantly is not well-defined + across Unicode versions. New unicode versions regularly added new + possibilities to build grapheme clusters, including from existing code + points. To be forward compatible, counting grapheme clusters, graphemes, + glyphs or similar is thus not an option. + This leaves basically the two options of using the number of code units of + the encoded string or the number of code points. +

+

+ The main advantage of using the code units would be that those are native to + many programming languages, easing the task for developers. + However programming languages do not share a common encoding for their + string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the + internal encoding from the developer and only presents it in code points), + so there is no best pick here. + If one was to choose an encoding, the best choice would be UTF-8, the native + encoding of XMPP. However this makes counting bytes a more complex task for + programming languages that use a different encoding like UTF-16, as strings + would need to be transcoded first. +

+

+ Counting code points has the advantage that offset counts cannot point + inside a code point. This could happen when using code units of any encoding + that may use more than one unit to represent a code point (such as UTF-8 and + UTF-16). + If an offset count points inside a code point, that would be an invalid + offset, raising more uncertainty of the correct behavior in such cases. Most + notably the opportunity of splitting (as it exists for grapheme cluster) is + not an option in that case, because splitting a code point would not create + any usable output. + Counting code points is widely supported in programming languages and can + easily be implemented for encoded strings when not. + The &w3xml; standard also defines a character as a unicode code point, thus + counting code points is equivalent to counting XML characters. +
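As a minimal illustration of the difference between the counting options discussed above, the following Python sketch compares the number of code points with the number of UTF-8 and UTF-16 code units for the same string. The function name `count_units` is hypothetical and not part of the XEP; the sketch relies only on the fact that Python 3 `str` is indexed by code point.

```python
def count_units(text):
    """Return the size of `text` under three counting schemes."""
    return {
        # Python 3 strings are sequences of code points.
        "code_points": len(text),
        # UTF-8 code units are single bytes.
        "utf8_code_units": len(text.encode("utf-8")),
        # UTF-16 code units are two bytes each; "-le" avoids a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


# "a" (U+0061), "é" (U+00E9) and an emoji outside the BMP (U+1F600).
sample = "a\u00e9\U0001F600"
counts = count_units(sample)
```

For this sample the three counts differ, so an offset expressed in code units of one encoding cannot be reused unchanged in another, whereas the code point count does not depend on the encoding.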

+
- +

- The most obvious way of counting characters is to count them how humans - would. This sounds easy when only having western scripts in mind but becomes - more complicated in other scripts and most importantly is not well-defined - across Unicode versions. New unicode versions regularly added new - possibilities to build grapheme clusters, including from existing code - points. To be forward compatible, counting grapheme clusters, graphemes, - glyphs or similar is thus not an option. - This leaves basically the two options of using the number of code units of - the encoded string or the number of code points. -

-

- The main advantage of using the code units would be that those are native to - many programming languages, easing the task for developers. - However programming languages do not share a common encoding for their - string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the - internal encoding from the developer and only presents it in code points), - so there is no best pick here. - If one was to choose an encoding, the best choice would be UTF-8, the native - encoding of XMPP. However this makes counting bytes a more complex task for - programming languages that use a different encoding like UTF-16, as strings - would need to be transcoded first. -

-

- Counting code points has the advantage that offset counts cannot point - inside a code point. This could happen when using code units of any encoding - that may use more than one unit to represent a code point (such as UTF-8 and - UTF-16). - If an offset count points inside a code point, that would be an invalid - offset, raising more uncertainty of the correct behavior in such cases. Most - notably the opportunity of splitting (as it exists for grapheme cluster) is - not an option in that case, because splitting a code point would not create - any usable output. - Counting code points is widely supported in programming languages and can - easily be implemented for encoded strings when not. - The &w3xml; standard also defines a character as a unicode code point, thus - counting code points is equivalent to counting XML characters. + When referencing a subsequence of the characters of a message body, the + begin and end of the subsequence should be provided by two numbers, denoting + the number of characters (counted as described above) before the begin of the + subsequence or before the end of the subsequence, respectively. In other + words, the begin is the index of the first character in the subsequence and + the end is the index following the last character in the subsequence. That + means, if a subsequence covers the full body, its begin should be given as + 0 and its end should be given as the number of characters in the body.

+ + +

+ Subsequence indexing in various programming languages matches the convention + described here. When using Python, the subsequence created by + body[begin:end] matches all requirements of this document. +

+

+ Some programming languages define subsequences by offset and length. In + this case, begin matches the offset while end-begin matches the length. +
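The two notes above can be sketched in Python as follows. The helper names `to_offset_length` and `from_offset_length` are assumptions made for illustration and are not defined by the XEP; the example body is likewise arbitrary.

```python
def to_offset_length(begin, end):
    """Convert a (begin, end) pair into an (offset, length) pair."""
    return begin, end - begin


def from_offset_length(offset, length):
    """Convert an (offset, length) pair back into a (begin, end) pair."""
    return offset, offset + length


body = "Hello, w\u00f6rld"  # 12 code points, one of them non-ASCII
begin, end = 7, 12           # end is the index after the last character

# Python slicing follows the begin/end convention directly.
subsequence = body[begin:end]

offset, length = to_offset_length(begin, end)
```

Because Python indexes strings by code point, no transcoding step is needed here; a language whose strings are indexed by UTF-16 code units would first have to map code point indices to code unit indices.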

+
+ +

+ The convention for subsequences was chosen because it has three main + advantages: It matches subsequence indexing in various programming + languages, end minus begin of a subsequence equals the length of the + subsequence, and the end of the first of two adjacent subsequences matches the + begin of the second one. +
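The second and third advantages can be checked with a short Python sketch. The function `split_at` is a hypothetical helper, not something the XEP defines; it splits a body into two adjacent subsequences given as (begin, end) pairs.

```python
def split_at(body, index):
    """Split `body` into two adjacent subsequences at `index`.

    Each subsequence is returned as a (begin, end) pair, where `end`
    is the index following the last character of the subsequence.
    """
    first = (0, index)
    second = (index, len(body))
    return first, second


body = "subsequence"
first, second = split_at(body, 3)

# The length of each subsequence is simply end minus begin.
first_length = first[1] - first[0]
second_length = second[1] - second[0]
```

With this convention the two pairs share the boundary index, so neither an off-by-one adjustment nor a separate length field is needed to express that the subsequences are adjacent.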

+