From 7a5405433536812476ae9a3a3d73ada45c443ebe Mon Sep 17 00:00:00 2001
From: Marvin W
+ The most obvious way of counting characters is to count them the way humans + would. This sounds easy with only Western scripts in mind, but it becomes + more complicated in other scripts and, most importantly, is not well-defined + across Unicode versions. New Unicode versions have regularly added new + possibilities to build grapheme clusters, including from existing code + points. To be forward compatible, counting grapheme clusters, graphemes, + glyphs or similar is thus not an option. + This essentially leaves two options: using the number of code units of + the encoded string or the number of code points. +
++ The main advantage of using code units would be that they are native to + many programming languages, easing the task for developers. + However, programming languages do not share a common encoding for their + string type (C/C++ commonly use UTF-8, C#/Java use UTF-16, Python 3 hides the + internal encoding from the developer and only exposes code points), + so there is no single best pick. + If one were to choose an encoding, the best choice would be UTF-8, the native + encoding of XMPP. However, this makes counting bytes a more complex task for + programming languages that use a different encoding such as UTF-16, as strings + would need to be transcoded first. +
++ Counting code points has the advantage that offset counts cannot point + inside a code point. This could happen when using code units of any encoding + that may use more than one unit to represent a code point (such as UTF-8 and + UTF-16). + An offset count that points inside a code point would be an invalid + offset, raising uncertainty about the correct behavior in such cases. Most + notably, splitting (as is possible for grapheme clusters) is + not an option in that case, because splitting a code point would not create + any usable output. + Counting code points is widely supported in programming languages and can + easily be implemented for encoded strings where it is not. + The &w3xml; standard also defines a character as a Unicode code point, thus + counting code points is equivalent to counting XML characters. +
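The difference between the counting options discussed above can be illustrated with a minimal Python sketch. The helper name `count_units` and the sample string are hypothetical and chosen only for illustration; they are not part of the specification.

```python
def count_units(text: str) -> dict:
    """Count the same string in code points, UTF-8 and UTF-16 code units."""
    return {
        # Python 3 strings are sequences of code points.
        "code_points": len(text),
        # One UTF-8 code unit is one byte.
        "utf8_code_units": len(text.encode("utf-8")),
        # One UTF-16 code unit is two bytes; "utf-16-le" avoids adding a BOM.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


# 'a' (U+0061), 'é' (U+00E9) and an emoji outside the BMP (U+1F600)
sample = "a\u00e9\U0001F600"
counts = count_units(sample)
print(counts)
```

The three counts differ for this sample, which is why an offset expressed in code units of one encoding cannot be reused unchanged with another encoding, while a code point count is the same regardless of encoding.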
+- The most obvious way of counting characters is to count them how humans - would. This sounds easy when only having western scripts in mind but becomes - more complicated in other scripts and most importantly is not well-defined - across Unicode versions. New unicode versions regularly added new - possibilities to build grapheme clusters, including from existing code - points. To be forward compatible, counting grapheme clusters, graphemes, - glyphs or similar is thus not an option. - This leaves basically the two options of using the number of code units of - the encoded string or the number of code points. -
-- The main advantage of using the code units would be that those are native to - many programming languages, easing the task for developers. - However programming languages do not share a common encoding for their - string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the - internal encoding from the developer and only presents it in code points), - so there is no best pick here. - If one was to choose an encoding, the best choice would be UTF-8, the native - encoding of XMPP. However this makes counting bytes a more complex task for - programming languages that use a different encoding like UTF-16, as strings - would need to be transcoded first. -
-- Counting code points has the advantage that offset counts cannot point - inside a code point. This could happen when using code units of any encoding - that may use more than one unit to represent a code point (such as UTF-8 and - UTF-16). - If an offset count points inside a code point, that would be an invalid - offset, raising more uncertainty of the correct behavior in such cases. Most - notably the opportunity of splitting (as it exists for grapheme cluster) is - not an option in that case, because splitting a code point would not create - any usable output. - Counting code points is widely supported in programming languages and can - easily be implemented for encoded strings when not. - The &w3xml; standard also defines a character as a unicode code point, thus - counting code points is equivalent to counting XML characters. + When referencing a subsequence of the characters of a message body, the + begin and end of the subsequence should be provided by two numbers, denoting + the number of characters (counted as described above) before the begin of the + subsequence or before the end of the subsequence, respectively. In other + words, the begin is the index of the first character in the subsequence and + the end is the index following the last character in the subsequence. That + means, if a subsequence covers the full body, its begin should be given as + 0 and its end should be given as the number of characters in the body.
+ ++ Subsequence indexing in various programming languages matches the convention + described here. When using Python, the subsequence created by + body[begin:end] matches all requirements of this document. +
++ Some programming languages define subsequences by offset and length. In + this case, begin matches the offset while end-begin matches the length. +
++ The convention for subsequences was chosen because it has three main + advantages: it matches subsequence indexing in various programming + languages, end minus begin of a subsequence equals the length of the + subsequence, and the end of the first of two adjacent subsequences matches the + begin of the second one. +
+
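The begin/end convention described in this patch can be sketched in Python as follows. The function names `subsequence` and `to_offset_length` and the sample body are assumptions made for illustration; the range check is likewise an assumption, since the text does not say how out-of-range values should be handled.

```python
def subsequence(body: str, begin: int, end: int) -> str:
    """Return the code points of body from begin (inclusive) to end (exclusive)."""
    # Assumption: out-of-range values are rejected rather than clamped.
    if not 0 <= begin <= end <= len(body):
        raise ValueError("begin/end out of range")
    return body[begin:end]


def to_offset_length(begin: int, end: int) -> tuple:
    """Convert begin/end to the offset/length form used by some languages."""
    return begin, end - begin


body = "Hi \U0001F600 there"  # 10 code points, one of them outside the BMP

full = subsequence(body, 0, len(body))       # covers the full body
emoji = subsequence(body, 3, 4)              # a single code point
first = subsequence(body, 0, 3)
second = subsequence(body, 3, len(body))     # begins where `first` ends
```

Because `end` is exclusive, `first + second` reassembles the full body, and `end - begin` gives the length of each subsequence in code points.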