diff --git a/inbox/charcount.xml b/inbox/charcount.xml new file mode 100644 index 00000000..36c07288 --- /dev/null +++ b/inbox/charcount.xml @@ -0,0 +1,195 @@ + + + %ents; +]> + + +
+ Character counting in message bodies + + This document describes how to correctly count characters in message bodies. + This is required when referencing a position in the body. + + &LEGALNOTICE; + XXXX + Experimental + Informational + Standards + + + + charcount + + Marvin + Wissfeld + xsf@larma.de + jabber@larma.de + + + 0.0.1 + 2019-12-15 + mw + Initial attempt to finalize the discussions. + +
+ + +

+ Various use cases require referencing a part of the message + body. This has been realized by giving the beginning and end of + the referenced region as offsets from the beginning of the message body. XEPs + doing so include &xep0372; (and &xep0385;, which is derived from it) and &xep0394;.

+

+ For this method, it matters how "characters" in a message body are counted. + While this sounds trivial at first, there are various ways + of doing so in modern font systems. The purpose of this XEP is to define how + characters shall be counted for the purpose of the aforementioned XEPs and + any future XEP relying on a similar feature.

+
+ + +

+ When counting characters in a body, they shall be counted by their + number of Unicode code points. Message bodies must not be + normalized when counting code points. +

+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
| String | Grapheme clusters | UTF-8 bytes | UTF-16 units (2 bytes) | Code points |
| Hello, world! | 13 | 13 | 13 | 13 |
| こんにちは世界 | 7 | 21 | 7 | 7 |
| 🧛🏾 👨‍👨‍👦‍👦 🇺🇳 | 5 (see note) | 43 | 21 | 13 |

Note: There are spaces between the emojis. You may also perceive this as more than 5 glyphs if your font or display engine does not support the required Unicode version.
+ +
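The machine-countable columns of the table can be reproduced with a short script. The following is a minimal sketch in Python 3; the helper name `count_units` and the escape-sequence spelling of the third row are assumptions made for illustration, not part of this document. Grapheme clusters are left out because counting them depends on the Unicode version in use.

```python
def count_units(body: str) -> dict:
    """Return the size of `body` in the three machine-countable units of the table."""
    return {
        "utf8_bytes": len(body.encode("utf-8")),
        # Each UTF-16 code unit is 2 bytes; "utf-16-le" avoids a leading BOM.
        "utf16_units": len(body.encode("utf-16-le")) // 2,
        # A Python 3 str is indexed by code point, so len() is the code point count.
        "code_points": len(body),
    }


# Third table row, spelled with escapes so every code point is visible
# (assumed to match the emoji sequence shown in the table).
EMOJI_ROW = (
    "\U0001F9DB\U0001F3FE"  # vampire + skin tone modifier
    " "
    "\U0001F468\u200D\U0001F468\u200D\U0001F466\u200D\U0001F466"  # family joined by ZWJ
    " "
    "\U0001F1FA\U0001F1F3"  # regional indicators U and N
)

for row in ("Hello, world!", "こんにちは世界", EMOJI_ROW):
    print(count_units(row))
```

Note that no normalization step is applied before counting, in line with the rule above.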

+ As grapheme clusters may consist of multiple code points, a code point + offset might be illegal if it points inside a grapheme cluster. +

+

+ However, receiving entities SHOULD NOT consider illegal offsets invalid, + as different Unicode versions may have different understanding of what a + grapheme cluster is. Instead, receiving entities may choose one of the + following behaviors: +

+
    +
  • + Split the grapheme cluster into multiple graphemes. In most cases, this + is closest to the intended behavior. Many font display engines will do + this automatically as needed.
  • + When the offset defines the end of a region, include the full grapheme + cluster in the region. Otherwise, take the offset as if it pointed to + the beginning of the grapheme cluster.
+
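The second behavior in the list above can be sketched as follows. This is an illustration under stated assumptions, not a prescribed implementation: the function name `snap_region` is hypothetical, and the list of grapheme cluster start offsets (`cluster_starts`) is assumed to come from whatever text segmentation the client already uses, since the Python standard library does not provide one.

```python
from bisect import bisect_left, bisect_right


def snap_region(begin: int, end: int, cluster_starts: list[int], length: int) -> tuple[int, int]:
    """Widen a (begin, end) code point region so it never cuts a grapheme cluster.

    cluster_starts: sorted code point offsets at which a cluster begins (first is 0).
    length: total number of code points in the body.
    """
    # Every cluster start plus the end of the body is a legal cut position.
    boundaries = cluster_starts + [length]
    # Clamp out-of-range offsets instead of failing on them.
    begin = max(0, min(begin, length))
    end = max(begin, min(end, length))
    # Begin offset: move back to the start of the cluster it points into.
    snapped_begin = boundaries[bisect_right(boundaries, begin) - 1]
    # End offset: move forward so the whole cluster is included in the region.
    snapped_end = boundaries[bisect_left(boundaries, end)]
    return snapped_begin, snapped_end


# Cluster starts for the emoji row of the table: vampire (2 code points),
# space, family (7 code points), space, flag (2 code points).
starts = [0, 2, 3, 10, 11]
print(snap_region(1, 5, starts, 13))
```

Offsets that already sit on a cluster boundary are returned unchanged, so well-behaved senders are not affected.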
+ +

+ Some programming languages include a string type that operates directly on + Unicode code points. If these types are used, offset numbers can be used + as-is in string operations. Popular examples of such programming languages + are Python and Haskell. +
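As a brief illustration in Python 3, where `str` is indexed by code point, offsets can be passed straight to slicing. The helper name `extract_region` is hypothetical, and the end offset is treated as exclusive here purely as an assumption; whether it is inclusive or exclusive is defined by the XEP that carries the offsets, not by this document.

```python
def extract_region(body: str, begin: int, end: int) -> str:
    """Return the part of `body` between two code point offsets (end exclusive)."""
    # No transcoding or index translation is needed: str indices are code points.
    return body[begin:end]


# "I", space, heavy black heart, space, vampire emoji, "!"
body = "I \u2764 \U0001F9DB!"
print(extract_region(body, 4, 5))
```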

+

+ Other programming languages include a string type that operates on UTF-16 + units. As can be seen in the table above, the number of UTF-16 units matches the number of code + points in many cases, so the two are sometimes mistaken for the same thing. + They differ for every code point above U+FFFF, which takes two UTF-16 units. + Popular examples of such programming languages are C#, Java and + JavaScript.
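For string types indexed by UTF-16 units, offsets have to be translated before use. The sketch below shows the arithmetic in Python for consistency with the other examples; the function names are hypothetical, and an implementation in a UTF-16 based language would walk its native string and check for surrogates instead.

```python
def code_point_to_utf16_offset(body: str, offset: int) -> int:
    """Translate a code point offset into the matching UTF-16 code unit offset."""
    # Code points above U+FFFF are stored as a surrogate pair, i.e. two units.
    return sum(2 if ord(ch) > 0xFFFF else 1 for ch in body[:offset])


def utf16_to_code_point_offset(body: str, unit_offset: int) -> int:
    """Translate a UTF-16 code unit offset back into a code point offset.

    A unit offset that falls inside a surrogate pair is rounded up to the
    next code point boundary.
    """
    units = 0
    for index, ch in enumerate(body):
        if units >= unit_offset:
            return index
        units += 2 if ord(ch) > 0xFFFF else 1
    return len(body)


body = "a\U0001F9DBb"  # "a", vampire emoji (outside the BMP), "b"
print(code_point_to_utf16_offset(body, 2))
print(utf16_to_code_point_offset(body, 3))
```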

+

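+ C and C++ include a wide character type and a wide string type. Those behave differently + across platforms (for example, wchar_t is 16 bits wide on Windows but 32 bits wide on most Unix-like systems) + and as such should be used with care.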

+
+
+ + +

+ The most obvious way of counting characters is to count them the way humans + would. This sounds easy when only having Western scripts in mind but becomes + more complicated in other scripts and, most importantly, is not well-defined + across Unicode versions. New Unicode versions have regularly added new + possibilities to build grapheme clusters, including from existing code + points. To be forward compatible, counting grapheme clusters, graphemes, + glyphs or similar is thus not an option. + This leaves two options: using the number of code units of + the encoded string or the number of code points.

+

+ The main advantage of using code units would be that those are native to + many programming languages, easing the task for developers. + However, programming languages do not share a common encoding for their + string type (C/C++ byte strings commonly hold UTF-8, C#/Java use UTF-16, Python 3 hides the + internal encoding from the developer and only exposes code points), + so there is no best pick here. + If one were to choose an encoding, the best choice would be UTF-8, the native + encoding of XMPP. However, this makes counting bytes a more complex task for + programming languages that use a different encoding like UTF-16, as strings + would need to be transcoded first.

+

+ Counting code points has the advantage that offset counts cannot point + inside a code point. This could happen when using code units of any encoding + that may use more than one unit to represent a code point (such as UTF-8 and + UTF-16). + If an offset count pointed inside a code point, that would be an invalid + offset, raising further uncertainty about the correct behavior in such cases. Most + notably, the option of splitting (as it exists for grapheme clusters) is + not available in that case, because splitting a code point would not create + any usable output. + Counting code points is widely supported in programming languages and can + easily be implemented for encoded strings where it is not.
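To illustrate the last point, code points can be counted directly on encoded data without decoding it. The following is a minimal sketch that assumes well-formed input; the function names are hypothetical and it is written in Python only for consistency with the other examples.

```python
def count_code_points_utf8(data: bytes) -> int:
    """Count code points in well-formed UTF-8 without decoding it."""
    # Continuation bytes have the form 0b10xxxxxx; every other byte starts a code point.
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


def count_code_points_utf16(units: list[int]) -> int:
    """Count code points in a well-formed sequence of UTF-16 code units."""
    # Low (trailing) surrogates 0xDC00-0xDFFF are the second half of a pair,
    # so skipping them counts each surrogate pair exactly once.
    return sum(1 for unit in units if not 0xDC00 <= unit <= 0xDFFF)


sample = "こんにちは世界"
print(count_code_points_utf8(sample.encode("utf-8")))
```

Both loops are a single pass over the data, so the cost is linear in the length of the encoded body.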

+
+ + +

+ Unicode terminology used across this document can be looked up in the + Unicode glossary at https://www.unicode.org/glossary/.

+
+ + +

This document requires no interaction with &IANA;.

+
+ + +

This document requires no interaction with &REGISTRAR;.

+
+ +