mirror of https://github.com/moparisthebest/xeps (synced 2024-11-21 08:45:04 -05:00)
XEP-0426: Character Counting 0.3.0
Added section about subsequences.
parent d79c8fafb6
commit 7a54054335
123
xep-0426.xml
@@ -13,19 +13,21 @@
</abstract>
&LEGALNOTICE;
<number>0426</number>
<status>Deferred</status>
<status>Experimental</status>
<type>Informational</type>
<sig>Standards</sig>
<approver>Council</approver>
<dependencies/>
<supersedes/>
<supersededby/>
<shortname>charcount</shortname>
<author>
<firstname>Marvin</firstname>
<surname>Wissfeld</surname>
<email>xsf@larma.de</email>
<jid>jabber@larma.de</jid>
</author>
&larma;
<revision>
<version>0.3.0</version>
<date>2022-12-27</date>
<initials>lmw</initials>
<remark>Added section about subsequences.</remark>
</revision>
<revision>
<version>0.2.0</version>
<date>2020-01-02</date>
@@ -165,47 +167,80 @@
across platforms and as such should be used with care.
</p>
</section2>
<section2 topic='Rationale' anchor='rationale'>
<p>
The most obvious way of counting characters is to count them the way humans
would. This sounds easy when only having western scripts in mind but becomes
more complicated in other scripts and most importantly is not well-defined
across Unicode versions. New Unicode versions regularly add new
possibilities to build grapheme clusters, including from existing code
points. To be forward compatible, counting grapheme clusters, graphemes,
glyphs or similar is thus not an option.
This leaves basically the two options of using the number of code units of
the encoded string or the number of code points.
</p>
<p>
The main advantage of using the code units would be that those are native to
many programming languages, easing the task for developers.
However, programming languages do not share a common encoding for their
string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the
internal encoding from the developer and only presents it in code points),
so there is no best pick here.
If one were to choose an encoding, the best choice would be UTF-8, the native
encoding of XMPP. However, this makes counting bytes a more complex task for
programming languages that use a different encoding like UTF-16, as strings
would need to be transcoded first.
</p>
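To illustrate how the counts diverge between encodings, the following Python sketch measures one string in code points, UTF-8 code units and UTF-16 code units. The function name count_units and the sample string are illustrative assumptions and are not part of this specification.

```python
def count_units(text: str) -> dict:
    """Return the length of text measured in three different ways."""
    return {
        # Python 3 strings are sequences of code points
        "code_points": len(text),
        # one UTF-8 code unit is one byte
        "utf8_units": len(text.encode("utf-8")),
        # one UTF-16 code unit is two bytes; "utf-16-le" adds no byte order mark
        "utf16_units": len(text.encode("utf-16-le")) // 2,
    }


# 'a', U+00E9, U+20AC and U+1F600 (a code point outside the BMP)
sample = "a\u00e9\u20ac\U0001F600"
print(count_units(sample))
# {'code_points': 4, 'utf8_units': 10, 'utf16_units': 5}
```

Only the code point count is independent of the encoding chosen by the implementation.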
<p>
Counting code points has the advantage that offset counts cannot point
inside a code point. This could happen when using code units of any encoding
that may use more than one unit to represent a code point (such as UTF-8 and
UTF-16).
If an offset count points inside a code point, that would be an invalid
offset, raising further uncertainty about the correct behavior in such cases. Most
notably the opportunity of splitting (as it exists for grapheme clusters) is
not an option in that case, because splitting a code point would not create
any usable output.
Counting code points is widely supported in programming languages and can
easily be implemented for encoded strings where it is not.
The &w3xml; standard also defines a character as a Unicode code point, thus
counting code points is equivalent to counting XML characters.
</p>
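The problem of offsets pointing inside a code point can be sketched in Python as follows. The helper split_utf8 is a hypothetical name used only for this illustration. It splits at a UTF-8 code unit (byte) offset and reports failure when the offset does not fall on a code point boundary.

```python
def split_utf8(text: str, byte_offset: int):
    """Split text at a UTF-8 code unit (byte) offset.

    Returns both halves as strings, or None if the offset falls inside
    a code point and the halves cannot be decoded.
    """
    data = text.encode("utf-8")
    try:
        head = data[:byte_offset].decode("utf-8")
        tail = data[byte_offset:].decode("utf-8")
    except UnicodeDecodeError:
        return None
    return head, tail


body = "h\u00e9llo"          # U+00E9 occupies two bytes in UTF-8
print(split_utf8(body, 1))   # ('h', 'éllo'): offset on a boundary
print(split_utf8(body, 2))   # None: offset inside U+00E9
print((body[:2], body[2:]))  # ('hé', 'llo'): code point offsets always split cleanly
```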
</section2>
</section1>

<section1 topic='Rationale' anchor='rationale'>
<section1 topic='Subsequences' anchor='subsequence'>
<p>
When referencing a subsequence of the characters of a message body, the
begin and end of the subsequence should be provided by two numbers, denoting
the number of characters (counted as described above) before the begin of the
subsequence or before the end of the subsequence, respectively. In other
words, the begin is the index of the first character in the subsequence and
the end is the index following the last character in the subsequence. That
means, if a subsequence covers the full body, its begin should be given as
0 and its end should be given as the number of characters in the body.
</p>
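A minimal Python sketch of the begin/end convention follows. The function name subsequence and its range check are assumptions made for illustration. The specification itself only defines the meaning of begin and end.

```python
def subsequence(body: str, begin: int, end: int) -> str:
    """Return the characters of body from index begin up to, but excluding, end."""
    # begin must lie in 0..end and end must lie in 0..len(body)
    if begin not in range(end + 1) or end not in range(len(body) + 1):
        raise ValueError("begin/end out of range")
    # Python 3 indexes strings by code point, matching the counting rule above
    return body[begin:end]


body = "na\u00efve \U0001F600 text"   # 12 code points
print(subsequence(body, 0, len(body)) == body)   # True: full body
print(subsequence(body, 0, 5))                   # naïve
print(subsequence(body, 6, 7) == "\U0001F600")   # True: one code point outside the BMP
```

Two adjacent subsequences share a boundary value: subsequence(body, 0, 5) followed by subsequence(body, 5, 12) reproduces the whole body.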

<section2 topic='Developer notes' anchor='subsequence-developer-notes'>
<p>
Subsequence indexing in various programming languages matches the convention
described here. When using Python, the subsequence created by
<tt>body[begin:end]</tt> matches all requirements of this document.
</p>
<p>
Some programming languages define subsequences by offset and length. In
this case, begin matches the offset while end-begin matches the length.
</p>
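The conversion between the two conventions can be sketched in Python as shown below. All three function names are hypothetical. The utf16_index helper is an additional assumption: it shows one way an implementation whose strings are indexed in UTF-16 code units might translate a code point index, and it presumes the body contains no lone surrogates (which are not valid XML characters).

```python
def to_offset_length(begin: int, end: int) -> tuple:
    """Convert a (begin, end) pair into an (offset, length) pair."""
    return begin, end - begin


def from_offset_length(offset: int, length: int) -> tuple:
    """Convert an (offset, length) pair back into a (begin, end) pair."""
    return offset, offset + length


def utf16_index(body: str, index: int) -> int:
    """Translate a code point index into a UTF-16 code unit index."""
    # Count the UTF-16 code units (two bytes each) before the given code point
    return len(body[:index].encode("utf-16-le")) // 2


body = "\U0001F600 ok"                  # 4 code points, 5 UTF-16 code units
begin, end = 2, 4                       # selects "ok"
print(to_offset_length(begin, end))     # (2, 2)
print(from_offset_length(2, 2))         # (2, 4)
print(utf16_index(body, begin), utf16_index(body, end))  # 3 5
```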
</section2>
<section2 topic='Rationale' anchor='subsequence-rationale'>
<p>
The convention for subsequences was chosen because it has three main
advantages: It matches subsequence indexing in various programming
languages, end minus begin of a subsequence equals the length of the
subsequence, and the end of the first of two adjacent subsequences matches the
begin of the second one.
</p>
</section2>
</section1>

<section1 topic='Glossary' anchor='glossary'>