mirror of
https://github.com/moparisthebest/xeps
synced 2024-11-21 16:55:07 -05:00
XEP-0426: Character Counting 0.3.0
Added section about subsequences.
This commit is contained in:
parent
d79c8fafb6
commit
7a54054335
123
xep-0426.xml
@@ -13,19 +13,21 @@
|
|||||||
</abstract>
|
</abstract>
|
||||||
&LEGALNOTICE;
|
&LEGALNOTICE;
|
||||||
<number>0426</number>
|
<number>0426</number>
|
||||||
<status>Deferred</status>
|
<status>Experimental</status>
|
||||||
<type>Informational</type>
|
<type>Informational</type>
|
||||||
<sig>Standards</sig>
|
<sig>Standards</sig>
|
||||||
|
<approver>Council</approver>
|
||||||
<dependencies/>
|
<dependencies/>
|
||||||
<supersedes/>
|
<supersedes/>
|
||||||
<supersededby/>
|
<supersededby/>
|
||||||
<shortname>charcount</shortname>
|
<shortname>charcount</shortname>
|
||||||
<author>
|
&larma;
|
||||||
<firstname>Marvin</firstname>
|
<revision>
|
||||||
<surname>Wissfeld</surname>
|
<version>0.3.0</version>
|
||||||
<email>xsf@larma.de</email>
|
<date>2022-12-27</date>
|
||||||
<jid>jabber@larma.de</jid>
|
<initials>lmw</initials>
|
||||||
</author>
|
<remark>Added section about subsequences.</remark>
|
||||||
|
</revision>
|
||||||
<revision>
|
<revision>
|
||||||
<version>0.2.0</version>
|
<version>0.2.0</version>
|
||||||
<date>2020-01-02</date>
|
<date>2020-01-02</date>
|
||||||
@@ -165,47 +167,80 @@
|
|||||||
across platforms and as such should be used with care.
|
across platforms and as such should be used with care.
|
||||||
</p>
|
</p>
|
||||||
</section2>
|
</section2>
|
||||||
|
<section2 topic='Rationale' anchor='rationale'>
|
||||||
|
<p>
|
||||||
|
The most obvious way of counting characters is to count them how humans
|
||||||
|
would. This sounds easy when only having western scripts in mind but becomes
|
||||||
|
more complicated in other scripts and most importantly is not well-defined
|
||||||
|
across Unicode versions. New unicode versions regularly added new
|
||||||
|
possibilities to build grapheme clusters, including from existing code
|
||||||
|
points. To be forward compatible, counting grapheme clusters, graphemes,
|
||||||
|
glyphs or similar is thus not an option.
|
||||||
|
This leaves basically the two options of using the number of code units of
|
||||||
|
the encoded string or the number of code points.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
The main advantage of using the code units would be that those are native to
|
||||||
|
many programming languages, easing the task for developers.
|
||||||
|
However programming languages do not share a common encoding for their
|
||||||
|
string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the
|
||||||
|
internal encoding from the developer and only presents it in code points),
|
||||||
|
so there is no best pick here.
|
||||||
|
If one was to choose an encoding, the best choice would be UTF-8, the native
|
||||||
|
encoding of XMPP. However this makes counting bytes a more complex task for
|
||||||
|
programming languages that use a different encoding like UTF-16, as strings
|
||||||
|
would need to be transcoded first.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
Counting code points has the advantage that offset counts cannot point
|
||||||
|
inside a code point. This could happen when using code units of any encoding
|
||||||
|
that may use more than one unit to represent a code point (such as UTF-8 and
|
||||||
|
UTF-16).
|
||||||
|
If an offset count points inside a code point, that would be an invalid
|
||||||
|
offset, raising more uncertainty of the correct behavior in such cases. Most
|
||||||
|
notably the opportunity of splitting (as it exists for grapheme cluster) is
|
||||||
|
not an option in that case, because splitting a code point would not create
|
||||||
|
any usable output.
|
||||||
|
Counting code points is widely supported in programming languages and can
|
||||||
|
easily be implemented for encoded strings when not.
|
||||||
|
The &w3xml; standard also defines a character as a unicode code point, thus
|
||||||
|
counting code points is equivalent to counting XML characters.
|
||||||
|
</p>
|
||||||
|
</section2>
|
||||||
</section1>
|
</section1>
|
||||||
|
|
||||||
<section1 topic='Rationale' anchor='rationale'>
|
<section1 topic='Subsequences' anchor='subsequence'>
|
||||||
<p>
|
<p>
|
||||||
The most obvious way of counting characters is to count them how humans
|
When referencing a subsequence of the characters of a message body, the
|
||||||
would. This sounds easy when only having western scripts in mind but becomes
|
begin and end of the subsequence should be provided by two numbers, denoting
|
||||||
more complicated in other scripts and most importantly is not well-defined
|
the number of characters (counted as described above) before the begin of the
|
||||||
across Unicode versions. New unicode versions regularly added new
|
subsequence or before the end of the subsequence, respectively. In other
|
||||||
possibilities to build grapheme clusters, including from existing code
|
words, the begin is the index of the first character in the subsequence and
|
||||||
points. To be forward compatible, counting grapheme clusters, graphemes,
|
the end is the index following the last character in the subsequence. That
|
||||||
glyphs or similar is thus not an option.
|
means, if a subsequence covers the full body, its begin should be given as
|
||||||
This leaves basically the two options of using the number of code units of
|
0 and its end should be given as the number of characters in the body.
|
||||||
the encoded string or the number of code points.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
The main advantage of using the code units would be that those are native to
|
|
||||||
many programming languages, easing the task for developers.
|
|
||||||
However programming languages do not share a common encoding for their
|
|
||||||
string type (C/C++ use UTF-8, C#/Java use UTF-16, Python 3 hides the
|
|
||||||
internal encoding from the developer and only presents it in code points),
|
|
||||||
so there is no best pick here.
|
|
||||||
If one was to choose an encoding, the best choice would be UTF-8, the native
|
|
||||||
encoding of XMPP. However this makes counting bytes a more complex task for
|
|
||||||
programming languages that use a different encoding like UTF-16, as strings
|
|
||||||
would need to be transcoded first.
|
|
||||||
</p>
|
|
||||||
<p>
|
|
||||||
Counting code points has the advantage that offset counts cannot point
|
|
||||||
inside a code point. This could happen when using code units of any encoding
|
|
||||||
that may use more than one unit to represent a code point (such as UTF-8 and
|
|
||||||
UTF-16).
|
|
||||||
If an offset count points inside a code point, that would be an invalid
|
|
||||||
offset, raising more uncertainty of the correct behavior in such cases. Most
|
|
||||||
notably the opportunity of splitting (as it exists for grapheme cluster) is
|
|
||||||
not an option in that case, because splitting a code point would not create
|
|
||||||
any usable output.
|
|
||||||
Counting code points is widely supported in programming languages and can
|
|
||||||
easily be implemented for encoded strings when not.
|
|
||||||
The &w3xml; standard also defines a character as a unicode code point, thus
|
|
||||||
counting code points is equivalent to counting XML characters.
|
|
||||||
</p>
|
</p>
|
||||||
|
|
||||||
|
<section2 topic='Developer notes' anchor='subsequence-developer-notes'>
|
||||||
|
<p>
|
||||||
|
Subsequence indexing in various programming languages matches the convention
|
||||||
|
described here. When using Python, the subsequence created by
|
||||||
|
<tt>body[begin:end]</tt> matches all requirements of this document.
|
||||||
|
</p>
|
||||||
|
<p>
|
||||||
|
Some programming languages define subsequences by offset and length. In
|
||||||
|
this case, begin matches the offset while end-begin matches the length.
|
||||||
|
</p>
|
||||||
|
</section2>
|
||||||
|
<section2 topic='Rationale' anchor='subsequence-rationale'>
|
||||||
|
<p>
|
||||||
|
The convention for subsequences was chosen because it has three main
|
||||||
|
advantages: It matches subsequence indexing in various programming
|
||||||
|
languages, end minus begin of a subsequence equals the length of the
|
||||||
|
subsequences and the end of the first of two adjacent subsequences matches the
|
||||||
|
begin of the second one.
|
||||||
|
</p>
|
||||||
|
</section2>
|
||||||
</section1>
|
</section1>
|
||||||
|
|
||||||
<section1 topic='Glossary' anchor='glossary'>
|
<section1 topic='Glossary' anchor='glossary'>
|
||||||
|
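Reviewer sketch, not part of the commit: the counting rule and the subsequence convention added in this revision can be illustrated in Python, where `str` is indexed by code point. The helper names `count_characters`, `utf8_code_units`, `utf16_code_units` and `subsequence` are hypothetical and chosen only for this illustration; the bounds check is likewise an assumption, since the text does not say how invalid indices should be handled.

```python
def count_characters(body: str) -> int:
    """Count characters as Unicode code points (what len() returns for str)."""
    return len(body)


def utf8_code_units(body: str) -> int:
    """Number of UTF-8 code units (bytes), shown for comparison only."""
    return len(body.encode("utf-8"))


def utf16_code_units(body: str) -> int:
    """Number of UTF-16 code units (2 bytes each), shown for comparison only."""
    return len(body.encode("utf-16-le")) // 2


def subsequence(body: str, begin: int, end: int) -> str:
    """Return the characters from index begin up to, but excluding, end.

    Rejecting out-of-range indices is an assumption of this sketch.
    """
    if not 0 <= begin <= end <= len(body):
        raise ValueError("begin/end outside of body")
    return body[begin:end]


body = "na\u00efve \U0001F600!"   # contains one non-BMP code point
total = count_characters(body)

full = subsequence(body, 0, total)      # covers the full body
emoji = subsequence(body, 6, 7)         # a single code point
left = subsequence(body, 0, 6)          # end of left == begin of right
right = subsequence(body, 6, total)

# Offset/length style APIs: offset is begin, length is end - begin.
offset, length = 6, 7 - 6
```

The three counts differ for this body, which is why the rationale settles on code points: an offset expressed in UTF-8 or UTF-16 code units could point inside the emoji, while a code point offset cannot.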