"RokuMarkn" wrote:
it's possible for UTF-8 files to be larger than the equivalent UTF-16. Characters between U+0800 and U+FFFF are represented by 2 bytes in UTF-16 but 3 bytes in UTF-8. So if the file consists primarily of text using these characters with little markup, the UTF-8 file can be up to 50% larger than the UTF-16 version.
Yes. Possible albeit rather unlikely for a typical (HT|X)ML file.
Making a generally true statement about size gets complicated by special cases, just like the leap-year rule^ beyond mod 4 = 0. I was just adding a clarification above when you posted. That part of the BMP is mostly used for Asian scripts (Chinese/Japanese/Korean/Hindi), and the UTF-8 encoding will be longer than the UTF-16 one "if there are more of these characters than there are ASCII characters". See
https://en.wikipedia.org/wiki/UTF-8#Disadvantages_4 .
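For anyone who wants to check the per-character byte counts, here is a minimal Python sketch. The sample characters and the helper name `encoded_sizes` are my own picks for illustration, and it uses the `utf-16-le` codec so that no byte-order mark gets counted.

```python
# Bytes needed per character in UTF-8 vs UTF-16.
# "utf-16-le" is used so the 2-byte BOM is not included in the count.
samples = {
    "A": "U+0041, ASCII",
    "é": "U+00E9, Latin-1 Supplement",
    "क": "U+0915, Devanagari",
    "日": "U+65E5, CJK ideograph",
    "😀": "U+1F600, outside the BMP",
}

def encoded_sizes(ch):
    """Return (UTF-8 byte count, UTF-16 byte count) for a string."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

for ch, label in samples.items():
    u8, u16 = encoded_sizes(ch)
    print(f"{label}: UTF-8 {u8} bytes, UTF-16 {u16} bytes")
```

Only the U+0800 to U+FFFF range comes out worse in UTF-8 (3 bytes vs 2); ASCII comes out better (1 vs 2), and the other two ranges tie.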
In practice, UTF-8 HTML will be smaller than the UTF-16 equivalent, as demonstrated by a back-of-the-envelope test by an editor there. For the Japanese and Hindi versions of a Wikipedia article: as HTML, UTF-8 was ~40% smaller than UTF-16; as pure text, UTF-8 was ~15% bigger.
To reach the full +50% overhead, said Asian text would have to lack all of the following: spaces " ", newlines, digits, punctuation marks ' " ( ) : , ; - ? ! ... as well as (.+)ML's abundance of <markup lots_of="ASCII"></markup>.
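A quick way to see how fast ASCII dilutes that overhead is the rough Python sketch below. The sample strings and the helper name `size_ratio` are made up for illustration (they are not the Wikipedia test data), and sizes are taken without a BOM.

```python
def size_ratio(text):
    """Return UTF-8 size divided by UTF-16 size (no BOM counted)."""
    return len(text.encode("utf-8")) / len(text.encode("utf-16-le"))

# Hypothetical samples, repeated so the result reflects a larger document.
pure = "日本語" * 100                          # only 3-byte characters
spaced = "日本語 " * 100                       # one ASCII space per three characters
markup = '<p class="intro">日本語</p>' * 100   # short text wrapped in ASCII markup

for name, text in [("pure", pure), ("spaced", spaced), ("markup", markup)]:
    print(f"{name}: UTF-8 is {size_ratio(text):.3f}x the UTF-16 size")
```

A ratio above 1 means UTF-8 is the larger of the two. Only the pure sample hits 1.5; a single space per three characters already pulls it down to 1.25, and the markup sample lands well below 1.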
(^) I did a reckless thing last year: I commented on IQ tests on Quora. Ever since, it's been emailing me IQ-test things weekly. Hence I was recently shown
this tricky (if unrelated to IQ) question.