Roku Developer Program

tim_beynart
Level 7

UTF-16 not supported for XML parsing?

Part of my application has to retrieve a remote XML file and parse it. I ran into an issue with the file not being read correctly.
The file's encoding is "UTF-16 (little-endian)". The correct BOM is in the file (FF FE). But instead of an XML string, the contents of the file are "??<" and nothing else.
If I resave the document as "UTF-8", it parses correctly.
Is this expected? I thought XML parsers were required to handle UTF-16?

Thanks
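
One possible explanation for the "??<" result, offered only as an assumption since the thread does not confirm how the device handles the bytes: if the UTF-16LE data is treated as a NUL-terminated 8-bit string, the two BOM bytes are not valid ASCII and the zero high byte of the first character ends the string. A minimal Python sketch of that behaviour (the helper name as_c_string is hypothetical):

```python
def as_c_string(data: bytes) -> str:
    """Mimic a reader that stops at the first NUL byte and shows
    non-ASCII bytes as '?'. This is an assumed model, not Roku's code."""
    head = data.split(b"\x00", 1)[0]
    return "".join(chr(b) if b < 0x80 else "?" for b in head)


xml_text = '<?xml version="1.0" encoding="UTF-16"?><root/>'
utf16_le = b"\xff\xfe" + xml_text.encode("utf-16-le")  # BOM FF FE + payload
utf8 = xml_text.encode("utf-8")

print(as_c_string(utf16_le))          # ??<
print(as_c_string(utf8) == xml_text)  # True
```

Under this model the UTF-8 copy survives intact because it contains no NUL bytes, which would match the observation that resaving as UTF-8 fixes the problem.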
6 Replies
EnTerr
Level 8

Re: UTF-16 not supported for XML parsing?

Don't use UTF-16 (or its predecessor UCS-2) for file encoding; it's outdated technology from the late 1980s and 1990s. The same goes for UTF-32/UCS-4.

With UTF-8
  • You don't have to care about endianness, like with UTF-16BE vs UTF-16LE.

  • There is no need for a BOM either.

  • And there is no need to declare it with <?xml ... encoding=...?> either: UTF-8 is what a parser assumes when there is no BOM and no declaration.

  • Your files will be smaller (~60% of UTF-16 for worst case HTML, e.g. Hindi/Japanese)

  • It's fully backwards compatible with ASCII

Use UTF-8 and you are golden.
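
To make the endianness, BOM and ASCII-compatibility points above concrete, here is a small Python sketch using only the standard codecs module. The sample string is an arbitrary assumption chosen for illustration.

```python
import codecs

text = "<a/>"  # arbitrary ASCII-only sample

utf8 = text.encode("utf-8")
utf16_le = codecs.BOM_UTF16_LE + text.encode("utf-16-le")
utf16_be = codecs.BOM_UTF16_BE + text.encode("utf-16-be")

# UTF-8 of ASCII text is byte-for-byte the ASCII text: no BOM, no byte order.
print(utf8)      # b'<a/>'

# UTF-16 needs a BOM to tell the two byte orders apart.
print(utf16_le)  # b'\xff\xfe<\x00a\x00/\x00>\x00'
print(utf16_be)  # b'\xfe\xff\x00<\x00a\x00/\x00>'
```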

PS. Apparently >86% of the web uses UTF-8 and <0.1% uses UTF-16. HTML5 seems to have a strong opinion on encodings, saying "Authors should use UTF-8. Conformance checkers may advise authors against using legacy encodings. Authoring tools should default to using UTF-8 for newly-created documents." In particular, it mandates that if <meta charset=...> or <meta http-equiv="Content-Type" content="text/html; charset=..."> is used, the value there MUST be UTF-8.
RokuMarkn
Level 7

Re: UTF-16 not supported for XML parsing?

I agree with everything in EnTerr's post except for one point -- it's possible for UTF-8 files to be larger than the equivalent UTF-16. Characters between U+0800 and U+FFFF are represented by 2 bytes in UTF-16 but 3 bytes in UTF-8. So if the file consists primarily of text using these characters with little markup, the UTF-8 file can be up to 50% larger than the UTF-16 version.

--Mark
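
The 2-byte versus 3-byte point can be checked with a quick Python sketch. The sample strings are assumptions picked for illustration; all three Japanese characters fall in the U+0800 to U+FFFF range.

```python
def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count without BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


print(sizes("日本語"))  # (9, 6)  -> UTF-8 is 50% larger
print(sizes("abc"))     # (3, 6)  -> UTF-8 is half the size
```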
EnTerr
Level 8

Re: UTF-16 not supported for XML parsing?

"RokuMarkn" wrote:
it's possible for UTF-8 files to be larger than the equivalent UTF-16. Characters between U+0800 and U+FFFF are represented by 2 bytes in UTF-16 but 3 bytes in UTF-8. So if the file consists primarily of text using these characters with little markup, the UTF-8 file can be up to 50% larger than the UTF-16 version.

Yes. Possible albeit rather unlikely for a typical (HT|X)ML file.

Making a generally true statement about size gets complicated by special cases, just like the leap-year rule^ beyond mod 4 = 0. I was just adding a clarification above when you posted. That part of the BMP is mostly used for Asian scripts (Chinese/Japanese/Korean/Hindi), and the UTF-8 encoding will be longer than the UTF-16 one "if there are more of these characters than there are ASCII characters". See https://en.wikipedia.org/wiki/UTF-8#Disadvantages_4 .

In practice, UTF-8 HTML will be smaller than the UTF-16 equivalent, as demonstrated by a back-of-the-envelope test by an editor there. For the Japanese and Hindi versions of a Wikipedia article: as HTML, UTF-8 was ~40% smaller than UTF-16; as pure text, UTF-8 was ~15% bigger.

To reach +50% overhead, said Asian text will have to lack any and all of: spaces " ", new lines, numbers, punctuation marks ' " ( ) : , ; - ?! ... - as well as (.+)ML's abundance of <markup lots_of="ASCII"></markup>.
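
The break-even rule quoted above (UTF-8 wins once ASCII characters outnumber the 3-byte ones) can be sketched in Python. The markup sample and the helper name count_by_utf8_width are made up for illustration.

```python
def count_by_utf8_width(text):
    """Count characters that take 1 byte and 3 bytes in UTF-8."""
    one = sum(1 for ch in text if ord(ch) < 0x80)
    three = sum(1 for ch in text if 0x800 <= ord(ch) <= 0xFFFF)
    return one, three


sample = '<p class="note">日本語</p>'  # hypothetical snippet of markup
one, three = count_by_utf8_width(sample)
utf8_len = len(sample.encode("utf-8"))       # 1*one + 3*three
utf16_len = len(sample.encode("utf-16-le"))  # 2*(one + three)

print(one, three)           # 20 3
print(utf8_len, utf16_len)  # 29 46
```

Here a single tag pair already supplies far more ASCII characters than there are Japanese ones, so the UTF-8 form is the smaller of the two.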

(^) I did a reckless thing last year: commented on IQ tests on Quora. Ever since, it has been emailing me weekly IQ test things. Hence I was recently shown this tricky (if unrelated to IQ) question.
Roku Employee

Re: UTF-16 not supported for XML parsing?

"tim_beynart" wrote:
Part of my application has to retrieve a remote XML file and parse it. I ran into an issue with the file not being read correctly.
The file's encoding is "UTF-16 (little-endian)". The correct BOM is in the file (FF FE). But instead of an XML string, the contents of the file are "??<" and nothing else.
If I resave the document as "UTF-8", it parses correctly.
Is this expected? I thought XML parsers were required to handle UTF-16?

Thanks


How are you getting the string data to pass to the XML parser?
I.e. are you using roUrlTransfer GetToString, or roUrlTransfer GetToFile + ReadAsciiFile, or something else?

Adding UTF-16 support for selected purposes is on the request backlog, but is more likely to get approved if there are specific use cases that need it.
belltown
Level 7

Re: UTF-16 not supported for XML parsing?

"tim_beynart" wrote:
I thought XML parsers were required to handle UTF-16?

Correct, according to the W3C Recommendation: Extensible Markup Language (XML) 1.0 (Fifth Edition):


4.3.3 Character Encoding in Entities
...
All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings.

Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark

But the fact that something is designated as MANDATORY in the standards doesn't mean anything as far as Roku is concerned.

If the Roku is not reading the UTF-16 data, then your options are to convert the data to UTF-8, either on your server or on your Roku, depending on how "efficient" you want your implementation to be.
https://github.com/belltown/
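
As a rough sketch of the server-side option, assuming the server can run Python: decode the UTF-16 bytes, rewrite the encoding named in the XML declaration, and re-encode as UTF-8. The function name utf16_xml_to_utf8 is hypothetical, and the regex only handles a declaration at the very start of the document.

```python
import re


def utf16_xml_to_utf8(raw: bytes) -> bytes:
    """Convert a UTF-16 XML document (with BOM) to UTF-8 bytes."""
    # The "utf-16" codec reads the BOM to pick the byte order and drops it.
    # Without a BOM it falls back to the platform's native byte order.
    text = raw.decode("utf-16")
    # Keep the declaration consistent with the new encoding.
    text = re.sub(
        r'^(<\?xml[^>]*?encoding=)(["\'])[^"\']*\2',
        r"\g<1>\g<2>UTF-8\g<2>",
        text,
        count=1,
    )
    return text.encode("utf-8")


source = '<?xml version="1.0" encoding="UTF-16"?><a>é</a>'
raw = b"\xff\xfe" + source.encode("utf-16-le")
converted = utf16_xml_to_utf8(raw)
print(converted.decode("utf-8"))
```

The output can then be served in place of the original file, so the device only ever sees UTF-8.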
EnTerr
Level 8

Re: UTF-16 not supported for XML parsing?

Looking for something unrelated, I ran into this: http://utf8everywhere.org/

Executive summary:
  • "UTF-16 is the worst of both worlds, being both variable length^ and too wide. It exists only for historical reasons and creates a lot of confusion."

  • Always use UTF8 for external representation (wire protocol, files) of text.

  • For historical reasons^^, the internal representation is more complicated. Use UTF8 if you can but beware the API might be dictated by the host platform (e.g. Java, .Net, Qt).


(^) e.g. the koala emoji U+1F428 is represented as 2 UTF-16 "characters" (code units 0xD83D 0xDC28, which read as 0x3DD8 0x28DC if the byte order is swapped, yet another PITA) and string length() will return 2. Asking for the 1st or 2nd character will return a high or low surrogate half, .reverse() creates an "invalid" string, sorting is not lexicographic... things are not rosy.

(^^) Under the (wrong) early belief that all Unicode characters would fit in 16 bits, the early adopters of Unicode - Qt framework (1992), Windows NT 3.1 (1993) and Java (1995) - started using a 2-byte encoding, UCS-2. A couple of years later, the dream of a fixed-width encoding was shattered.
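
The surrogate pair in the first footnote can be derived with the standard UTF-16 arithmetic. A short Python sketch follows; the helper name to_surrogates is made up here, and note that Python itself counts code points, so len() of the emoji is 1 rather than 2.

```python
def to_surrogates(code_point):
    """Split a code point above U+FFFF into (high, low) UTF-16 surrogates."""
    v = code_point - 0x10000
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


koala = "\U0001F428"
high, low = to_surrogates(ord(koala))
print(hex(high), hex(low))  # 0xd83d 0xdc28

encoded_le = koala.encode("utf-16-le")
encoded_be = koala.encode("utf-16-be")
print(encoded_le.hex())      # 3dd828dc
print(encoded_be.hex())      # d83ddc28
print(len(encoded_le) // 2)  # 2 code units for one character
```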