It's possible for a UTF-8 file to be larger than the equivalent UTF-16 file. Characters between U+0800 and U+FFFF take 2 bytes in UTF-16 but 3 bytes in UTF-8. So if the file consists primarily of text using these characters, with little markup, the UTF-8 file can be up to 50% larger than the UTF-16 version.
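To make the size difference concrete, here is a small sketch that encodes the same strings both ways and compares byte counts. The helper name `encoded_sizes` and the sample strings are made up for illustration; the BOM is left out of the UTF-16 count by using the `utf-16-le` codec.

```python
def encoded_sizes(text):
    """Return (utf8_bytes, utf16_bytes) for text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))

# Three characters, all in the U+0800..U+FFFF range:
# 3 bytes each in UTF-8, 2 bytes each in UTF-16.
cjk = "\u65e5\u672c\u8a9e"

# Pure ASCII markup: 1 byte each in UTF-8, 2 bytes each in UTF-16.
markup = "<root/>"

cjk_sizes = encoded_sizes(cjk)
markup_sizes = encoded_sizes(markup)

print("CJK text (utf-8, utf-16):", cjk_sizes)
print("ASCII markup (utf-8, utf-16):", markup_sizes)
```

For markup-heavy or mostly ASCII documents the relationship flips, which is why UTF-8 is usually the smaller choice for XML.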
Part of my application has to retrieve a remote XML file and parse it. I ran into an issue with the file not being read correctly.
The file's encoding is "UTF-16 (little-endian)", and the correct BOM (FF FE) is present in the file. But instead of an XML string, what I get back when I read the file is "??<" and nothing else.
If I resave the document as "UTF-8", it parses correctly.
Is this expected? I thought XML parsers were required to handle UTF-16?
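One plausible explanation for the "??<" symptom, offered as a guess since the question does not say which language or library reads the file: the bytes are being treated as single-byte, NUL-terminated text rather than decoded as UTF-16. The two BOM bytes are not printable ASCII, and the zero byte that follows "<" in little-endian UTF-16 ends the string early. The sketch below reproduces that misreading; `misread_as_c_string` is a hypothetical helper, not part of any real API.

```python
def misread_as_c_string(raw):
    """Imitate code that treats raw bytes as NUL-terminated ASCII text."""
    truncated = raw.split(b"\x00", 1)[0]  # stop at the first zero byte
    # Bytes outside ASCII are shown as '?', as many consoles would do.
    return "".join(chr(b) if b < 128 else "?" for b in truncated)

xml_text = "<root>hello</root>"

# UTF-16 little-endian with the FF FE byte order mark, as in the question.
raw = b"\xff\xfe" + xml_text.encode("utf-16-le")

wrong = misread_as_c_string(raw)
right = raw.decode("utf-16")  # the codec reads the BOM and strips it

print(repr(wrong))
print(repr(right))
```

If this is what is happening, the fix is in how the response is read (as bytes, then decoded or handed to the parser as bytes), not in the XML file itself.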
I thought XML parsers were required to handle UTF-16?
From the XML specification, section 4.3.3, Character Encoding in Entities:
All XML processors MUST be able to read entities in both the UTF-8 and UTF-16 encodings.
Entities encoded in UTF-16 MUST and entities encoded in UTF-8 MAY begin with the Byte Order Mark
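As a rough illustration of the requirement quoted above, the sketch below builds a UTF-16 little-endian document with a BOM in memory and hands the raw bytes to Python's standard `xml.etree.ElementTree` parser. The document content and the helper name `parse_xml_bytes` are assumptions for the example. The important detail is that the parser receives bytes, so it can detect the encoding from the BOM itself instead of receiving text that was already decoded incorrectly.

```python
import codecs
import xml.etree.ElementTree as ET

def parse_xml_bytes(raw):
    """Parse XML from raw bytes, leaving encoding detection to the parser."""
    return ET.fromstring(raw)

# Hypothetical document standing in for the remote file.
doc = '<?xml version="1.0" encoding="UTF-16"?><root><item>caf\u00e9</item></root>'

# FF FE byte order mark followed by little-endian UTF-16 code units.
raw = codecs.BOM_UTF16_LE + doc.encode("utf-16-le")

root = parse_xml_bytes(raw)
item_text = root.find("item").text

print(root.tag, item_text)
```

If the file is fetched over HTTP, the same idea applies: keep the response body as bytes until it reaches the parser.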