"EnTerr" wrote:
"TheEndless" wrote:
I think you may have it backwards. I suspect that the issue is not with the XML parsing, but rather with the roListScreen's ability (or lack thereof) to display Unicode characters. If the XML were being parsed incorrectly, then it wouldn't display correctly on any screen.
Nope, TheEndless -
I don't have it backwards. See up top, where I said "it comes as ASC(mid(s, .., 1))=8226 on fw5 and as 3 chars with ASC( )=-30, -128, -94 on fw3 (which seems to be the UTF8 encoding of U+2022: E2 80 A2)". That means that if you look at the string in the BS console, character by character, on fw3 you get THREE characters, chr(&hE2), chr(&h80) and chr(&hA2), instead of one, the bullet. You can check that E2 is â and A2 is ¢. The result of reading and parsing the URL is that the strings differ. No, really.
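For anyone who wants to check those numbers off-device, here is a minimal Python sketch. It is used purely as a calculator and calls no Roku API; the variable names are only illustrative. It shows that U+2022 is code point 8226, that its UTF-8 encoding is E2 80 A2, that those bytes read as signed 8-bit values are -30, -128, -94, and that reading them as Latin-1 gives â and ¢ for the first and last byte.

```python
bullet = "\u2022"  # BULLET

# The single code point, matching the value reported on fw5.
code_point = ord(bullet)

# The UTF-8 encoding of that code point, as space-separated hex.
utf8 = bullet.encode("utf-8")
hex_bytes = utf8.hex(" ").upper()

# The same bytes read as signed 8-bit values, matching the fw3 report.
signed = [b - 256 if b > 127 else b for b in utf8]

# What each byte looks like if it is treated as a Latin-1 character.
as_latin1 = utf8.decode("latin-1")

print(code_point, hex_bytes, signed, as_latin1[0], as_latin1[2])
```

This only confirms the encoding arithmetic; it says nothing about which component on the device produces the three-byte form.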
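' Inspect the raw bytes of the parsed text instead of relying on Asc/Chr
xml = CreateObject("roXMLElement")
xml.Parse(xmlWTFData)
ba = CreateObject("roByteArray")
ba.FromAsciiString(xml.whatever.GetText())
? ba
? ba.ToHexString()  ' the same bytes as a hex string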
"TheEndless" wrote:
I still don't think it's the XML parser, but rather a general 3.1 issue, possibly with the roString component. See greubel's post here (which you, incidentally, participated in): viewtopic.php?f=34&t=66183&p=423559#p423559
Also, FWIW, printing to the console is not a valid way to test the result. Obviously, counting the chars is valid, but the output in the console isn't, as I regularly see Unicode characters "print" completely differently in the console while working fine everywhere else in the app.
"belltown" wrote:
If you'd done what I'd suggested above (use an roByteArray instead of Asc, Chr, etc.), you'd see that you do indeed have it backwards.
"destruk" wrote:
Does Roku support WTF-8 encoding? 😉
"belltown" wrote:
The bullet character (U+2022) is a single "character" that occupies more than one "byte" (three in UTF-8, two in UTF-16) whether stored in a file, or a BrightScript string or byte array, or used to determine which glyph from the appropriate font to display on the screen. So the statement "there shouldn't be UTF-8 remaining after a file has been read" doesn't make any sense. How else would you represent a multi-byte character?
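How many bytes the bullet occupies depends on the encoding, which is easy to pin down. A minimal Python sketch (off-device, no Roku API involved; the `sizes` name is just illustrative) that counts the bytes for one bullet in three common encodings:

```python
bullet = "\u2022"  # BULLET, one character

# Byte length of the same single character in each encoding.
# The "-be" variants are used so that no byte-order mark is added.
sizes = {
    "utf-8": len(bullet.encode("utf-8")),
    "utf-16-be": len(bullet.encode("utf-16-be")),
    "utf-32-be": len(bullet.encode("utf-32-be")),
}

print(len(bullet), sizes)
```

Either way it is one character stored as several bytes, so some multi-byte form has to survive after the file is read.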
"belltown" wrote:
Taking a wild guess here, I'd say that internally at some level, Roku/Brightscript stores strings in a UTF-16 encoding. That's probably the easiest way of handling multibyte encodings such as UTF-8 if you're providing functions such as Len, Mid, etc. that need to know about character indices vs byte indices -- provided, of course, that you don't support character sets outside of the Unicode basic multilingual plane, which as far as I can tell the Roku does not (it just seems to ignore them). It might explain why in later firmware versions, Asc returns 8226; it's the 16-bit UTF-16/UCS-2 encoded value for the BULLET (U+2022) code-point. Earlier firmware probably just iterated through each byte when using Mid/Asc, which is why you'd see 3 separate bytes when looking at a multi-byte UTF-8 string.
When those strings get passed to Roku components that display them on the screen, it shouldn't matter how Roku stores them internally; it should convert as necessary -- unless there is a WTF bug as there appears to be in the 3.1 roListScreen component specifically.
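The guess above about byte indices versus character indices can be modeled off-device. The Python sketch below is only an illustration of that guess, not confirmed Roku behavior: `byte_view` and `utf16_units` are hypothetical helpers, the first standing in for "iterate through each byte" and the second for "iterate through 16-bit UTF-16 code units".

```python
def byte_view(s):
    """Model of walking a UTF-8 string one byte at a time."""
    return list(s.encode("utf-8"))


def utf16_units(s):
    """Model of walking a string one 16-bit UTF-16 code unit at a time."""
    raw = s.encode("utf-16-be")  # big-endian, no byte-order mark
    return [int.from_bytes(raw[i:i + 2], "big") for i in range(0, len(raw), 2)]


text = "a\u2022b"
per_byte = byte_view(text)    # the bullet shows up as three separate values
per_unit = utf16_units(text)  # the bullet shows up as the single value 8226

# A character outside the basic multilingual plane needs two code units
# (a surrogate pair), so a 16-bit-per-character model cannot hold it in one slot.
wink = "\U0001F609"
wink_units = [hex(u) for u in utf16_units(wink)]

print(per_byte, per_unit, wink_units)
```

Under this model a three-character string has length 5 when counted in bytes and length 3 when counted in code units, which would be consistent with the different Len/Mid/Asc results reported on the two firmware versions.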