Roku Community

TheEndless · ‎07-29-2014

"EnTerr" wrote:
"TheEndless" wrote:
I think you may have it backwards. I suspect that the issue is not with the XML parsing, but rather with the roListScreen's ability (or lack of) to display unicode characters. If the XML was parsing incorrectly, then it wouldn't display correctly on any screen.

Nope, TheEndless -
I don't have it backwards: see on top where i said "it comes as ASC(mid(s, .., 1))=8226 on fw5 and as 3 chars with ASC( )=-30, -128, -94 on fw3 (which seems to be the UTF8 encoding of U+2022: E2 80 A2)" - meaning that if you look at the string in BS console, character-by-character - on fw3 you get THREE characters: chr(&hE2), chr(&h80) and chr(&hA2) instead of one, the bullet. You can check that E2 is â and A2 is ¢. The result of reading and parsing the URL is that the strings differ. No, really.

I still don't think it's the XML parser, but rather a general 3.1 issue, possibly with the roString component. See greubel's post here (which you, incidentally, participated in): viewtopic.php?f=34&t=66183&p=423559#p423559

Also, FWIW, printing to the console is not a valid way to test the result. Obviously, counting the chars is valid, but the output in the console isn't, as I regularly see unicode characters "print" completely differently in the console, while working fine everywhere in app.

My Channels: http://roku.permanence.com - Twitter: @TheEndlessDev
Instant Watch Browser (NetflixIWB), Aquarium Screensaver (AQUARIUM), Clever Clocks Screensaver (CLEVERCLOCKS), iTunes Podcasts (ITPC), My Channels (MYCHANNELS)

belltown · ‎07-29-2014

"EnTerr" wrote:

Nope, TheEndless -
I don't have it backwards: see on top where i said "it comes as ASC(mid(s, .., 1))=8226 on fw5 and as 3 chars with ASC( )=-30, -128, -94 on fw3 (which seems to be the UTF8 encoding of U+2022: E2 80 A2)" - meaning that if you look at the string in BS console, character-by-character - on fw3 you get THREE characters: chr(&hE2), chr(&h80) and chr(&hA2) instead of one, the bullet. You can check that E2 is â and A2 is ¢. The result of reading and parsing the URL is that the strings differ. No, really.

If you'd done what I'd suggested above (use an roByteArray instead of Asc, Chr, etc.), you'd see that you do indeed have it backwards.

Using both 3.1 and 5.5 firmware you get the exact same results after parsing the XML - 3 bytes: 226, 128, 162 [or in hex: E280A2].

Your assertion that ASC() gives a different result on 3.1 vs 5.5 does not prove your point. More likely, ASC handles non-ASCII UTF-8 characters differently on the different firmware versions. (And, according to the documentation, it isn't even designed to deal with non-ASCII characters anyway.)

Try this:


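' Parse the feed (xmlWTFData is the raw XML string)
xml = CreateObject ("roXMLElement")
xml.Parse (xmlWTFData)
' Copy the raw bytes of the text node into a byte array
ba = CreateObject ("roByteArray")
ba.FromAsciiString (xml.whatever.GetText ())
' Print the bytes as a list, then as a hex string
?ba
?ba.ToHexString ()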
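
For readers without a Roku at hand, the byte values quoted in this thread can be checked off-device. This is a minimal sketch in Python (not BrightScript), and the names `bullet`, `utf8`, `signed` and `mojibake` are made up for illustration. It only shows that 226, 128, 162 and -30, -128, -94 are the same three UTF-8 bytes of U+2022, read as unsigned and as signed 8-bit values; it says nothing about what either firmware does internally.

```python
# U+2022 BULLET, the character under discussion
bullet = "\u2022"

# UTF-8 encoding: the three bytes reported from the roByteArray dump
utf8 = bullet.encode("utf-8")
print(list(utf8))   # unsigned view of each byte
print(utf8.hex())   # same bytes as a hex string

# Reinterpret each byte as a signed 8-bit value, which is what the
# negative ASC() results reported on fw3 look like
signed = [b - 256 if b > 127 else b for b in utf8]
print(signed)

# Misreading the same bytes as Latin-1 yields one character per byte
mojibake = utf8.decode("latin-1")
print([hex(ord(c)) for c in mojibake])
```

The last step is the usual way a bullet turns into "â", an invisible control character and "¢" when UTF-8 bytes are treated as single-byte characters.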

EnTerr · ‎07-29-2014

"TheEndless" wrote:
I still don't think it's the XML parser, but rather a general 3.1 issue, possibly with the roString component. See greubel's post here (which you, incidentally, participated in): viewtopic.php?f=34&t=66183&p=423559#p423559

Also, FWIW, printing to the console is not a valid way to test the result. Obviously, counting the chars is valid, but the output in the console isn't, as I regularly see unicode characters "print" completely differently in the console, while working fine everywhere in app.

I have no argument with you on either of these:
1) I too, suspect the problem is not in the XML parser. Incidentally, in greubel's thread you can see i kinda know WTF-8 i am talking about 8-)
2) Never in this thread did i rely on console output of unicode characters. i used ASC(), which is a BRS function.

EnTerr · ‎07-29-2014

"belltown" wrote:
If you'd done what I'd suggested above (use an roByteArray instead of Asc, Chr, etc.), you'd see that you do indeed have it backwards.

The reason i did not respond to you is that i am waiting to get my hands back on a roku box and test what you proposed. I neither missed nor ignored what you said earlier.

destruk · ‎07-29-2014

Does Roku support WTF-8 encoding? 😉

EnTerr · ‎07-29-2014

"destruk" wrote:
Does Roku support WTF-8 encoding? 😉

That's apparently the one that firmware 3 uses, yes. Pun intended. 😛

belltown · ‎07-29-2014

Taking a wild guess here, I'd say that internally at some level, Roku/Brightscript stores strings in a UTF-16 encoding. That's probably the easiest way of handling multibyte encodings such as UTF-8 if you're providing functions such as Len, Mid, etc. that need to know about character indices vs byte indices -- provided, of course, that you don't support character sets outside of the Unicode basic multilingual plane, which as far as I can tell the Roku does not (it just seems to ignore them). It might explain why in later firmware versions, Asc returns 8226; it's the 16-bit UTF-16/UCS-2 encoded value for the BULLET (U+2022) code-point. Earlier firmware probably just iterated through each byte when using Mid/Asc, which is why you'd see 3 separate bytes when looking at a multi-byte UTF-8 string. When those strings get passed to Roku components that display them on the screen, it shouldn't matter how Roku stores them internally; it should convert as necessary -- unless there is a WTF bug as there appears to be in the 3.1 roListScreen component specifically.

Just another explanation of why you may think you're looking at different XML parser output, but actually it's really the same exact bytes in each case, just presented differently through different interfaces (ASC being one of them). Remember that a few years ago the Roku had no support for anything other than ASCII except for maybe the manifest files. Adding full Unicode support to an existing platform that only supports ASCII is a major undertaking, and I wouldn't be surprised if some parts slipped through the cracks along the way.

EnTerr · ‎07-30-2014

"belltown" wrote:
The bullet character (U+2022) is a single "character" that occupies 2 "bytes" whether stored in a file, or a Brightscript string or byte array, or used to determine which glyph from the appropriate font to display on the screen. So the statement "there shouldn't be UTF-8 remaining after a file has been read" doesn't make any sense. How else would you represent a 2-byte character?

Let's clarify this a little:

the bullet character has been assigned a Unicode code point (numeric value) of 8226 (0x2022).

Code points in Unicode are in the range [0, 0x10FFFF], which means at least 21 bits of storage are needed to identify a character

Working in units of 21 bits would be awfully awkward, since computers like to work in units of 8 (bytes). In a power-of-2-number (1, 2, 4) of octets, even.

Thus, in practice the following are used:
1. UCS-4/UTF-32 - where a single 32-bit word can represent ANY codepoint. It is simple to work with but grossly memory inefficient; rarely if ever used
2. UTF-16, which is a variable-length encoding, where a character can be encoded by one or two 16-bit words. In particular all assigned code-points from "basic multilingual plane" (0 - 0xFFFF) are encoded in 1 word. The original intent of Unicode was to use 16 bits for everything and maybe would have stayed that way, were it not for PRC demanding standard GB 18030 in 2000. UTF-16 strikes good balance between speed and compactness, mostly used for strings in RAM. Plagued by little- vs big-endian issues.
3. UTF-8, a variable-length encoding where each character is encoded by a sequence of 1, 2, 3 or 4 bytes. It is very cleverly designed and very compatible with older systems. All ASCII (0-127) characters are encoded as 1 byte - the very same ASCII code, unchanged. And all other characters are encoded by 2 or more bytes in the 128-255 range. No endianness issues. Compact storage, at the cost of slower string manipulation. Mostly used for external data (files/streams).
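
The size differences between the three encodings can be seen with a short Python sketch (Python rather than BrightScript so it runs anywhere; `encoded_sizes` is a helper invented for this example). It encodes an ASCII letter, the bullet, and one code point outside the basic multilingual plane for contrast.

```python
def encoded_sizes(ch):
    """Return how many bytes ch occupies in UTF-8, UTF-16 and UTF-32."""
    return {
        "utf-8": len(ch.encode("utf-8")),
        # the -le variants avoid counting a byte-order mark
        "utf-16": len(ch.encode("utf-16-le")),
        "utf-32": len(ch.encode("utf-32-le")),
    }

# The largest Unicode code point, 0x10FFFF, needs 21 bits
print((0x10FFFF).bit_length())

print(encoded_sizes("A"))           # ASCII letter
print(encoded_sizes("\u2022"))      # BULLET, inside the BMP
print(encoded_sizes("\U0001F600"))  # outside the BMP: two 16-bit words in UTF-16
```

For the bullet this gives 3 bytes in UTF-8, 2 in UTF-16 and 4 in UTF-32, so whether it looks like a "2-byte" or a "3-byte" character depends entirely on which encoding is being counted.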

Back on topic: "•" is not a "2-byte character", one that "occupies 2 bytes whether stored in a file, or a Brightscript string or byte array". To wit, bullet takes 3 bytes (\xE2 \x80 \xA2) when UTF-8 encoded in file (http stream here). Also 3 bytes in roByteArray (i tried what you suggested and indeed, .fromAsciiString() is consistent on both firmwares). What it occupies internally as a B/S string is open to speculation, but experiments with MID() and LEN() seem to indicate the underlying implementations differ in fw3 vs fw5. More on that when i reply to your last msg, i'll take a breather from ranting now 🙂

belltown · ‎07-30-2014

Yes, I should have said 3 bytes. As for your "ranting" - WTF?

EnTerr · ‎07-31-2014

"belltown" wrote:
Taking a wild guess here, I'd say that internally at some level, Roku/Brightscript stores strings in a UTF-16 encoding. That's probably the easiest way of handling multibyte encodings such as UTF-8 if you're providing functions such as Len, Mid, etc. that need to know about character indices vs byte indices -- provided, of course, that you don't support character sets outside of the Unicode basic multilingual plane, which as far as I can tell the Roku does not (it just seems to ignore them). It might explain why in later firmware versions, Asc returns 8226; it's the 16-bit UTF-16/UCS-2 encoded value for the BULLET (U+2022) code-point. Earlier firmware probably just iterated through each byte when using Mid/Asc, which is why you'd see 3 separate bytes when looking at a multi-byte UTF-8 string.

I think our views started converging here. From what i have seen, seems that

in fw3, BRS strings internally are UTF-8 (probably plain-old C char* ASCIIZ)

in fw5, BRS strings internally are UCS-2, based on a "wide character" type (wchar_t/char16_t)
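
As a sanity check on this guess, here is a Python sketch that models the two hypothesized representations and mimics what Len and Asc would report under each. The functions `len_fw3`, `asc_fw3`, `len_fw5` and `asc_fw5` are invented for illustration; they are an assumption about observable behavior, not a description of how either firmware is actually written.

```python
import struct

s = "\u2022"  # the bullet from the feed

# Hypothesis for fw3: a plain array of signed chars holding UTF-8
fw3 = s.encode("utf-8")

def len_fw3(buf):
    return len(buf)  # counts bytes, not characters

def asc_fw3(buf, i):
    # 'b' unpacks one byte as a signed 8-bit integer
    return struct.unpack("b", buf[i:i + 1])[0]

# Hypothesis for fw5: an array of 16-bit code units (UCS-2)
raw16 = s.encode("utf-16-le")
fw5 = struct.unpack("<%dH" % (len(raw16) // 2), raw16)

def len_fw5(units):
    return len(units)  # counts 16-bit units

def asc_fw5(units, i):
    return units[i]

print(len_fw3(fw3), [asc_fw3(fw3, i) for i in range(len_fw3(fw3))])
print(len_fw5(fw5), [asc_fw5(fw5, i) for i in range(len_fw5(fw5))])
```

Under the first model the bullet has length 3 with values -30, -128, -94; under the second it has length 1 with value 8226, which matches the numbers reported earlier in the thread for fw3 and fw5.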

When those strings get passed to Roku components that display them on the screen, it shouldn't matter how Roku stores them internally; it should convert as necessary -- unless there is a WTF bug as there appears to be in the 3.1 roListScreen component specifically.

Agreed. In light of your disclosure that UTF-8 chars work fine in roSpringboard and roImageCanvas - seems clear the fix needed is in the way roListScreen displays text, i.e. "upgrade" it.

Re: UTF-8, XML, URL, WTF etc
