Roku Developer Program

EnTerr
Level 8

UTF-8, XML, URL, WTF etc

I am having a problem with XML downloaded from a URL being parsed differently on fw3 and fw5. Namely, a firmware 3 box seems to disregard the UTF-8 encoding. Here is what i am doing:

xfer = createObject("roURLTransfer")
xfer.setURL(url)
data = xfer.getToString()
xml = createObject("roXmlElement")
xml.parse(data)
And then from the element i need, getText() returns the right thing on fw 5 and multi-character mishmash on fw 3. Considering how straightforward the code is, i can't think of anything wrong i might be doing.

Some more details:

  • HTTP response has "Content-Type:application/rss+xml; charset=utf-8"

  • XML starts with "<?xml version="1.0" encoding="UTF-8"?>"

  • the element in question, in the raw, is "<description><![CDATA[(408) ••• •••• 7days [...]]]></description>",
    displays as "(408) ••• •••• 7days [...]" on fw5 and
    something like "(408) â€¢â€¢â€¢ â€¢â€¢â€¢â€¢ 7days [...]" on fw3.

  • the representative character i checked is "•", AKA bullet (&bull; &#8226; U+2022) - it comes as ASC(mid(s, .., 1))=8226 on fw5 and as 3 chars with ASC( )=-30, -128, -94 on fw3 (which seems to be the UTF-8 encoding of U+2022: E2 80 A2)
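
The arithmetic in that last point can be checked by hand. The following is only a sketch: codePointFrom3Bytes is a hypothetical helper name, and it assumes Asc() reports bytes above 127 as negative numbers, as observed on fw3.

```brightscript
' Hypothetical helper: rebuild a code point from the three Asc() values of a
' 3-byte UTF-8 sequence (bit pattern 1110xxxx 10xxxxxx 10xxxxxx).
function codePointFrom3Bytes(a as Integer, b as Integer, c as Integer) as Integer
    ' Fold signed byte values (-128..-1) back into 128..255
    if a < 0 then a = a + 256
    if b < 0 then b = b + 256
    if c < 0 then c = c + 256
    ' Keep 4 payload bits of the lead byte and 6 of each continuation byte
    return (a and &h0F) * 4096 + (b and &h3F) * 64 + (c and &h3F)
end function

sub Main()
    ' -30, -128, -94 are the bytes E2 80 A2 read as signed values
    print codePointFrom3Bytes(-30, -128, -94)
end sub
```

For the values above this works out to 2*4096 + 0*64 + 34 = 8226, i.e. U+2022.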

So, how can i get the right string on fw 3?
Can i read the URL any differently, or is there a setting i can trigger? (I have zero control over the server - but it is sending the right format anyway).
19 Replies
belltown
Level 7

Re: UTF-8, XML, URL, WTF etc

What component are you displaying the data with? I just tried putting a "•" in an XML file and it parsed and displayed correctly in an roSpringboard (ArtistName field) and on an roImageCanvas (Text) using both 3.1/1027 and 5.5/320.
https://github.com/belltown/
EnTerr
Level 8

Re: UTF-8, XML, URL, WTF etc

"belltown" wrote:
What component are you displaying the data with? I just tried putting a "•" in an XML file and it parsed and displayed correctly in an roSpringboard (ArtistName field) and on an roImageCanvas (Text) using both 3.1/1027 and 5.5/320.

roListScreen (ShortDescriptionLine1, i believe).
If you loop over the string [for i=0 to len(s): c=mid(s,i,1): ? c, asc(c): end for] on fw3, what's there?
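
Written out in full, that one-line loop might look like the sketch below. dumpChars is a hypothetical name, and the loop starts at 1 on the assumption that Mid() positions are 1-based, as the BrightScript reference describes; a loop starting at 0 may visit the first character twice.

```brightscript
' Hypothetical helper: print every character position in s with its Asc()
' value, folded into 0..255 when Asc() reports a negative (signed byte) value.
sub dumpChars(s as String)
    for i = 1 to Len(s)
        code = Asc(Mid(s, i, 1))
        if code < 0 then code = code + 256
        print i; " "; code
    end for
end sub
```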
belltown
Level 7

Re: UTF-8, XML, URL, WTF etc

"EnTerr" wrote:
"belltown" wrote:
What component are you displaying the data with? I just tried putting a "•" in an XML file and it parsed and displayed correctly in an roSpringboard (ArtistName field) and on an roImageCanvas (Text) using both 3.1/1027 and 5.5/320.

roListScreen (ShortDescriptionLine1, i believe).
If you loop over the string [for i=0 to len(s): c=mid(s,i,1): ? c, asc(c): end for] on fw3, what's there?

Using 3.1:

-30
-30
€ -128
¢ -94

Using 5.5:

€¢ 8226
€¢ 8226

^^^ That was using Firefox. IE shows:

Using 3.1:

-30 -30 € -128 ¢ -94

Using 5.5:

€¢ 8226

But they both are displayed as "•" on the screen.
https://github.com/belltown/
EnTerr
Level 8

Re: UTF-8, XML, URL, WTF etc

OK, so when the bullet (U+2022) is read from XML on fw 5 and fw 3, you get the same results as me. (The exact way IE or FF shows the component octets does not matter.)

Why roSpringboard displays 3 separate characters as a bullet, while roListScreen does not, is a mis(t)ery to me. Would roSpringboard not display chr(8226) as a bullet too (am not near a Roku now)?
belltown
Level 7

Re: UTF-8, XML, URL, WTF etc

I just did another test, this time with an roListScreen. I get the same results as you. The bullet (and other WTF characters) do NOT display correctly using 3.1 on an roListScreen. They display correctly on 5.5. They also display correctly using roSpringboard and roImageCanvas under both 3.1 and 5.5.

It looks like the implementation of roListScreen UTF-8 encoding is flawed under 3.1
https://github.com/belltown/
EnTerr
Level 8

Re: UTF-8, XML, URL, WTF etc

"belltown" wrote:
It looks like the implementation of roListScreen UTF-8 encoding is flawed under 3.1

It's not that simple. If done properly, there shouldn't be any UTF-8 remaining after a file has been read. The concept of encoding is "external" to the computer, it is how files are stored - while strings should consist of Unicode characters. The exemplary bullet should be a single character in the string, chr(8226) - just like it is on fw 5. That's why i am confused by roSpringboard; it seems like somebody stuck a band-aid hack there to cover for URLs/XML being read as a strictly single-byte encoding on fw3 (while read correctly on fw5). I need to try what [s = chr(8226): ? asc(s), len(s)] will do on fw3.
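
That check could be wrapped up as below. This is a sketch only: keepsCodePoints is a hypothetical name, and what Chr(8226) actually returns on fw3 is exactly the open question, so no result is claimed here.

```brightscript
' Hypothetical probe: does this firmware hold U+2022 as ONE character?
function keepsCodePoints() as Boolean
    s = Chr(8226)
    ' True only if the string has length 1 and round-trips through Asc()
    return (Len(s) = 1) and (Asc(s) = 8226)
end function

sub Main()
    print keepsCodePoints()
end sub
```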
TheEndless
Level 7

Re: UTF-8, XML, URL, WTF etc

"EnTerr" wrote:
"belltown" wrote:
It looks like the implementation of roListScreen UTF-8 encoding is flawed under 3.1

It's not that simple. If done properly, there shouldn't be any UTF-8 remaining after a file has been read. The concept of encoding is "external" to the computer, it is how files are stored - while strings should consist of Unicode characters. The exemplary bullet should be a single character in the string, chr(8226) - just like it is on fw 5. That's why i am confused by roSpringboard; it seems like somebody stuck a band-aid hack there to cover for URLs/XML being read as a strictly single-byte encoding on fw3 (while read correctly on fw5). I need to try what [s = chr(8226): ? asc(s), len(s)] will do on fw3.

I think you may have it backwards. I suspect that the issue is not with the XML parsing, but rather with the roListScreen's ability (or lack of) to display unicode characters. If the XML was parsing incorrectly, then it wouldn't display correctly on any screen.
My Channels: http://roku.permanence.com - Twitter: @TheEndlessDev
Instant Watch Browser (NetflixIWB), Aquarium Screensaver (AQUARIUM), Clever Clocks Screensaver (CLEVERCLOCKS), iTunes Podcasts (ITPC), My Channels (MYCHANNELS)
belltown
Level 7

Re: UTF-8, XML, URL, WTF etc

"TheEndless" wrote:

I think you may have it backwards. I suspect that the issue is not with the XML parsing, but rather with the roListScreen's ability (or lack of) to display unicode characters. If the XML was parsing incorrectly, then it wouldn't display correctly on any screen.

That's what I was trying to say. The roListScreen does not seem to display non-ASCII UTF-8 characters correctly under the 3.1 firmware, but does display them under 5.5. Other components (roSpringboardScreen and roImageCanvas) seem to display them correctly under both 3.1 and 5.5.

I don't think the XML parsing is an issue. The bullet character (U+2022) is a single "character" that occupies 2 "bytes" whether stored in a file, or a BrightScript string or byte array, or used to determine which glyph from the appropriate font to display on the screen. So the statement "there shouldn't be UTF-8 remaining after a file has been read" doesn't make any sense. How else would you represent a 2-byte character?

Also, using the BrightScript Asc() and Chr() functions to debug this problem may not be entirely appropriate, as the BrightScript Language Reference states that these functions operate on ASCII characters. A Byte Array might be more appropriate if you want to see the actual bytes resulting from the XML parsing.
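
One way to look at the raw bytes without going through a string at all is to save the response to a file and load it into an roByteArray. This is a rough sketch: fetchAsHex, the tmp:/feed.xml file name and the example URL are placeholders, and it shows the bytes as sent by the server rather than the output of the XML parser.

```brightscript
' Hypothetical helper: download url to tmp:/ and return the bytes as hex text
function fetchAsHex(url as String) as String
    xfer = CreateObject("roUrlTransfer")
    xfer.SetUrl(url)
    xfer.GetToFile("tmp:/feed.xml")      ' arbitrary scratch file name

    ba = CreateObject("roByteArray")
    ba.ReadFile("tmp:/feed.xml")
    return ba.ToHexString()
end function

sub Main()
    dump = fetchAsHex("http://example.com/feed.xml")   ' placeholder URL
    ' Position of the first E2 80 A2 in the hex text, or 0 if there is none.
    ' Each byte is two hex digits, so a byte-aligned match starts at an odd position.
    print Instr(1, UCase(dump), "E280A2")
end sub
```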
https://github.com/belltown/
EnTerr
Level 8

Re: UTF-8, XML, URL, WTF etc

"TheEndless" wrote:
I think you may have it backwards. I suspect that the issue is not with the XML parsing, but rather with the roListScreen's ability (or lack of) to display unicode characters. If the XML was parsing incorrectly, then it wouldn't display correctly on any screen.

Nope, TheEndless -
I don't have it backwards: see at the top where i said "it comes as ASC(mid(s, .., 1))=8226 on fw5 and as 3 chars with ASC( )=-30, -128, -94 on fw3 (which seems to be the UTF8 encoding of U+2022: E2 80 A2)" - meaning that if you look at the string in the BS console, character by character, on fw3 you get THREE characters: chr(&hE2), chr(&h80) and chr(&hA2), instead of one, the bullet. You can check that E2 is â and A2 is ¢. The result of reading and parsing the URL is that the strings differ. No, really.