Roku Community

belltown · ‎07-19-2016

"renojim" wrote:
I tried changing the font, but it didn't make a difference. I only have Consolas, Lucida Console, and Raster Fonts to choose from. I think there's more going on here than just the font. I don't think the Windows 7 console understands UTF-8.

-JT

10 more days to take advantage of the free upgrade to Windows 10, which has a much improved console experience. In Windows 7, you might be able to get UTF-8 console support by typing chcp 65001 at the command prompt (and using the Consolas or Lucida Console font), although I no longer have a Win7 system to test that out on. Of course, if you're doing this in a debug session you can use PurpleBug, which should print all currency symbols correctly; if not, let me know.

renojim · ‎07-19-2016

"RokuKC" wrote:
"renojim" wrote:

...
...
asc() values:  48  44  55  57  32  8364  <- I thought asc() gave a value <= 255?
String characters represent Unicode code points.
8364 = U+20AC 'EURO SIGN'

I think I'm starting to get this. Internally the string is encoded as UTF-8. When I use asc() to get the ASCII code for the 6th character from the string, which is represented by 3 bytes in UTF-8, I get a Unicode value. Now how all that chooses a character from the font I'm using, I have no idea, but then that's exactly why I wanted to test this in the first place.

-JT

Roku Community Streaming Expert

Help others find this answer and click "Accept as Solution."
If you appreciate my answer, maybe give me a Kudo.

I am not a Roku employee.

renojim · ‎07-19-2016

EnTerr, belltown, keep in mind that you can't test this from the debugger. That's the whole problem. You can only test IAPs from a packaged channel and you can only test locales by creating a user account "in" another country. I'm using an updated and improved* version of my DbgPrint and netcat running in a Windows 7 console to capture the output.

I'll give the Windows 10 console a try, but for various reasons I won't be updating my laptop.

-JT

* - I found that using UDP (roDatagramSocket) works a lot better

belltown · ‎07-19-2016

"renojim" wrote:
Now how all that chooses a character from the font I'm using, I have no idea, but then that's exactly why I wanted to test this in the first place.

Any Unicode "character" can be identified by a number, called a "code point".

The code point for the Euro sign has been assigned the decimal number 8364 (hex 20AC).

The internal representation of a character is determined by its "encoding". One such encoding is UTF-8, which can represent any character as a sequence of one to four 8-bit bytes. The UTF-8 encoding for the Euro sign is the three-byte sequence: E2 82 AC (hex).
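A quick way to see the difference between a code point and its encoding, sketched here in Python rather than BrightScript (Python's ord() and str.encode() are only stand-ins for the concepts, not anything Roku-specific):

```python
# Illustration only: the Euro sign as a code point and as UTF-8 bytes.
euro = "\u20ac"  # EURO SIGN

code_point = ord(euro)             # the number that identifies the character
utf8_bytes = euro.encode("utf-8")  # one possible byte-level encoding of it

print(code_point, hex(code_point))  # 8364 0x20ac
print(utf8_bytes.hex(" "))          # e2 82 ac
```

The code point is a single number; the three bytes only exist once an encoding has been chosen.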

The external representation of a character is called a "glyph". The glyph used to represent a particular character depends on the "font" used to display the characters. A font maps code points to glyphs. The font must contain a glyph representing a character for that character to display correctly. What glyphs the font contains are up to the font designer; there's no uniform standard for that, and some fonts contain more glyphs than others.

To display a character correctly requires use of the correct encoding to read the character's internal byte representation, AND a font that contains a glyph that represents the character.

Your Windows 7 console displayed gibberish because it was not aware that the character was encoded in UTF-8, i.e. it didn't know that the 3-byte sequence E2 82 AC represented the Euro character. It displayed each byte as a separate character because that is what its default encoding called for.
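The effect can be reproduced with a small Python sketch. It assumes the console was using a legacy single-byte code page; 437 and 1252 are used here only as typical examples, since the thread does not say which one was active:

```python
# Illustration only: the same three bytes read with different encodings.
utf8_bytes = "\u20ac".encode("utf-8")  # b'\xe2\x82\xac'

# A single-byte code page turns each byte into its own character.
as_cp1252 = utf8_bytes.decode("cp1252")
as_cp437 = utf8_bytes.decode("cp437")

# Only a UTF-8 decoder reassembles the bytes into one character.
as_utf8 = utf8_bytes.decode("utf-8")

print(len(as_cp1252), len(as_cp437), len(as_utf8))  # 3 3 1
```

Three unrelated characters appear where one Euro sign was intended, which is the kind of gibberish described above.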

BrightScript stores characters internally using the UTF-8 encoding. 'Asc' returns the code point number that the character's UTF-8 byte sequence encodes (regardless of what the documentation says). 'Len' returns the number of characters, not bytes (which is correct in the documentation).
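The asc() values renojim posted (48 44 55 57 32 8364) correspond to the string "0,79 €". A Python sketch of the same character-versus-byte distinction follows; ord() and len() here are Python's own functions, used only as rough analogues of Asc and Len:

```python
# Illustration only: characters vs. bytes for the string from the thread.
s = "0,79 \u20ac"

code_points = [ord(ch) for ch in s]
print(code_points)             # [48, 44, 55, 57, 32, 8364]
print(len(s))                  # 6 characters
print(len(s.encode("utf-8")))  # 8 bytes (the Euro sign takes 3)
```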

EnTerr · ‎07-19-2016

"renojim" wrote:
I think I'm starting to get this. Internally the string is encoded as UTF-8.

Not exactly. But close enough!

Internally (in memory) it is likely represented as 16-bit wide characters, `wchar_t` or `char16_t`. Long story, but suffice it to say that in the past the Unicode people thought 65536 possible characters "ought to be enough for anybody"... and were proven wrong. The RAM representation doesn't matter, since it's "externalized" to UTF-8 when using WriteAsciiFile() or roURLTransfer.PostFromString() or roByteArray.FromAsciiString().

When I use asc() to get the ASCII code for the 6th character from the string, which is represented by 3 bytes in UTF-8, I get a Unicode value. Now how all that chooses a character from the font I'm using, I have no idea, but then that's exactly why I wanted to test this in the first place.

Yes.

EnTerr · ‎07-19-2016

"belltown" wrote:
BrightScript stores characters internally using the UTF-8 encoding.

You are right on the big picture. But this minor point would be very, very unusual if true. Strings in practice are not stored as UTF-8, because that way figuring out which character is the N-th would be tedious (either count every time from the very beginning or create an index - or be grossly wrong about what mid(beg, len) returns).

Instead, i believe the practice is to store them in UCS-2, with surrogate pairs for characters outside the BMP (code points above 65535). Yeah, that gives the wrong length and indexing for these, but libraries redefine the meaning of "length" for that.
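For reference, this is how a surrogate pair is formed in UTF-16, sketched in Python. The helper name utf16_surrogates and the sample code point 1,000,000 are chosen purely for illustration; nothing here reflects how Roku actually stores strings:

```python
# Illustration only: splitting a non-BMP code point into UTF-16 surrogates.
def utf16_surrogates(code_point):
    """Return the (high, low) surrogate pair for a code point above 0xFFFF."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not outside the BMP")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)    # top 10 bits of the offset
    low = 0xDC00 + (offset & 0x3FF)   # bottom 10 bits of the offset
    return high, low

high, low = utf16_surrogates(1_000_000)
print(hex(high), hex(low))  # 0xdb90 0xde40

# A string holding one such character has more 16-bit units than characters.
s = "123" + chr(1_000_000) + "567"
utf16_units = len(s.encode("utf-16-be")) // 2
print(len(s), utf16_units)  # 7 8
```

A naive length over 16-bit units would report 8 for this 7-character string, which is the mismatch mentioned above.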

belltown · ‎07-19-2016

"EnTerr" wrote:
"belltown" wrote:
BrightScript stores characters internally using the UTF-8 encoding.

You are right on the big picture. But this minor point would be very, very unusual if true. Strings in practice are not stored as UTF-8, because that way figuring out which character is the N-th would be tedious (either count every time from the very beginning or create an index - or be grossly wrong about what mid(beg, len) returns).

That's true. I was just explaining the concepts, not the implementation.

EnTerr · ‎07-19-2016

I may have to eat crow on this one, because this does not seem to fit my theory (it should have mis-behaved counting if broken into surrogate pairs):

BrightScript Debugger> s = "123" + chr(1e6) + "567": ? len(s)
 7

BrightScript Debugger> ? url.escape(s)
123%F3%B4%89%80567

BrightScript Debugger> ? mid(s,7,1)
7

RokuKC, will you tell us, pretty please? 🙂

(@renojim - sorry for the confusion i am causing with this discussion - internal repr. has no bearing on how a character gets displayed!)
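The same three checks can be mirrored in Python as a sanity check on the numbers. Python indexes strings by code point and urllib.parse.quote percent-encodes the UTF-8 bytes, so this only confirms what the debugger output means; it says nothing about the Roku implementation:

```python
# Illustration only: the debugger session above, redone in Python.
from urllib.parse import quote

s = "123" + chr(1_000_000) + "567"

print(len(s))    # 7
print(quote(s))  # 123%F3%B4%89%80567
print(s[6])      # 7  (Python is 0-based; mid(s,7,1) is 1-based)
```

The four escaped bytes F3 B4 89 80 are the UTF-8 encoding of code point 1,000,000, yet the length and the indexing both treat it as a single character.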

EnTerr · ‎07-19-2016

Oh, marmalade!

BrightScript Debugger> tm = createObject("roTimeSpan")
BrightScript Debugger> s = string(2^12, "*"): s = s.replace("*", s)
BrightScript Debugger> tm.mark(): n = len(s): ? tm.TotalMilliseconds()
 147

BrightScript Debugger> s2 = s.replace("*", chr(1e6))
BrightScript Debugger> tm.mark(): n = len(s2): ? tm.TotalMilliseconds()
 591

😉 @belltown

belltown · ‎07-19-2016

"EnTerr" wrote:
Oh, marmalade!

BrightScript Debugger> tm = createObject("roTimeSpan")
BrightScript Debugger> s = string(2^12, "*"): s = s.replace("*", s)
BrightScript Debugger> tm.mark(): ln = len(s): ? tm.TotalMilliseconds()
 147

BrightScript Debugger> s = s.replace("*", chr(1e6))
BrightScript Debugger> tm.mark(): ln = len(s): ? tm.TotalMilliseconds()
 591

😉 @belltown

Yes, len() counts the chars, one by one. So it probably indexes that way too, and maybe does store UTF-8 chars as contiguous bytes internally.
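The timings are at least consistent with that: 591 / 147 is roughly 4, and each chr(1e6) character takes 4 bytes in UTF-8 versus 1 byte for "*". A minimal Python sketch of what a linear scan over UTF-8 bytes could look like is below. The function names utf8_len and utf8_char_at are hypothetical, valid UTF-8 input is assumed, and this is not Roku's actual code:

```python
# Illustration only: length and indexing by scanning UTF-8 bytes.
def utf8_len(data: bytes) -> int:
    """Count characters by skipping continuation bytes (bit pattern 10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)

def utf8_char_at(data: bytes, n: int) -> str:
    """Return the n-th (0-based) character by scanning from the start."""
    starts = [i for i, b in enumerate(data) if (b & 0xC0) != 0x80]
    begin = starts[n]
    end = starts[n + 1] if n + 1 < len(starts) else len(data)
    return data[begin:end].decode("utf-8")

data = ("123" + chr(1_000_000) + "567").encode("utf-8")
print(len(data), utf8_len(data))  # 10 7
print(utf8_char_at(data, 6))      # 7
```

Both functions touch every byte up to the point of interest, so their cost grows with the byte length of the string rather than with the character count.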

Re: How can developer test with different regions?