Roku Developer Program

Developers and content creators—a complete solution for growing an audience directly.
cancel
Showing results for 
Search instead for 
Did you mean: 
EnTerr
Level 8

String de-duper device

OK, since i have history being misunderstood - let me start with:
- Disclaimer#1: this may be of interest only to a few people - those who have to deal with big amounts of text data (say massive XML/HTML) in low-memory situations (pre-2013 Rokus)
- Disclaimer#2: the exact way this "device" functions may seem non-obvious, confusing or bizarre. But it works. Personally I enjoy understanding the mechanics, you don't have to.

Let's start with a motivating example: say you have a behemoth data structure with lots of strings, memory is running low. Say it's a 2.5MB XML file of about 100k elements, 100k attributes and 50k of leaf texts - which approximately is the maximum that a "pre-2013" Roku can parse without crashing. But then suddenly you realize there are many repeats of the same strings that eat memory for no reason. On the example of *ML there are many (most!) repeating tag names, attribute names/values and texts.

So... if it was possible to use the same string object for every occurrence of the same name, then hundreds of thousands of duplicated strings wont be taking up RAM. That's what i was thinking today and suddenly realized not only is it possible but can be automated. And the way it is done is wacky. Lo and behold, the "deduper" device:

function dedupe(s as String, deduper as Object):
_s = deduper[s]
if _s = invalid:
deduper[s] = s
_s = deduper[s]
end if
return _s
end function

'inlined form of the same:
' _s = deduper[s]: if _s = invalid then deduper[s] = s: _s = deduper[s]


And here is demonstration of use: first we create an array with a million "newstring"s - and we seen 1,000,000 BSC objects are kept:
BrightScript Debugger> a = [ ]
BrightScript Debugger> old = runGarbageCollector().count: for i=1 to 1000000: s = "new" + "string": a.push(s): end for: ? runGarbageCollector().count - old
1000000
BrightScript Debugger> ? a.count(), a[0], a[999999]
1000000 newstring newstring

Now the same thing but with deduping - note how only one object was used instead of a million:
BrightScript Debugger> deduper = { } : deduper.setModeCaseSensitive() 
BrightScript Debugger> a = [ ]
BrightScript Debugger> old = runGarbageCollector().count: for i=1 to 1000000: s = "new" + "string": a.push(dedupe(s, deduper)): end for: ? runGarbageCollector().count - old
1
BrightScript Debugger> ? a.count(), a[0], a[999999]
1000000 newstring newstring
BrightScript Debugger> ? deduper
newstring: newstring


I added deduping code to my homebrew xml parser and because of the decreased memory use now it can parse ~2x bigger XML than roXmlElement can (before rebooting the player).

Questions/comments?
I wonder after writing all this, does anyone here deal with big amounts of text data?
0 Kudos
4 Replies
belltown
Level 7

Re: String de-duper device

"EnTerr" wrote:
the exact way this "device" functions may seem non-obvious, confusing or bizarre

It's not too mysterious. In the first case, when you use s = "new" + "string": a.push(s) in a loop, the use of a string in an expression (e.g. concatenation) other than a simple assignment results in the creation of a new BrightScript object, an roString. This object is pushed into your array. The object has a reference count of 1, since the array is referencing it. The next time through the loop, a new roString is created. The previous roString object cannot be deleted as it is still referenced from the array, so a new roString is created, and also added to the array with a reference count of one.

A similar example would be: x="A"+"B" : y=x : x="C"+"D" [in this case you end up with 2 new objects; the "A"+"B" object is not deleted as it is referenced from y]
But here: x="A"+"B" : x="C"+"D" [you only end up with one new object; the "A"+"B" object is deleted as it is no longer referenced from anywhere]
But in this case: x="A" : y=x : x="C" [no objects are created; a string literal used by itself is an intrinsic type and is copied into x and y, not reference-counted]

The above behavior is explained (although not very well) in the BrightScript Reference sections 3.2 and 3.7.10.

In your second case, you are creating a new roString object, s, each time through the loop. However, you are not referencing this object from your array. Instead, you're creating a new object and adding it to your deduper associative array if it does not already exist there. You're then pushing the value from the deduper associative array into your array, allowing the object assigned to s to be deleted the next time through the loop.

Basically, you're only storing one copy of each string (actually two copies, since the associative array stores the same string data as both the key and the value). I can see how this would save memory if you have many string values that each occur many times.

As far as using large amounts of text data is concerned, my BrightScript code doesn't generally need to handle more than around 250K of Xml, which it natively parses using roXmlElement.Parse () on my Roku 2 in about 350ms. However, I've run into problems on other platforms generating large Xml output where I've had to change the way I handle Xml. For example, with PHP I had to switch from using DOMDocument to XMLWriter when generating large Xml output because the size of the DOM tree was just too large for my server to handle.

I'd be curious to know how long it would take to parse your 2.5MB Xml file with or without roXmlElement.
https://github.com/belltown/
0 Kudos
EnTerr
Level 8

Re: String de-duper device

"belltown" wrote:
It's not too mysterious.
...
Basically, you're only storing one copy of each string (actually two copies, since the associative array stores the same string data as both the key and the value). I can see how this would save memory if you have many string values that each occur many times.

And right* you are, sir. My worry was i wont do a good job explaining it - and you did better job that i probably could, so thank you!
The important part is that a dictionary is used as a "filter" of sorts to ensure that the very same object will be returned for a set of strings that are all equal to each other textually (as determined by ifAssociativeArray.Lookup() behind the scenes).

I'd be curious to know how long it would take to parse your 2.5MB Xml file with or without roXmlElement.

Well, err, umm, uhh - my homebrew parser is not blazing fast, that's for sure. For pure BrS it is but pales to one in C. On Roku3, a roXmlElement.parse(string_2mb_long) takes ~1000ms - and my DIY takes 16x longer. So it is not practical for files of that size (unless i write cooperative multitasking framework for Roku, so it backgrounds...). I was enthusing that i could beat "the big guy" in a corner case. However I want to be able to parse things like "<x>this <b>is</b> xml</x>" and roXml version on fw3 cannot do it. My parser can - yet because of speed seems 10K will be the practical limit on a Roku model# < 2200.

(*) Actually, you are super right - since you point out that two copies of each string will be stored: one as a key and one as a value - vs simply one with refcount+=2. That's because hash keys are stored outside the scope of the BS GC/instance manager - something i discovered only this week. The extra memory will be released when "deduper" AA is disposed anyway.
0 Kudos
quartern
Level 7

Re: String de-duper device

Some script interpreters do this themselves. The one I have seen do it is Lua.
If memory serves they used the same string instance weather the string was used as a value or a key
(so you would cut in half your usage if above posts are accurate)

Obviously, by your results BritScript does not but it could easily be made to do so
Private apps: IsraTV (replaces IsraIBA, IsraNews2, IsraI24, Isra10, Isra20)
Users - to report issues with the app (not content of streams please) send me a tweet - @quartern_roku and follow (so we can DM)
0 Kudos
EnTerr
Level 8

Re: String de-duper device

"quartern" wrote:
Some script interpreters do this themselves. The one I have seen do it is Lua.
If memory serves they used the same string instance weather the string was used as a value or a key
(so you would cut in half your usage if above posts are accurate)

Obviously, by your results BritScript does not but it could easily be made to do so

Right - that's called "string interning" and some languages do that for all strings, cue Lua. Other languages (#include Lisp galore) have them as separate types "symbol" or "atom".

There are `+` and `-` to interning, for example equality check for interned strings is extremely fast (heh, it's just a pointer comparison) but concatenation not so much, since they are immutable and a new object has to be constructed.

B/S has both mutable (roString) and immutable strings (e.g. string literals in code). Currently roAssociativeArray keys seem to be interned (good!) but roXmlElement.parse() seems to populate un-interned strings (bad!).

Specifically because of the kind of nails that XML is (ab)used to hammer, there is a massive amounts of string duplication in it. I am talking monstrously, ginormous duplication in element names and attribute names - but also huge in attribute values and element's contained texts. Perhaps not a design flaw of SGML but a pragmatic issue with XML reality. So, i would like to call on RokuCo to:

  • Fix roXmlElement.parse(), so it uses interned strings and thus cuts significantly memory waste.

In result, for very little "money" Roku will be able to parse multi-megabyte XMLs without crashing. (Crashing = undesirable in general)
0 Kudos