OK, since I have a history of being misunderstood, let me start with:
- Disclaimer #1: this may be of interest only to a few people - those who have to deal with large amounts of text data (say massive XML/HTML) in low-memory situations (pre-2013 Rokus).
- Disclaimer #2: the exact way this "device" functions may seem non-obvious, confusing, or bizarre. But it works. Personally, I enjoy understanding the mechanics; you don't have to.
Let's start with a motivating example: say you have a behemoth data structure with lots of strings, and memory is running low. Say it's a 2.5MB XML file of about 100k elements, 100k attributes and 50k leaf texts - which is approximately the maximum that a "pre-2013" Roku can parse without crashing. But then you realize there are many repeats of the same strings, eating memory for no reason. In the case of *ML, many (most!) tag names, attribute names/values, and texts repeat.
So... if it were possible to use the same string object for every occurrence of the same name, then hundreds of thousands of duplicated strings won't be taking up RAM. That's what I was thinking today, and suddenly I realized not only is it possible, it can be automated. And the way it is done is wacky. Lo and behold, the "deduper" device:
function dedupe(s as String, deduper as Object) as String
    _s = deduper[s]
    if _s = invalid then
        deduper[s] = s   ' first occurrence: store it as the canonical copy
        _s = deduper[s]  ' read it back, so we return the stored object
    end if
    return _s
end function
'inlined form of the same:
' _s = deduper[s]: if _s = invalid then deduper[s] = s: _s = deduper[s]
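One subtlety worth spelling out (a sketch of the pitfall, not from the demo below): roAssociativeArray keys are case-insensitive by default, so a deduper created as a plain { } would treat "Title" and "title" as the same string and silently corrupt your data. That is why the deduper AA must be switched to case-sensitive mode before use:

    ' AA keys are case-insensitive by default - "Title" and "title" share a slot:
    aa = { }
    aa["Title"] = "Title"
    ? aa["title"]   ' finds the "Title" entry - not what a deduper wants!

    ' hence the deduper must be created like this:
    deduper = { }
    deduper.setModeCaseSensitive()

After setModeCaseSensitive(), "Title" and "title" occupy distinct slots, so distinct-case strings dedupe to distinct objects.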
And here is a demonstration of use: first we create an array with a million "newstring"s - and we see 1,000,000 BSC objects are kept:
BrightScript Debugger> a = [ ]
BrightScript Debugger> old = runGarbageCollector().count: for i=1 to 1000000: s = "new" + "string": a.push(s): end for: ? runGarbageCollector().count - old
1000000
BrightScript Debugger> ? a.count(), a[0], a[999999]
1000000 newstring newstring
Now the same thing but with deduping - note how only one object was used instead of a million:
BrightScript Debugger> deduper = { } : deduper.setModeCaseSensitive()
BrightScript Debugger> a = [ ]
BrightScript Debugger> old = runGarbageCollector().count: for i=1 to 1000000: s = "new" + "string": a.push(dedupe(s, deduper)): end for: ? runGarbageCollector().count - old
1
BrightScript Debugger> ? a.count(), a[0], a[999999]
1000000 newstring newstring
BrightScript Debugger> ? deduper
newstring: newstring
I added deduping code to my homebrew XML parser, and thanks to the decreased memory use it can now parse ~2x bigger XML than roXmlElement can (before rebooting the player).
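For the curious, here is roughly how the deduper plugs into a parser - a hypothetical sketch, not my actual parser code (node, attrs, tagName, attrName, attrValue and leafText are made-up names for illustration):

    ' one deduper shared across the whole parse:
    deduper = { }
    deduper.setModeCaseSensitive()

    ' wherever the parser produces a tag name, attribute name/value or leaf text,
    ' pass it through dedupe() before storing it in the tree:
    node.name = dedupe(tagName, deduper)
    attrs[dedupe(attrName, deduper)] = dedupe(attrValue, deduper)
    node.text = dedupe(leafText, deduper)

The point is that every repeated name/value collapses to one shared string object, so the memory cost of a tag like "item" appearing 50,000 times is paid once.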
Questions/comments?
I wonder, after writing all this: does anyone here deal with large amounts of text data?