Roku Developer Program

EnTerr
Roku Guru

roRegex.split() is quadratically slow?!

I have been chugging along with my home-brew parser, but when I started doing some performance testing on bigger files, I noticed it is way too slow. Not just 10-20x slower than roXmlElement - more like a Godzillian bazillion times slower (5 min for 200 KB on a 2050X). And the bigger the file, the worse it gets - not linearly but quadratically. I searched and found the non-linearity culprit to be roRegex.split(): its performance seems to drop as the square of the number of matches in the string.

To demonstrate, here is a mock example (on a middle-of-the-road #3100):
BrightScript Debugger> ti = createObject("roTimespan")
BrightScript Debugger> re = createObject("roRegex", ",", "")
BrightScript Debugger> s = ",a": for i = 1 to 20: ti.mark(): ? re.split(s).count(), ti.TotalMilliseconds(): s = s + s: end for
...
2049 165
4097 540 'prev x 3.3
8193 1995 'x3.7
16385 8259 'x4.1
32769 33991 'x4.1
65537 148553 'x4.4, =2.5 min
131073 630905 'x4.2, =10 min
^C

Notice how each time the number of matches increases 2x, the time spent splitting goes up 4x. So it behaves like a quadratic algorithm, O(n^2), and I don't understand why. The pattern used is trivial (no wildcards) and no memory bounds have been hit - it should have stayed linear. Any ideas?
EnTerr
Roku Guru

Re: roRegex.split() is quadratically slow?!

After some huffing and puffing, I settled on using roString.tokenize() in heavy cases. tokenize() behaves linearly and the numbers look much different:

BrightScript Debugger> s = ",a": for i = 1 to 20: ti.mark(): ? s.tokenize(",").count(), ti.TotalMilliseconds(): s = s + s: end for
...
2048 29
4096 59 'prev x 2
8192 113 'x2
16384 231 'x2
32768 445 'x2
65536 932 'x2
131072 1808 'x2, =1.8 sec
...

I meditated over it and while I don't know for sure why roRegex.split() behaves badly, I have a theory. It might be that the implementation of roRegex.split() is using pcre_exec() with startoffset=0 to cleave the string into two pieces: head (everything before the first match) and tail (the rest of the string). It then pushes the head to the result and repeats: call pcre_exec(code, extra, subject = tail, length, startoffset = 0, options, ovector, ovecsize) and split into a new head & tail; rinse, repeat. This is not wrong per se, but unfortunately for N occurrences of the regex it creates N tails in the process, each tail being a partial copy of the big string, with lengths decreasing from approx. len(str)*(N-1)/N down to len(str)/N - for an average len(tail) = len(str)/2. Since len(str) is itself proportional to N here, that is N * len(str)/2 = const * N^2 = O(N^2) memory operations, and this quadratic behavior gets noticeable for big N. Tricky-tricky. Not to worry, others (Google Dart) got the same regex jab.
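
To make the theory concrete, here is the presumed quadratic strategy sketched in plain BrightScript. The function name is hypothetical and this is only my guess at the shape of the algorithm, not Roku's actual code - note how every iteration copies the whole remaining tail:

function splitViaTailCopies(s as string, delim as string) as object
    result = []
    tail = s
    while true
        pos = Instr(1, tail, delim)
        if pos = 0 then exit while
        result.push(Left(tail, pos - 1))   ' head: copy of everything before the match
        tail = Mid(tail, pos + Len(delim)) ' tail: copies the ENTIRE remainder - the O(N^2) step
    end while
    result.push(tail)
    return result
end function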

Can it be fixed easily - sure, by not making copies but calling pcre_exec with a sliding startoffset instead. But I doubt that will be given any priority, under the presumption that regex.split() would be used only on strings with a small number of pattern repeats. It will be easy to check if it got fixed with the sample code above - but till that imaginary future, I propose this takeout:
Use regex.split(str) only with a small number (<1000) of repeats of regex inside str
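
And for contrast, the linear shape of the fix, sketched in plain BrightScript (again illustrative only - the real fix would live inside the firmware, at the pcre_exec() call): keep a sliding start offset into the one original string and copy out only the fields, never the tail:

function splitViaOffsets(s as string, delim as string) as object
    result = []
    startPos = 1
    while true
        pos = Instr(startPos, s, delim)
        if pos = 0 then exit while
        result.push(Mid(s, startPos, pos - startPos)) ' copy just this field
        startPos = pos + Len(delim)                   ' slide the offset; no tail copy
    end while
    result.push(Mid(s, startPos))
    return result
end function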
adrianc1982
Visitor

Re: roRegex.split() is quadratically slow?!

Still, the problem is that split does a better job parsing any kind of text, while tokenize has to use a delimiter placed where YOU need it, and that's not usually the case when parsing documents or web pages.

I was using split before on my web page and had to choose between delimiters or JSON; in the end I went with JSON, as it's the cleaner way to do it.

I hope .split gets fixed, as it's a useful option for parsing.
belltown
Roku Guru

Re: roRegex.split() is quadratically slow?!

"adrianc1982" wrote:
Still, the problem is that split does a better job parsing any kind of text, while tokenize has to use a delimiter placed where YOU need it, and that's not usually the case when parsing documents or web pages.

I was using split before on my web page and had to choose between delimiters or JSON; in the end I went with JSON, as it's the cleaner way to do it.

I hope .split gets fixed, as it's a useful option for parsing.

For parsing web pages, doing the parsing directly on the Roku can be very cumbersome and slow. One of the approaches I use if I have to do that is to do all the parsing on a server. I'll define my own interface between the server and Roku: some easily parsable Xml. That makes the Roku code extremely trivial, and stable; any changes can be made to the server code. You can do whatever you want on the server and make use of much more powerful parsing tools than you'd find on the Roku. [One of my favorites in PHP is to pass the web page text to the DOMDocument @loadHTML method, from which you create a DOMXPath object and use XPath queries to parse the data -- extremely powerful, and takes very little code compared to doing the same thing on the Roku.] Although if it's your own web page, one that you have control over, then why are you even parsing that from the Roku in the first place rather than returning Xml or Json to the Roku?
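
To give an idea of how trivial the Roku side can get with that arrangement, here's a rough sketch - the URL and element names are invented for illustration; the real ones would be whatever your server interface defines:

function fetchListings() as object
    ' fetch the pre-digested Xml from the intermediate server (hypothetical URL)
    xfer = CreateObject("roUrlTransfer")
    xfer.SetUrl("http://example.com/roku/listings.xml")
    xml = CreateObject("roXmlElement")
    items = []
    if xml.Parse(xfer.GetToString())
        for each item in xml.GetNamedElements("item")
            titles = item.GetNamedElements("title")
            if titles.Count() > 0 then items.Push(titles[0].GetText())
        end for
    end if
    return items
end function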
EnTerr
Roku Guru

Re: roRegex.split() is quadratically slow?!

"adrianc1982" wrote:
Still, the problem is that split does a better job parsing any kind of text, while tokenize has to use a delimiter placed where YOU need it, and that's not usually the case when parsing documents or web pages.
...
I hope .split gets fixed, as it's a useful option for parsing.

Right. It will certainly be better if it is fixed - I was just guessing that they won't give it priority, trying to see the big picture from their corporate point of view/goals.

roString.Tokenize() has its own problems, even in the case of using a simple, unique delimiter. Say I know I am getting 4 fields separated by | - and I am guaranteed they cannot contain the delimiter - e.g. "first_name|mid_name|last_name|phone_no". So I tokenize:
BrightScript Debugger> ? "John||Doe|".tokenize("|")
John
Doe

We get an array of 2 strings as the result - this is because tokenize() collapses any separator repeats (and even more - it drops any empty elements). So I can guess that two of the fields were empty strings. But which ones? Was it 1st and 2nd, or 1st and 3rd, or... there are 6 possibilities (4-choose-2) and no way to tell.

This is because tokenize() takes after C's strtok(), as RokuMarkn explained elsewhere. It is happening in my examples above too - roRegex gets 2049 elements while tokenize() gets 2048 on the same string, losing one. It is a contingency we have to be prepared for when using tokenize().
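
For comparison, a delimiter-preserving split - like the hypothetical splitViaOffsets() I sketched in my earlier reply - keeps the empty fields, so field positions survive:

BrightScript Debugger> ? splitViaOffsets("John||Doe|", "|").count()
4

Four elements - "John", "", "Doe", "" - and no guessing about which fields were empty.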
adrianc1982
Visitor

Re: roRegex.split() is quadratically slow?!

"adrianc1982" wrote:
still the problem is split does a better job parsing any kind of text while tokenize has to use a delimiter placed where YOU need it and thats not usually the case when parsing documents or web pages.

I was using split before on my webpage and had to choose delimiters or jsons in the end i went with jsons as its the cleaner way to do it.

I hope .split gets fixed, as its a useful option for parsing..


Why not parse with the Roku? I was using split and it did a great job with 200-400 elements at a time. With 1000 it could take up to a minute or more to process, but 200-400 would be done in a second or two, which is very okay in my book.

The web page doesn't change and split does what I needed - why write extra code on my server when I can parse without a problem? The catch is that split gets slower with 500 or more elements; that's the problem, but otherwise it does the job fine.

Like you say, in the end I had to identify the Roku request and print out a string that looks like JSON so the Roku can work around split. My post was just to make a little noise on the topic, as this needs to be fixed sometime, and if no one talks about it we won't get it fixed.
adrianc1982
Visitor

Re: roRegex.split() is quadratically slow?!

"EnTerr" wrote:
"adrianc1982" wrote:
still the problem is split does a better job parsing any kind of text while tokenize has to use a delimiter placed where YOU need it and thats not usually the case when parsing documents or web pages.
...
I hope .split gets fixed, as its a useful option for parsing..

Right. Certainly will be better if it is fixed - i was just taking a guess they won't give it priority, when trying to see the big picture from their corporate point of view/goals.

roString.Tokenize() is has its own problems, even in the case of using simple unique delimiter. Say i know i am getting 4 strings separated by | - and i am guaranteed they cannot contain the delimiter. Say it contains "first_name|mid_name|last_name|phone_no". So i tokenize:
BrightScript Debugger> ? "John||Doe|".tokenize("|")
John
Doe

I get array of 2 strings in result - this is because tokenize() ignores any separator repeats (and even more - it drops any empty elements). So i can guess two of the fields were empty strings. But which one? Was it 1st and 2nd, or 1st and 3rd, or... there are 6 possibilities i think (4 choose 2?). There is no way to tell.

This is because tokenize() takes after C's strtok(), explained RokuMarkn elsewhere. And you can see above in my examples, it is happening too - roRegex gets 2049 elements, tokenize() gets 2048 on the same string, losing one. It's a contingency we have to be prepared when using tokenize().


I wouldn't use tokenize for anything serious really; like you say, it has way too many holes, and I'm pretty sure I can find my way around tokenization 🙂
EnTerr
Roku Guru

Re: roRegex.split() is quadratically slow?!

"belltown" wrote:
For parsing web pages, doing the parsing directly on the Roku can be very cumbersome and slow. One of the approaches I use if I have to do that is to do all the parsing on a server. ...

Oh absolutely - if one has control over the server side, it's easier to pick a more digestible form for data than HTML. For that matter, XML is not a data transfer format either. I advocate JSON use instead - it is less verbose (bandwidth) than XML and working with the nested arrays and hashes is intuitive.

I would like to say a good word for roXML though: in certain cases it can parse HTML and its use may be very convenient. Like say this:
BSD> html = createObject("roXmlElement")
BSD> xfer = createObject("roURLTransfer"): xfer.setURL("http://forums.roku.com/viewforum.php?f=34")
BSD> html.parse(xfer.getToString())
BSD> tags = html.body.div.div.div.div.ul.li.dl.dt.a
BSD> for k = 0 to tags.count() - 1 step 2: ? tags[k].getText(): end for

Scheduled Forum Maintenance - 8/5/14
Beta Testers Wanted
New Eclipse Plugin Release 1/7/2014
SDK Docs: links to fix, missing/broken/etc
Where is screensaver documentation?
How Do I Start Developing a Channel?
EnTerr's unanswered questions list
...
Extract a list of this forum's topics in just 2 lines? Why, thank you very much, roXML! The clever definition of methods in roXmlList allows for such XPath-like SELECTs of node subsets. I included the example since I suspect many people may have overlooked that feature, as I did initially.

Now - roXml has issues, too. It does not parse right on fw3. And it cannot select by attributes like ID or CLASS. So I am writing my own parser that is more robust (targeting straight-up HTML), and I have some wonderful ideas about structured queries using dictionary literals as patterns.

I am trying to speed up my homebrew parser and coincidentally, @belltown - I noticed that a couple of years ago you wrote a parser, JSONDecoder, with a focus on speed. Glancing over it I see you use roByteArray and numerals instead of roRegex and strings. Why is that, and do you have "words of wisdom" to share about parser performance in BrSc?
RokuMarkn
Visitor

Re: roRegex.split() is quadratically slow?!

"EnTerr" wrote:
For that matter, XML is not a data transfer format either. I advocate JSON use instead - it is less verbose (bandwidth) than XML and working with the nested arrays and hashes is intuitive.


I really REALLY don't want this to develop into an XML vs. JSON debate, but I couldn't let these statements stand unchallenged to possibly pollute the minds of impressionable young programmers. 😉 Millions of bytes of XML are being successfully transferred across the web every day, so it's not correct to say "XML is not a data transfer format". A more accurate statement is "XML is more verbose than other data representation languages". And it's certainly NOT true, as your linked page states, that XML is a document markup language. That could fairly describe HTML, but XML was designed as a data representation language. It is more verbose but more powerful than JSON. Which is more intuitive is a subjective matter; I'm actually surprised to hear that someone thinks JSON is more intuitive, as I find XML quite readable. Anyway, a blanket statement that JSON is better than XML for data transfer is controversial at best, and if one were in a position to choose a format, one should evaluate all the variables involved and choose the format best suited to the problem.

--Mark
belltown
Roku Guru

Re: roRegex.split() is quadratically slow?!

"EnTerr" wrote:

Oh absolutely - if one has control over the server-side, it's easier to pick a more digestible form for data than HTML

In one case I was thinking of, I have no control over the server from whence the source data is served (some of it HTML, and some of it Json), and certainly wouldn't consider parsing that from the Roku. However, what I was alluding to was using an intermediate server, one that you do have some degree of control over, which parses the source web pages and then generates XML, which it in turn serves to the Roku. My WHATSONCHANNEL, for example, when serving up listings from TV cable or satellite providers, can serve up 3 hours' worth of listings from a couple of hundred channels (something it really wasn't intended for) in less than 5 seconds -- and that uses a PHP script running on a free web host 5,000 miles away in France (with my channel automatically falling back to another - free - server in Texas if the primary server goes down). If I wasn't so cheap I could probably pay for an account on a faster server, but I don't care, because most of the requests from that channel (local OTA TV listings) are served in less than 3 seconds, and no one's complained about that yet. If you're accessing other APIs that you have no control over, this method also gives you the advantage that you can strip out all the data that's not relevant to your Roku channel, and present it in any form you wish.

"EnTerr" wrote:

I would like to say a good word for roXML though: in certain cases it can parse HTML and its use may be very convenient.

That's something I wasn't aware of. That's good to know.

"EnTerr" wrote:

I am trying to speed up my homebrew parser and coincidentally, @belltown - I noticed that a couple of years ago you wrote a parser, JSONDecoder, with a focus on speed. Glancing over it I see you use roByteArray and numerals instead of roRegex and strings. Why is that, and do you have "words of wisdom" to share about parser performance in BrSc?


When I wrote my Json parser, back before Roku had native Json support, and when all I had was a basic Roku 1, I looked at a couple of other (regex-based) Json parsers out there. They either had limitations in the Json that they would parse, or were just very slow, particularly when parsing larger amounts of data. I briefly looked at using roRegex and roString functions, but quickly came to the conclusion that string handling was just very slow no matter what you used, even with "native" roRegex/roString handling. I'm not well versed in the inner workings of regex parsers, but I'd imagine that due to their complexity there's a lot of data access and CPU cycles being used as they scan the input data over and over again trying to match patterns. Also, with regex's, a lot depends on how you use them. Just about any given match can be performed using any number of different regex constructs. Even though I love regex's, I didn't want to spend too much of my time trying to figure out which of the gazillion possible regex techniques was the fastest to use in each particular situation.

So I decided to keep things as simple as possible under the assumption that even interpreted code could be relatively fast if the underlying byte-code that was generated didn't have to do much. I generally tried to keep the number of data accesses to a minimum, even if that involved writing more lines of code. As you're probably aware, speed has nothing to do with how many lines of code you write; it's how many CPU instructions and memory accesses take place, etc. that counts.

My general design principles were, as far as possible:
- Convert the input string to a byte array, as BS seemed to process arrays of bytes fairly efficiently.
- Examine each character (byte) in turn, once and only once, with no lookahead, no copying, no stacks, etc., although I did make use of recursion in BS.
- When examining characters (bytes) to check whether they were number characters, whitespace characters, escape characters, etc., instead of having a comparison with multiple conditions (if char=space OR char=tab OR char=carriage-return OR char=newline), I just used a look-up table indexed by the character value (if whitespaceTable[byte]) - see the sketch after this list.
- Stay away from BS roRegex and roString functions; although they'd save lines of code, I just didn't think all the stuff they have going on behind the scenes made up for that. [Subject to what I said above about how you use them being a significant factor.]
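
A minimal sketch of that look-up-table idea, with invented names (not the actual JSONDecoder code):

function makeWhitespaceTable() as object
    ' 256-entry table: one boolean per possible byte value
    table = CreateObject("roArray", 256, true)
    for i = 0 to 255
        table[i] = false
    end for
    table[9] = true   ' tab
    table[10] = true  ' newline
    table[13] = true  ' carriage return
    table[32] = true  ' space
    return table
end function

function skipWhitespace(bytes as object, pos as integer, ws as object) as integer
    ' one table look-up per byte instead of four comparisons
    while pos < bytes.Count()
        if not ws[bytes[pos]] then exit while
        pos = pos + 1
    end while
    return pos
end function

' Usage, assuming the input was loaded once into a byte array:
'   bytes = CreateObject("roByteArray") : bytes.FromAsciiString(jsonText)
'   pos = skipWhitespace(bytes, 0, makeWhitespaceTable())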

I ran some tests on various-sized data sets and found that my parser was significantly faster on large data sets than the others, and generally comparable on smaller data sets. Then Roku released their own native Json parser, so I didn't need mine any more.

Parsing Json, however, is way easier than parsing Xml. I can see why you'd want to save writing code by using regex's in your case. Just don't expect it to be very fast.