Roku Community

btpoole · ‎01-05-2016

I have a situation where I am forced to use an HTML source that has no XML structure. I am pulling out the bits of information that fall between the <h1></h1> tags, using Instr and Mid. My problem is that I can't figure out how to run through the complete file: the script runs, hits the first set of tags, pulls the info, then exits. I'm just not sure how to get the script to pick up each <h1></h1> pair. In XML I know I can use For Each to cycle through. Any suggestions?

NewManLiving · ‎01-05-2016

Off the top of my head, I would consider walking through the string with the Instr and Mid functions, checking the index that Instr returns each time. Or you could possibly split it into an array of lines using Tokenize, or a combination of Replace and Tokenize.
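The Instr/Mid walk suggested above might look roughly like the sketch below. ExtractBetween is a hypothetical helper name, not something from the original posts. It assumes the tags appear exactly as written: Instr is case-sensitive, and an opening tag with attributes such as <h1 class="x"> would not be found. The key point is that each search starts from where the previous match ended instead of from the beginning of the string.

```brightscript
' Hypothetical helper: collects every substring found between openTag and closeTag.
Function ExtractBetween(text As String, openTag As String, closeTag As String) As Object
    results = []
    position = 1                                  ' Instr and Mid use 1-based indexes
    While True
        startIdx = Instr(position, text, openTag) ' 0 means "not found"
        If startIdx = 0 Then Exit While
        startIdx = startIdx + Len(openTag)        ' first character after the opening tag
        endIdx = Instr(startIdx, text, closeTag)
        If endIdx = 0 Then Exit While             ' unclosed tag: stop
        results.Push(Mid(text, startIdx, endIdx - startIdx))
        position = endIdx + Len(closeTag)         ' resume the search after the closing tag
    End While
    Return results
End Function

Sub Main()
    html = "<h1>First Header</h1><div><h1>Second Header</h1></div>"
    For Each header In ExtractBetween(html, "<h1>", "</h1>")
        Print header
    End For
End Sub
```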

My Channels: 2D API Framework Presentation: https://owner.roku.com/add/2M9LCVC
Updated: 11-11-2015 - Completed Keyboard interface
The Joel Channel ( Final Beta )

belltown · ‎01-05-2016

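Sub Main ()
    html = "<h1>First Header</h1><div><h11>Blah<h1>Second Header</h1></h11><h1>Third Header</h1><h2>H2 Header<h2>"
    list = []
    ' Lazily skip to the next <h1>, capture its contents in group 1 and the rest of the document in group 2.
    ' Flags: "i" = case-insensitive, "s" = dot also matches newlines.
    re = CreateObject ("roRegex", ".*?<h1>(.*?)</h1>(.*)", "is")
    ma = re.Match (html)
    ' A successful match holds the whole match plus both groups (3 entries).
    While ma.Count () > 2
        list.Push (ma [1])
        ' Re-run the regex on the remainder to find the next header.
        ma = re.Match (ma [2])
    End While
    Print list
End Sub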

NewManLiving · ‎01-05-2016

Nice example - one for the toolbox


btpoole · ‎01-06-2016

Very nice and helpful. Thank you greatly. I have not used regular expressions much in other scripts; I thought they were a little confusing. I will have to look into them more.
Thanks again

EnTerr · ‎01-06-2016

Why don't you just give it the XML treatment with roXmlElement.parse()? And then walk the tree as if it were XML.

That will work with many HTML documents, and most of the remaining ones can be pre-patched to fix their issues before the XML parse.
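A minimal sketch of the parse-and-walk idea, assuming the HTML happens to be well-formed enough for roXMLElement to parse. CollectByName is a hypothetical helper name, and the sample markup is made up for illustration. The walk is recursive because GetNamedElements only looks at direct children, while the <h1> elements may sit at any depth.

```brightscript
' Hypothetical helper: recursively gathers the text of every element with the given name.
Sub CollectByName(node As Object, tagName As String, found As Object)
    If LCase(node.GetName()) = tagName Then found.Push(node.GetText())
    children = node.GetChildElements()
    If children = invalid Then Return   ' no child elements under this node
    For Each child In children
        CollectByName(child, tagName, found)
    End For
End Sub

Sub Main()
    ' Illustrative input only; real pages may need pre-patching before they parse.
    html = "<html><body><h1>First Header</h1><div><h1>Second Header</h1></div></body></html>"
    xml = CreateObject("roXMLElement")
    If xml.Parse(html) Then
        found = []
        CollectByName(xml, "h1", found)
        Print found
    Else
        Print "Parse failed; the markup is not well-formed XML"
    End If
End Sub
```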

quartern · ‎06-18-2016

Sorry for commenting on such an old post

"EnTerr" wrote:
Why don't you just give it the XML treatment with roXmlElement.parse()? And then walk the tree as if it were XML.

That will work with many HTML documents, and most of the remaining ones can be pre-patched to fix their issues before the XML parse.

Without knowing what the parse is failing on, how would I know what to pre-patch?

Private apps: IsraTV (replaces IsraIBA, IsraNews2, IsraI24, Isra10, Isra20)
Users - to report issues with the app (not content of streams please) send me a tweet - @quartern_roku and follow (so we can DM)

EnTerr · ‎06-18-2016

"quartern" wrote:
Sorry for commenting on such an old post

No sweat - the dev forum moves at near light speed.
I mean there is time dilation: to an external stationary observer, it seems a couple of years pass between, e.g., a bug being reported and fixed.

Without knowing what the parse is failing on, how would I know what to pre-patch?

Well, the "pre-patching" idea implies that you actually know in advance what's wrong with the HTML that makes it un-parsable. In other words, you can't feed it any random document from the World Wild Web. As for how to figure out where it fails, you can either

use a validator to sniff out what's fishy, e.g. https://validator.w3.org/ - or

use the fact that roXmlElement.parse() returns a partial result - call genXml() on that and see which sections are missing; mutate the HTML, rinse & repeat...

It can be tricky but it is doable. Note that using roRegex also implies knowing the document structure.
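The partial-result trick could be sketched as follows. ShowParsedPortion is a hypothetical helper name. The sketch relies on the behavior described in the post above (that the element still holds whatever was parsed before the failure), which is taken here as an assumption and not verified against the documentation.

```brightscript
' Hypothetical helper: prints what the parser managed to build, to help spot where it gave up.
Sub ShowParsedPortion(html As String)
    xml = CreateObject("roXMLElement")
    If xml.Parse(html) Then
        Print "Parsed cleanly; nothing to pre-patch"
    Else
        ' Assumption from the post above: a failed Parse() leaves a partial tree behind.
        ' Compare this output with the source to see which sections are missing.
        Print xml.GenXML(False)   ' False = omit the XML declaration header
    End If
End Sub
```

Once the offending section is identified, patch the HTML string (for example with a Replace or a regex substitution) and run the check again.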

Parse But Not XML
