Roku Developer Program

btpoole
Level 7

Parse But Not XML

I have a situation where I am forced to use an HTML source that has no XML structure. I am pulling out the bits of information that fall between the <h1></h1> tags, using Instr and Mid. My problem is that I can't figure out how to run through the complete file. The script runs, hits the first set of tags, pulls the info, then exits. I'm just not sure how to get the script to pick up each <h1></h1> pair. In XML I know I can use For Each to cycle through the elements. Any suggestions?
7 Replies
NewManLiving
Level 7

Re: Parse But Not XML

Off the top of my head, I would consider walking through the string with the Instr and Mid functions, checking the index Instr returns each time. Or you could possibly split it into an array of lines using Tokenize, or a combination of Replace and Tokenize.
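
A minimal sketch of the Instr/Mid walk described above, not a tested implementation. The function name ExtractBetween, its parameters, and the sample HTML are hypothetical. It assumes the global Instr() and Mid() functions (1-based positions, with Instr() returning 0 when nothing is found) and that the tags appear exactly as written, since the search is case-sensitive.

```brightscript
' Returns an array with the text found between each openTag/closeTag pair.
Function ExtractBetween(html As String, openTag As String, closeTag As String) As Object
    results = []
    pos = 1 ' the global Instr() and Mid() functions use 1-based positions
    While True
        openPos = Instr(pos, html, openTag)
        If openPos = 0 Then Exit While ' no more opening tags
        textPos = openPos + Len(openTag)
        closePos = Instr(textPos, html, closeTag)
        If closePos = 0 Then Exit While ' opening tag without a closing tag: stop
        results.Push(Mid(html, textPos, closePos - textPos))
        pos = closePos + Len(closeTag) ' resume the search after this closing tag
    End While
    Return results
End Function

Sub Main()
    ' Hypothetical sample input
    html = "<h1>First Header</h1><div><h1>Second Header</h1></div><h1>Third Header</h1>"
    For Each header In ExtractBetween(html, "<h1>", "</h1>")
        Print header
    End For
End Sub
```

The key point is that each search starts from where the previous closing tag ended, rather than from the beginning of the string, which is what lets the loop reach every <h1></h1> pair instead of stopping at the first.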
My Channels: 2D API Framework Presentation: https://owner.roku.com/add/2M9LCVC
Updated: 11-11-2015 - Completed Keyboard interface
The Joel Channel ( Final Beta )
belltown
Level 7

Re: Parse But Not XML

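Sub Main ()
    ' Sample input, including a look-alike <h11> tag and an unclosed <h2>
    html = "<h1>First Header</h1><div><h11>Blah<h1>Second Header</h1></h11><h1>Third Header</h1><h2>H2 Header<h2>"
    list = []
    ' Group 1 captures the text of the first <h1>...</h1>; group 2 captures everything after it
    ' Flags: "i" = case-insensitive, "s" = dot also matches newlines
    re = CreateObject ("roRegex", ".*?<h1>(.*?)</h1>(.*)", "is")
    ma = re.Match (html)
    ' A successful match has 3 items: the whole match plus the two groups
    While ma.Count () > 2
        list.Push (ma [1])
        ' Search again in the remainder of the string
        ma = re.Match (ma [2])
    End While
    Print list
End Sub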
https://github.com/belltown/
NewManLiving
Level 7

Re: Parse But Not XML

Nice example - one for the toolbox
btpoole
Level 7

Re: Parse But Not XML

Very nice and helpful. Thank you greatly. I have not used regular expressions much in other scripts; I thought they were a little confusing. I will have to look into them more.
Thanks again
EnTerr
Level 8

Re: Parse But Not XML

Why don't you just give it the XML treatment with roXMLElement.Parse()? Then walk the tree as if it were XML.

That will work with many HTML documents, and most of the remaining ones can be pre-patched for issues before the XML parse.
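
A rough sketch of that idea, assuming the HTML happens to be well-formed enough to parse as XML. The helper name CollectText and the sample document are hypothetical. It relies on the roXMLElement methods Parse(), GetName(), GetText(), and GetChildElements(), with GetChildElements() returning invalid for an element that has no child elements.

```brightscript
' Recursively collects the text of every element named tagName.
Sub CollectText(node As Object, tagName As String, results As Object)
    If LCase(node.GetName()) = LCase(tagName) Then results.Push(node.GetText())
    children = node.GetChildElements()
    If children <> invalid
        For Each child In children
            CollectText(child, tagName, results)
        End For
    End If
End Sub

Sub Main()
    ' Hypothetical sample input: every tag is closed, so it parses as XML
    html = "<html><body><h1>First Header</h1><div><h1>Second Header</h1></div></body></html>"
    xml = CreateObject("roXMLElement")
    If xml.Parse(html)
        headers = []
        CollectText(xml, "h1", headers)
        Print headers
    Else
        Print "Parse failed"
    End If
End Sub
```

Walking the tree recursively, rather than asking only the root for its named children, is what picks up <h1> elements nested inside other tags such as <div>.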
quartern
Level 7

Re: Parse But Not XML

Sorry for commenting on such an old post

"EnTerr" wrote:
Why don't you just give it the XML treatment with roXMLElement.Parse()? Then walk the tree as if it were XML.

That will work with many HTML documents, and most of the remaining ones can be pre-patched for issues before the XML parse.


Without knowing what the parse is failing on, how would I know what to pre-patch?
Private apps: IsraTV (replaces IsraIBA, IsraNews2, IsraI24, Isra10, Isra20)
Users - to report issues with the app (not content of streams please) send me a tweet - @quartern_roku and follow (so we can DM)
EnTerr
Level 8

Re: Parse But Not XML

"quartern" wrote:
Sorry for commenting on such an old post

No sweat - the dev forum moves at near light-speed.
I mean there is time dilation, and to an external stationary observer it seems a couple of years pass between, e.g., a bug being reported and fixed.

Without knowing what the parse is failing on how would I know what to pre-patch?

Well, the "pre-patching" idea implies that you actually know in advance what's wrong with the HTML that makes it un-parsable. In other words, you can't feed it any random document from the World Wild Web. As for how to figure out where it fails, you can either
  • use a validator to sniff out what's fishy, e.g. https://validator.w3.org/ - or
  • use the fact that roXMLElement.Parse() returns a partial result - call GenXML() on that and see which sections are missing; mutate the HTML, rinse and repeat...

It can be tricky, but it is doable. Note that using a roRegex also implies knowing the document structure.
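
The second option could be sketched roughly as follows. This is an illustration only: the sample HTML is hypothetical, and exactly how much of the document survives a failed parse is an assumption based on the description above, so treat the dump as a hint about where parsing stopped rather than a reliable result.

```brightscript
Sub Main()
    ' Hypothetical sample input: the <br> is never closed, so this is not well-formed XML
    html = "<html><body><h1>First Header</h1><br><h1>Second Header</h1></body></html>"
    xml = CreateObject("roXMLElement")
    If Not xml.Parse(html)
        ' Dump whatever the parser kept; GenXML(false) omits the <?xml ...?> header
        Print xml.GenXML(false)
    End If
End Sub
```

Comparing the dump against the original HTML shows which sections are missing, which points at the markup to patch before the next attempt.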