Roku Developer Program

jbrave
Channel Surfer

html parsing with Regex

I'm attempting to use this:


"/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i"


which, written as a BrightScript string with the quote characters spliced in via chr(), looks like this:


"/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\"+chr(34)+".*?\"+chr(34)+"|"+chr(39)+".*?"+chr(39)+"|[^"+chr(39)+"\"+chr(34)+">\s]+))?)+\s*|\s*)\/?>/i"


to parse HTML. As I understand it, using:


test=createobject("roregex","/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\"+chr(34)+".*?\"+chr(34)+"|"+chr(39)+".*?"+chr(39)+"|[^"+chr(39)+"\"+chr(34)+">\s]+))?)+\s*|\s*)\/?>/i","m")

result=test.match(html)


where html is an HTML page, should give me an array of all tags in the HTML, but I'm getting nothing back.

- Joel
Screenshades: The first Screensaver for Roku2!
Musiclouds: The best free internet music, on your Roku!
Ouroborialis: Psychedelic Screensaver for Roku!
RokuChris
Roku Employee

Re: html parsing with Regex

I see a couple of syntactic things. The leading / and the trailing /i should be omitted: roRegex doesn't need the / delimiters around your expression, and the i (for case-insensitive) belongs in the third argument alongside the m. Also, you don't need to escape the forward slashes within your expression. As for the expression itself, I haven't had nearly enough caffeine yet today to even attempt it.

test = CreateObject("roregex","</?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\"+chr(34)+".*?\"+chr(34)+"|"+chr(39)+".*?"+chr(39)+"|[^"+chr(39)+"\"+chr(34)+">\s]+))?)+\s*|\s*)/?>","im")


I don't know the context of what you're trying to do, but I wonder if roRegEx.Split() might be more useful. If all you want are the tags from a piece of HTML, splitting on "<" and then working on the smaller chunks might be easier.
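
The split idea could be sketched roughly like this. Note that roRegex.Match() returns only the first match (entry 0 is the whole match, later entries are the capture groups), so it won't hand back every tag on the page by itself. ExtractTags is a hypothetical helper name, not part of the SDK, and the sketch assumes no ">" appears inside a quoted attribute value.

```brightscript
' Hypothetical helper: collect raw tag text by splitting on "<" and keeping
' each chunk up to its first ">". Assumes attribute values contain no ">".
function ExtractTags(html as String) as Object
    tags = []
    splitter = CreateObject("roRegex", "<", "")

    ' Prepend a space so the first chunk is always the text before the
    ' first "<", even when the page starts with a tag.
    chunks = splitter.Split(" " + html)

    isFirst = true
    for each chunk in chunks
        if isFirst then
            isFirst = false          ' skip the text before the first "<"
        else
            closePos = Instr(1, chunk, ">")   ' 1-based, 0 when not found
            if closePos > 0 then
                tags.Push("<" + Left(chunk, closePos))
            end if
        end if
    end for

    return tags
end function
```

Each entry in the returned array is one complete tag string, which is small enough to pick apart with a much simpler expression than the original one.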
kbenson
Visitor

Re: html parsing with Regex

The first rule of using a regex to parse HTML is don't do it.
The second rule of using a regex to parse HTML is PLEASE don't do it.

Below you'll find your regex in expanded form, with comments (expanded form is a wonderful advancement in regexes: with the x flag, all whitespace is ignored and comments can be introduced). Maybe that will help you spot something.

I generally find it useful to reduce the regex to the bare minimum, send it against some test content, and keep adding back in stuff I think should work bit by bit until it breaks.
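
A rough sketch of that build-it-up approach in BrightScript, for what it's worth. The sample string, the intermediate patterns, and the TestPatterns name are all made up for illustration; only CreateObject("roRegex", ...) and Match() are actual roRegex calls.

```brightscript
' Hypothetical harness: run progressively larger pieces of the pattern
' against a small sample and print what each one matches.
sub TestPatterns()
    dq = Chr(34)   ' double quote
    sq = Chr(39)   ' single quote
    sample = "<a href=" + dq + "x.html" + dq + " class=" + sq + "nav" + sq + ">link</a>"

    ' Quoted or bare attribute value, built once so the steps stay readable
    value = "(?:" + dq + ".*?" + dq + "|" + sq + ".*?" + sq + "|[^" + sq + dq + ">\s]+)"

    patterns = [
        "<\w+",
        "</?\w+",
        "</?\w+\s*/?>",
        "</?\w+(\s+[\w-]+)*\s*/?>",
        "</?\w+(\s+[\w-]+(\s*=\s*" + value + ")?)*\s*/?>"
    ]

    for each p in patterns
        re = CreateObject("roRegex", p, "i")
        m = re.Match(sample)   ' empty array when nothing matches
        if m.Count() > 0 then
            print p; "  ->  "; m[0]   ' m[0] is the whole match
        else
            print p; "  ->  no match"
        end if
    end for
end sub
```

The first step whose output is not what you expected points at the piece of the expression to look at.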


/
< # Begin tag
\/? # Optional slash denoting a closing element
\w+ # Tag name
( # MATCH 1
( # MATCH 2
\s+ # REQUIRED spaces
(\w|\w[\w-]*\w) # MATCH 3 (attribute), Single char or multiple chars with optional hyphen bordered by chars
( # MATCH 4, OPTIONAL
\s*=\s* # = with optional bordering spaces
(?: # Non-capturing group
\".*?\" # Non-greedy match of double quoted string
|
'.*?' # Non-greedy match of single quoted string
|
[^'\">\s]+ # NOT single or double quote, right angle, or space 1+ times
)
)?
)+
\s*|\s* # Optional spaces after the attributes, OR only optional spaces (no attributes)
)
\/? # Optional ending slash denoting self closing element
>
/xi
-- GandK Labs
Check out Reversi! in the channel store!