jbrave
Channel Surfer
04-15-2011
12:59 AM
html parsing with Regex
I'm attempting to use this:
"/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\".*?\"|'.*?'|[^'\">\s]+))?)+\s*|\s*)\/?>/i"
which looks like this:
"/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\"+chr(34)+".*?\"+chr(34)+"|"+chr(39)+".*?"+chr(39)+"|[^"+chr(39)+"\"+chr(34)+">\s]+))?)+\s*|\s*)\/?>/i"
to parse html. Now as I understand it, using:
test=createobject("roregex","/<\/?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\"+chr(34)+".*?\"+chr(34)+"|"+chr(39)+".*?"+chr(39)+"|[^"+chr(39)+"\"+chr(34)+">\s]+))?)+\s*|\s*)\/?>/i","m")
result=test.match(html)
where html is an HTML page, should give me an array of all tags in the HTML, but I'm getting nothing back.
- Joel
Screenshades: The first Screensaver for Roku2!
Musiclouds: The best free internet music, on your Roku!
Ouroborialis: Psychedelic Screensaver for Roku!
2 REPLIES


Roku Employee
04-15-2011
09:44 AM
Re: html parsing with Regex
I see a couple of syntactic things. The leading / and the trailing /i should be omitted. roRegEx doesn't need the / delimiters around your expression, and the i (for case-insensitive) belongs in the third argument with the m. Also, you don't need to escape the forward slashes within your expression. As for the expression itself, I haven't had nearly enough caffeine yet today to even attempt it.
test = CreateObject("roregex","</?\w+((\s+(\w|\w[\w-]*\w)(\s*=\s*(?:\"+chr(34)+".*?\"+chr(34)+"|"+chr(39)+".*?"+chr(39)+"|[^"+chr(39)+"\"+chr(34)+">\s]+))?)+\s*|\s*)/?>","im")
I don't know the context of what you're trying to do, but I wonder if roRegEx.Split() might be more useful. If all you want are the tags from a piece of HTML, splitting on "<" and then working on the smaller chunks might be easier.
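A rough sketch of the split-on-"<" idea, in case it helps. GetTags is a made-up helper name, and this assumes every "<" in the page opens a tag and that no attribute value contains ">", which real HTML (comments, inline scripts) will break. Also worth knowing: as far as I'm aware, Match() hands back the first match followed by its captured groups, not a list of every match in the string, so some kind of loop is needed either way.

```brightscript
' Sketch only: collect the tags in an HTML string by splitting on "<".
' GetTags is a hypothetical helper; html is assumed to be a String.
function GetTags(html as String) as Object
    tags = CreateObject("roArray", 16, true)
    splitter = CreateObject("roRegex", "<", "")
    chunks = splitter.Split(html)          ' pieces of the page between "<" characters

    ' If the page does not start with "<", the first piece is plain text, not a tag.
    skip = (Left(html, 1) <> "<")

    for each chunk in chunks
        if skip then
            skip = false
        else
            pos = Instr(1, chunk, ">")     ' 1-based position of the closing ">", 0 if absent
            if pos > 0 then tags.Push("<" + Left(chunk, pos))
        end if
    end for
    return tags
end function
```

Each entry in the returned array is then a single tag such as a start tag with its attributes, which is a much smaller string to run a simpler regex against.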
kbenson
Visitor
04-15-2011
02:08 PM
Re: html parsing with Regex
The first rule of using a regex to parse HTML is don't do it.
The second rule of using a regex to parse HTML is PLEASE don't do it.
Below you'll find your regex in expanded form (a wonderful advancement in regexes: all whitespace is ignored and comments can be introduced), with comments added. Maybe that will help you spot something.
I generally find it useful to reduce the regex to the bare minimum, send it against some test content, and keep adding back in stuff I think should work bit by bit until it breaks.
/
<                       # Begin tag
\/?                     # Optional slash denoting a closing element
\w+                     # Tag name
(                       # MATCH 1
  (                     # MATCH 2, repeated once per attribute
    \s+                 # REQUIRED spaces
    (\w|\w[\w-]*\w)     # MATCH 3 (attribute), Single char or multiple chars with optional hyphen bordered by chars
    (                   # MATCH 4, OPTIONAL
      \s*=\s*           # = with optional bordering spaces
      (?:               # Non-capturing group
        \".*?\"         # Non-greedy match of double quoted string
        |
        '.*?'           # Non-greedy match of single quoted string
        |
        [^'\">\s]+      # NOT single or double quote, right angle, or space 1+ times
      )
    )?
  )+
  \s*|\s*               # Optional trailing spaces, OR (no attributes at all) just optional spaces
)
\/?                     # Optional ending slash denoting self closing element
>
/xi
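Here is roughly what the reduce-and-rebuild approach could look like on the Roku side. This is only a sketch: TestPatterns, the sample string, and the intermediate patterns are made up for illustration, and the patterns are written without the / delimiters, with the i flag passed as the third argument. chr(34) and chr(39) stand in for the quote characters, the same way the original post does it.

```brightscript
' Sketch only: try progressively larger pieces of the pattern against a
' small sample and see at which step the result stops being what you expect.
sub TestPatterns()
    q = chr(34)   ' double quote
    a = chr(39)   ' single quote
    sample = "<a href=" + q + "x.html" + q + " class=" + a + "nav" + a + ">link</a>"

    patterns = CreateObject("roArray", 4, true)
    ' 1. tag opening and name only
    patterns.Push("</?\w+")
    ' 2. a whole tag, but without any attributes
    patterns.Push("</?\w+\s*/?>")
    ' 3. add bare attribute names
    patterns.Push("</?\w+(\s+\w[\w-]*)*\s*/?>")
    ' 4. add optional attribute values (double quoted, single quoted, or unquoted)
    patterns.Push("</?\w+(\s+\w[\w-]*(\s*=\s*(" + q + ".*?" + q + "|" + a + ".*?" + a + "|[^" + a + q + ">\s]+))?)*\s*/?>")

    for each p in patterns
        re = CreateObject("roRegex", p, "i")
        m = re.Match(sample)      ' entry 0 is the whole match, later entries are the groups
        if m.Count() > 0 then
            print p; " -> "; m[0]
        else
            print p; " -> no match"
        end if
    end for
end sub
```

If a step suddenly prints "no match" or matches a different part of the sample than the step before it, the piece added in that step is the one to look at.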
-- GandK Labs
Check out Reversi! in the channel store!