If not RegEx, what's the right way to parse the HTML content?

It looks like using regular expressions for operations such as find and replace on HTML isn't encouraged [Ref: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags].

I'd like to know whether there's another, better way to parse user-generated content and keep it safe. Of course, I'm aware of DomDocument and the Purify libraries, which help to a certain extent. But let's say you have to perform the following tasks (I could do a few of these using DomDocument):

Extract @mentioned users from the text.
Get oEmbedable links from the text.
Add 'nofollow' only to the outgoing links.
Convert http links into https, etc.

What's your approach to parsing HTML? I'd really appreciate your suggestions. Thanks!

Cronix

8 years ago

Level 67

I wouldn't worry about regex in your use case (a one-time conversion of a database). See the 2nd answer to the SO question you linked to.

DomDocument is a little more forgiving, though, depending on the HTML you're trying to parse. Coming up with a be-all and end-all regex isn't easy, given all the inconsistencies you're likely to find in user-generated content.
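
For what it's worth, here is a rough sketch of how a few of the tasks from the question might look with DomDocument. The function name processUserHtml, the $siteHost parameter, and the @mention pattern are all made up for illustration, not something from this thread. It assumes UTF-8 input and that overwriting any existing rel attribute is acceptable. The regex only ever runs on plain text nodes, never on the markup itself.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper (names and mention pattern are assumptions):
 * upgrades http links to https, adds rel="nofollow" to links pointing at
 * other hosts, and collects @mentions from text nodes.
 */
function processUserHtml(string $html, string $siteHost): array
{
    $doc = new DOMDocument();

    // User markup is often malformed; collect libxml warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    // The meta tag is a common way to make libxml treat the input as UTF-8.
    $doc->loadHTML('<meta http-equiv="Content-Type" content="text/html; charset=utf-8">' . $html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('a') as $link) {
        $href = $link->getAttribute('href');

        // Blindly upgrading assumes the target actually serves https.
        if (stripos($href, 'http://') === 0) {
            $href = 'https://' . substr($href, 7);
            $link->setAttribute('href', $href);
        }

        // Relative links have no host and are treated as internal.
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && strcasecmp($host, $siteHost) !== 0) {
            $link->setAttribute('rel', 'nofollow'); // replaces any existing rel value
        }
    }

    // Scan text nodes only, skipping link text and code samples.
    $mentions = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//body//text()[not(ancestor::a) and not(ancestor::code)]') as $text) {
        if (preg_match_all('/(?<![\w@])@(\w{1,30})/u', $text->nodeValue, $m)) {
            $mentions = array_merge($mentions, $m[1]);
        }
    }

    // Serialize only the children of <body>, dropping the html/head wrapper libxml added.
    $out = '';
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body !== null) {
        foreach ($body->childNodes as $child) {
            $out .= $doc->saveHTML($child);
        }
    }

    return ['html' => $out, 'mentions' => array_values(array_unique($mentions))];
}
```

You would call it as processUserHtml($postBody, 'example.com') and store the returned 'html' and 'mentions'. It does not sanitize anything, so it would still need to run alongside a purifier.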

spekkionu

8 years ago

Level 48

You might want to take a look at the Symfony DomCrawler component

http://symfony.com/doc/current/components/dom_crawler.html

If you also have the CssSelector component installed, you can use CSS selectors to query the DOM tree, which is a lot easier than writing XPath expressions or trying to find things directly with DOMDocument.

http://symfony.com/doc/current/components/css_selector.html
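
As a minimal sketch of what that could look like for the "oEmbedable links" task: the helper name extractCandidateLinks is hypothetical, and it assumes both components were installed with Composer (symfony/dom-crawler and symfony/css-selector) and that the autoloader lives in ./vendor. It only collects absolute link URLs. Deciding which of them are actually oEmbed-capable is a separate step that isn't shown here.

```php
<?php
// Assumes: composer require symfony/dom-crawler symfony/css-selector
require __DIR__ . '/vendor/autoload.php';

use Symfony\Component\DomCrawler\Crawler;

/**
 * Hypothetical helper: returns the unique absolute URLs linked from the HTML,
 * as candidates to check against a list of oEmbed providers.
 */
function extractCandidateLinks(string $html): array
{
    $crawler = new Crawler($html);

    // filter() takes a CSS selector and requires the CssSelector component.
    $hrefs = $crawler->filter('a[href^="http"]')->each(
        fn (Crawler $node): ?string => $node->attr('href')
    );

    // Drop empty values and duplicates, then reindex.
    return array_values(array_unique(array_filter($hrefs)));
}
```

The same selector-based approach should work for the other link tasks too, for example picking out the anchors that need a 'nofollow'.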
