
thebigk

If not RegEx, what's the right way to parse the HTML content?

It looks like using regular expressions for operations like find and replace on HTML isn't encouraged [ Ref: https://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags ].

I'd like to know if there's another, better way to parse user-generated content and keep it safe. Of course, I'm aware of DomDocument and the Purify libraries, which help to a certain extent. But let's say you have to perform the following tasks (I could do a few of these using DomDocument):

  1. Extract @mentioned users from the text.
  2. Get oEmbed-able links from the text.
  3. Add 'nofollow' only to outgoing links.
  4. Convert http links to https, etc.

What's your approach to parsing HTML? I'd really appreciate your suggestions. Thanks!

Cronix

I wouldn't worry about using regex in your use case (a one-time conversion of a database). See the 2nd answer to the SO question you linked to.

DomDocument is a little more forgiving, though, depending on the HTML you're trying to parse. Coming up with a be-all and end-all regex parser isn't easy, given all the inconsistencies you're likely to find in "user generated content".
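
For what it's worth, here is a rough sketch of how a few of the listed tasks (mentions, nofollow on outgoing links, http to https) might look with DomDocument. The function name `processUserHtml`, the `$ownHost` parameter, and the @mention pattern are all made up for illustration, so treat it as a starting point rather than a finished solution. Note that a regex is still used, but only on text nodes, never on the tags themselves.

```php
<?php
// Sketch only: processUserHtml(), $ownHost and the @mention pattern are
// hypothetical names/choices for illustration, not something from this thread.
function processUserHtml(string $html, string $ownHost): array
{
    libxml_use_internal_errors(true); // tolerate malformed user markup
    $doc = new DOMDocument();
    // The XML prolog is a common workaround so libxml reads the input as UTF-8.
    $doc->loadHTML('<?xml encoding="UTF-8"><div id="root">' . $html . '</div>');
    libxml_clear_errors();

    $xpath = new DOMXPath($doc);

    // 1. Collect @mentions from text nodes only, skipping link and code text.
    $mentions = [];
    foreach ($xpath->query('//text()[not(ancestor::a) and not(ancestor::code)]') as $text) {
        if (preg_match_all('/(?<![\w@])@(\w{1,30})/u', $text->nodeValue, $m)) {
            $mentions = array_merge($mentions, $m[1]);
        }
    }

    // 2. Rewrite links through the DOM instead of pattern-matching tags.
    foreach ($xpath->query('//a[@href]') as $a) {
        $href = $a->getAttribute('href');
        if (strncasecmp($href, 'http://', 7) === 0) {
            $href = 'https://' . substr($href, 7);
            $a->setAttribute('href', $href);
        }
        // Links with a host other than our own are treated as outgoing.
        $host = parse_url($href, PHP_URL_HOST);
        if (is_string($host) && strcasecmp($host, $ownHost) !== 0) {
            $rel = preg_split('/\s+/', $a->getAttribute('rel'), -1, PREG_SPLIT_NO_EMPTY);
            if (!in_array('nofollow', $rel, true)) {
                $rel[] = 'nofollow';
            }
            $a->setAttribute('rel', implode(' ', $rel));
        }
    }

    // Serialize only the children of the wrapper div, not the html/body shell.
    $root = $xpath->query('//div[@id="root"]')->item(0);
    $out = '';
    foreach ($root->childNodes as $child) {
        $out .= $doc->saveHTML($child);
    }

    return ['html' => $out, 'mentions' => array_values(array_unique($mentions))];
}
```

Candidate oEmbed links could probably be collected in the same `//a[@href]` loop. Keep in mind this does no sanitizing at all, so you would still want to run the content through a purifier library separately.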
