Dec 4, 2014

Level 10

How to fetch title from HTML/XML

So I would like to fetch the title from a URL that the user enters and show it as normal text. How would you do that?

Level 23

Here is one way I have seen, using regular expressions to parse the HTML: http://w3guy.com/php-retrieve-web-page-titles-meta-tags/

DarkRoast

11 years ago

Level 8

It's not recommended to parse HTML with regex, even for simple things. See these posts:

http://blog.codinghorror.com/parsing-html-the-cthulhu-way/

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags

You could use PHP's built-in DOMDocument.

Example using DOMDocument to extract the title element from some html:

$title = '';
$dom = new DOMDocument();
if (@$dom->loadHTMLFile($url))
{
  $elements = $dom->getElementsByTagName('title');
  if ($elements->length > 0)
  {
    $title = $elements->item(0)->textContent;
  }
}

Symfony's DomCrawler component would be another option - it's also included in Laravel:

use Symfony\Component\DomCrawler\Crawler;

$crawler = new Crawler($html);
$title = $crawler->filterXPath('//title')->text();

1 like

noeldiaz

11 years ago

Level 23

Yeah, the regex stuff was the first that came to mind. The proper way would be to use the crawler mentioned above or go for Guzzle/Goutte solutions.

Here is one interesting approach I just found (you made me curious about this): http://zrashwani.com/simple-web-spider-php-goutte/

sitesense

11 years ago

Level 19

You can do some great stuff with DOMDocument but just pulling the page title, cmon...

The overhead isn't worth it. Read this: http://blog.futtta.be/2014/05/01/php-html-parsing-performance-shootout-regex-vs-dom/

Then do this: https://stackoverflow.com/questions/399332/fastest-way-to-retrieve-a-title-in-php

<?php
    function page_title($url) {
        // file_get_contents() returns false when the page can't be fetched.
        $html = file_get_contents($url);
        if (!$html)
            return null;

        // Allow attributes on the opening tag, e.g. <title id="x">.
        $res = preg_match("/<title[^>]*>(.*)<\/title>/siU", $html, $title_matches);
        if (!$res)
            return null;

        // Clean up title: remove EOLs and excessive whitespace.
        $title = preg_replace('/\s+/', ' ', $title_matches[1]);
        $title = trim($title);
        return $title;
    }
?>

Further, DOMDocument will give you problems if the page isn't well formed, and the same goes for pages in different character encodings.

This is an ideal candidate for regular expressions over DOM.

DarkRoast

11 years ago

Level 8

@sitesense Are you sure this isn't a case of premature optimisation? What if later on you want to grab some other data from the page? With regards to the performance, if that is an issue you could always implement some form of caching.

For badly formed HTML there is Tidy: http://php.net/manual/en/book.tidy.php
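
If the Tidy extension isn't available, DOMDocument on its own is usually tolerant of broken markup. Here is a minimal sketch of that alternative (not Tidy itself), assuming the HTML has already been fetched into a string. The function name extract_title() is made up for this example; libxml_use_internal_errors() replaces the @ operator so parse warnings are collected quietly instead of being printed.

```php
<?php
// Sketch: read the <title> from possibly malformed HTML using DOMDocument.
// extract_title() is a hypothetical helper name, not a PHP built-in.
function extract_title(string $html): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() does not accept empty input
    }

    // Collect libxml parse warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $elements = $dom->getElementsByTagName('title');
    if ($elements->length === 0) {
        return null;
    }

    // Collapse line breaks and runs of whitespace into single spaces.
    $title = trim(preg_replace('/\s+/', ' ', $elements->item(0)->textContent));

    return $title === '' ? null : $title;
}
```

Calling extract_title() on a page with unclosed tags should still return the title, and it returns null when there is no title element at all. Character encodings are a separate problem that this sketch does not address.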

1 like

sitesense

11 years ago

Level 19

@WookieMonster I could combat that with "YAGNI"... you aren't gonna need it :)

https://en.wikipedia.org/wiki/You_aren't_gonna_need_it

However, if it's the case that a user enters a url manually and wants to retrieve the title, then speed/resource use is no worry.

"implement some form of caching"... I don't think you've thought this through. We are talking about random websites that we take the title from and then abandon. Caching is the worst thing you could do :)

Just use the right tool for the job.

The overhead of DOMDocument + Tidy just to grab a page title seems way overboard to me.

If we programmed to "what if", we'd never complete a project :)

My philosophy... don't overthink!

1 like

DarkRoast

11 years ago

Level 8

It's definitely a large performance gain that shouldn't be ignored. I wrote a scraper in Python a few years ago that used regex over a DOM parser to squeeze out extra performance... :)

1 like

sasafister

11 years ago

Level 10

Thanks guys, I tried DOMDocument and it serves me well, but for some pages I can't parse the title. Also, what if I want to fetch an excerpt from a blog post? Where can I find the element I want to fetch? I tried to find other elements, but I'm not good at it at this point.

DarkRoast

11 years ago

Level 8

You might find the Symfony DomCrawler easier to use (see my first post); it uses XPath expressions to identify elements. You can find the XPath of an element in Chrome by opening the inspector, right-clicking the node you want and selecting "Copy XPath".
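
To give a rough idea of what an XPath lookup looks like, here is a small sketch using PHP's built-in DOMXPath rather than DomCrawler, so it runs without Composer. The sample markup, the class names "post" and "post-excerpt", and the helper first_xpath_text() are all invented for illustration; a real page will need whatever XPath the inspector gives you.

```php
<?php
// Sketch: return the text of the first node matching an XPath expression.
// first_xpath_text() is a hypothetical helper, not part of PHP or Symfony.
function first_xpath_text(string $html, string $expression): ?string
{
    if (trim($html) === '') {
        return null; // loadHTML() does not accept empty input
    }

    // Keep libxml parse warnings quiet while loading real-world HTML.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Normalise whitespace in the matched node's text.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

// Made-up markup standing in for a blog post page.
$html = <<<'HTML'
<html>
  <head><title>My blog</title></head>
  <body>
    <div class="post">
      <h1>First post</h1>
      <p class="post-excerpt">A short summary of the post.</p>
      <p>The full text follows here.</p>
    </div>
  </body>
</html>
HTML;

$excerpt = first_xpath_text($html, '//p[@class="post-excerpt"]');
$heading = first_xpath_text($html, '//div[@class="post"]/h1');
```

The same expressions should work with DomCrawler's filterXPath(), since both are plain XPath.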

1 like
