Here is one way I have seen, using regular expressions to parse the HTML: http://w3guy.com/php-retrieve-web-page-titles-meta-tags/
How to fetch title from HTML/XML
So I would like to fetch the title from a URL that the user enters and show it as normal text. How would you do that?
It's not recommended to parse HTML with regex, even for simple things. See these posts:
http://blog.codinghorror.com/parsing-html-the-cthulhu-way/
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags
You could use PHP's built-in DOMDocument.
Example using DOMDocument to extract the title element from some HTML:
$title = '';
$dom = new DOMDocument();
if (@$dom->loadHTMLFile($url))
{
    $elements = $dom->getElementsByTagName('title');
    if ($elements->length > 0)
    {
        $title = $elements->item(0)->textContent;
    }
}
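If you already have the page source as a string, a variation on the above might look like the sketch below. It assumes the HTML has been fetched elsewhere, and the function name `extract_title` is made up for illustration. Instead of silencing warnings with `@`, it uses `libxml_use_internal_errors()` so that badly formed markup does not spam warnings. Note that `loadHTML()` can misread the encoding when the page does not declare a charset, so treat this as a starting point rather than a complete solution.

```php
<?php
// Hypothetical helper (not from the thread): extract the <title> text
// from an HTML string, returning null when no usable title is found.
function extract_title(string $html): ?string
{
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    // Collect libxml parse errors internally instead of emitting warnings,
    // then restore whatever setting was active before.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $elements = $dom->getElementsByTagName('title');
    if ($elements->length === 0) {
        return null;
    }

    // Collapse line breaks and runs of whitespace into single spaces.
    $title = trim(preg_replace('/\s+/', ' ', $elements->item(0)->textContent));

    return $title === '' ? null : $title;
}
```

Calling `extract_title($html)` with a page that has no title element gives `null`, so the caller can fall back to showing the URL itself.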
Symfony's DomCrawler component would be another option - it's also included in Laravel:
use Symfony\Component\DomCrawler\Crawler;

// $html holds the page's HTML source as a string
$crawler = new Crawler($html);
$title = $crawler->filterXPath('//title')->text();
Yeah, the regex stuff was the first thing that came to mind. The proper way would be to use the crawler mentioned above or go for Guzzle/Goutte solutions.
Here is one interesting approach I just found (you made me curious about this): http://zrashwani.com/simple-web-spider-php-goutte/
You can do some great stuff with DOMDocument, but just pulling the page title, come on...
The overhead isn't worth it. Read this: http://blog.futtta.be/2014/05/01/php-html-parsing-performance-shootout-regex-vs-dom/
Then do this: https://stackoverflow.com/questions/399332/fastest-way-to-retrieve-a-title-in-php
<?php
function page_title($url) {
    // Fetch the page source; file_get_contents() returns false on failure.
    $html = file_get_contents($url);
    if (!$html) {
        return null;
    }
    // Lazy, case-insensitive match across newlines; also allows attributes on <title>.
    if (!preg_match('/<title\b[^>]*>(.*?)<\/title>/si', $html, $title_matches)) {
        return null;
    }
    // Clean up title: remove EOLs and excessive whitespace.
    $title = preg_replace('/\s+/', ' ', $title_matches[1]);
    return trim($title);
}
?>
Further, with DOMDocument you'll have problems if the page isn't well formed, and the same goes for different character encodings.
This is an ideal candidate for regular expressions over DOM.
@sitesense Are you sure this isn't a case of premature optimisation? What if later on you want to grab some other data from the page? With regards to the performance, if that is an issue you could always implement some form of caching.
For badly formed HTML there is Tidy: http://php.net/manual/en/book.tidy.php
@WookieMonster I could combat that with "YAGNI"... you aren't gonna need it :)
https://en.wikipedia.org/wiki/You_aren't_gonna_need_it
However, if a user enters a URL manually and wants to retrieve the title, then speed/resource use is no worry.
"implement some form of caching"... I don't think you've thought this through. We are talking about random websites that we take the title from and then abandon. Caching is the worst thing you could do :)
Just use the right tool for the job.
The overhead of DOMDocument + Tidy just to grab a page title seems way overboard to me.
If we programmed to "what if", we'd never complete a project :)
My philosophy... don't overthink!
It's definitely a large performance gain that shouldn't be ignored. I wrote a scraper in Python a few years ago that used regex over a dom parser to squeeze out extra performance... :)
Thanks guys, I tried DOMDocument and it serves me well, but from some pages I can't parse the title. Also, what if I want to fetch an excerpt from a blog post? Where can I find the element I want to fetch? I tried to find other elements, but I'm not good at it at this point.
You might find the Symfony DomCrawler easier to use (see my first post); it uses XPath expressions to identify the element. You can find the XPath of an element in Chrome by bringing up the inspector, right-clicking the node you want and selecting "Copy XPath".
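To give a rough idea of how an XPath lookup works, here is a small sketch. It uses PHP's built-in `DOMXPath` rather than the Symfony crawler, purely so it runs without extra packages. The function name `first_text_by_xpath`, the sample markup and the `excerpt` class are all made up for illustration; on a real blog you would paste in whatever XPath the inspector gives you for the excerpt element.

```php
<?php
// Hypothetical helper (not from the thread): return the text of the first
// node matching an XPath expression, or null when nothing matches.
function first_text_by_xpath(string $html, string $expression): ?string
{
    if (trim($html) === '') {
        return null;
    }

    // Keep libxml quiet about malformed markup while parsing.
    $previous = libxml_use_internal_errors(true);
    $dom = new DOMDocument();
    $loaded = $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    if (!$loaded) {
        return null;
    }

    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query($expression);
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    // Normalise whitespace in the matched node's text.
    return trim(preg_replace('/\s+/', ' ', $nodes->item(0)->textContent));
}

// Made-up sample page: the "excerpt" class is an assumption for this demo.
$html = '<html><head><title>My post</title></head><body>'
      . '<div class="excerpt"><p>First paragraph.</p></div>'
      . '</body></html>';

echo first_text_by_xpath($html, '//div[@class="excerpt"]/p'), PHP_EOL;
```

The same helper works for the title (`//title`), so once you have the XPath for an element you can reuse one function for everything you want to pull from the page.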