jeroenvanrensen's avatar

How to check if a URL is an article or just a web page?

Hi everyone,

I've got a complex question: how can I check whether a URL points to an article or just a regular web page?

For example, nos.nl is a web page, but nos.nl/artikel/2339900-kabinet-past-corona-spoedwet-aan-na-kritiek.html is a textual article.

I want to recreate something like Pocket: when you save a web page, it should automatically know whether it's a text article or a regular page.

I can't say it's an article just because there's an article tag on the page, and I can't just count the words, so what can I do?

Thank you! Jeroen

deepu07's avatar

@jeroenvanrensen if you can get the URL's extension, I guess you can write custom validation based on it.
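
A rough sketch of that idea, assuming plain PHP with no framework. The function name `guessFromUrl` and the extension list are made up for illustration. The URL alone can only rule out obvious non-articles (a bare domain, an image or video file); it cannot confirm that a page is an article, so the function returns null when it cannot decide.

```php
<?php
// Hypothetical helper: classify a URL by its path alone, before fetching it.
// Returns false for obvious non-articles and null when the URL cannot tell.
function guessFromUrl(string $url): ?bool
{
    // parse_url() needs a scheme to recognise the host, so add one if missing.
    if (!preg_match('#^https?://#i', $url)) {
        $url = 'https://' . $url;
    }

    $path = parse_url($url, PHP_URL_PATH);
    if (!is_string($path)) {
        $path = ''; // no path at all, or an unparseable URL
    }

    // A bare domain such as nos.nl is a homepage, not an article.
    if (trim($path, '/') === '') {
        return false;
    }

    // Assumed list of extensions that do not point at an HTML article.
    $nonArticle = ['jpg', 'jpeg', 'png', 'gif', 'zip', 'mp4', 'css', 'js'];
    $extension = strtolower(pathinfo($path, PATHINFO_EXTENSION));
    if (in_array($extension, $nonArticle, true)) {
        return false;
    }

    return null; // undecided: the page content has to be inspected
}
```

For the two URLs from the question, `guessFromUrl('nos.nl')` returns false and the nos.nl/artikel/... URL returns null, so the second one still needs a content-based check.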

jeroenvanrensen's avatar

Hi @deepu07,

I could do that, but I want it to work for every website on the internet. Do you know some algorithm I could follow/use?

Jeroen

frankielee's avatar

Maybe this will work: check the Content-Type header of the response after requesting the web page.

$client = new \GuzzleHttp\Client();
$response = $client->request('GET', 'https://nos.nl/');
// Print the Content-Type header of the response
echo $response->getHeaderLine('Content-Type');
DennisEilander's avatar

Hi @jeroenvanrensen,

Most (modern) websites make use of the Open Graph protocol (ogp.me). If so, you can use a web scraper to check whether the metadata of that page contains an og:type property: <meta property="og:type" content="article">.

Based on this meta tag, you can determine if the page is an article, or what kind of type the object is.

However, this only works for websites that use the Open Graph protocol.
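
As a minimal sketch of the og:type check, assuming the HTML has already been downloaded and PHP's DOM extension is enabled, something like the following could work. The function name `getOpenGraphType` is hypothetical. It returns null when the tag is missing, which is exactly the case where this check cannot help.

```php
<?php
// Hypothetical helper: return the og:type of an HTML document, or null if absent.
function getOpenGraphType(string $html): ?string
{
    // loadHTML() rejects an empty string, so bail out early.
    if (trim($html) === '') {
        return null;
    }

    $dom = new DOMDocument();
    // Real-world HTML is rarely valid; collect parser warnings silently.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Select the content attribute of <meta property="og:type" ...>.
    $xpath = new DOMXPath($dom);
    $nodes = $xpath->query('//meta[@property="og:type"]/@content');
    if ($nodes === false || $nodes->length === 0) {
        return null;
    }

    return strtolower(trim($nodes->item(0)->nodeValue));
}

// Usage with an inline document instead of a fetched page.
$html = '<html><head><meta property="og:type" content="article"></head><body></body></html>';
echo getOpenGraphType($html);
```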

Maybe you need to add multiple checks to determine if the page is an article.
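
One possible way to combine several checks is a simple score, sketched below. This is only an illustration: the function name `looksLikeArticle`, the weights, and the 250-word threshold are guesses, not tested values. It assumes the DOM extension is enabled and the HTML has already been fetched. The idea is that no single signal (og:type, an article tag, or the amount of paragraph text) decides on its own; at least two have to agree.

```php
<?php
// Hypothetical scoring heuristic; weights and thresholds are illustrative guesses.
function looksLikeArticle(string $html, int $minWords = 250): bool
{
    if (trim($html) === '') {
        return false;
    }

    $dom = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    $xpath = new DOMXPath($dom);

    $score = 0;

    // Signal 1: the page declares itself an article via Open Graph.
    $ogType = $xpath->evaluate('string(//meta[@property="og:type"]/@content)');
    if (strtolower(trim($ogType)) === 'article') {
        $score += 2;
    }

    // Signal 2: the markup contains an <article> element.
    if ($xpath->query('//article')->length > 0) {
        $score += 1;
    }

    // Signal 3: the paragraphs hold a substantial amount of text.
    // str_word_count() is locale-dependent, so this count is approximate.
    $words = 0;
    foreach ($xpath->query('//p') as $paragraph) {
        $words += str_word_count($paragraph->textContent);
    }
    if ($words >= $minWords) {
        $score += 2;
    }

    // No single signal reaches 3, so at least two signals must agree.
    return $score >= 3;
}
```

With these weights, a page that only sets og:type to article, with no article tag and little text, is not classified as an article.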

I have no experience with web scrapers, but I think you can use this package for that: https://github.com/dweidner/laravel-goutte

bharathkumar@recosenselabs.com's avatar

india[dot]com/video-gallery/ is not an article, but its meta tag declares it as one.

ederson's avatar

You say you want it to work for every page...

Unless the site uses Open Graph, as @denniseilander said, it can be next to impossible to do (unless you are Google).

You could use machine learning to teach your script what an article is. No idea how, though...

Looking for keywords in the webpage text could have some success.

RuskinF's avatar

This ought to tell you what the web page is about: use an AI-based crawler that scans a page on the Internet and reports details about that page back to your system.

As @ederson mentioned, you would have to teach your script what an article is using machine learning.
