@jeroenvanrensen if you think you can extract the extension from the URL, I guess you could write custom validation based on it.
How to check if a URL is an article or just a web page?
Hi everyone,
I've got a complex question: How to check if a URL is an article or just a web page?
For example, nos.nl is a web page, but nos.nl/artikel/2339900-kabinet-past-corona-spoedwet-aan-na-kritiek.html is a textual article.
I want to recreate something like Pocket: when you save a web page, it automatically knows whether it's a text article or a regular page.
I can't call it an article just because there's an article tag on the page, and I can't just count the words, so what can I do?
Thank you! Jeroen
Hi @jeroenvanrensen,
Most (modern) websites make use of the OpenGraph protocol (ogp.me).
If so, you can use a web scraper to check whether the metadata of that page contains an og:type property: <meta property="og:type" content="article">.
Based on this meta tag, you can determine whether the page is an article or some other type of object.
However, this only works for websites which are using the OpenGraph protocol.
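To make the og:type check concrete, here is a minimal sketch in Python using only the standard library. The names `OpenGraphTypeParser`, `get_og_type` and `is_article` are made up for illustration, and the sketch assumes you have already downloaded the page's HTML as a string.

```python
from html.parser import HTMLParser


class OpenGraphTypeParser(HTMLParser):
    """Remembers the content of the first <meta property="og:type"> tag."""

    def __init__(self):
        super().__init__()
        self.og_type = None

    def handle_starttag(self, tag, attrs):
        # Only the first og:type meta tag is of interest.
        if tag != "meta" or self.og_type is not None:
            return
        attributes = dict(attrs)
        if (attributes.get("property") or "").lower() == "og:type":
            self.og_type = (attributes.get("content") or "").strip().lower()


def get_og_type(html):
    """Return the og:type value of the page, or None if the tag is missing."""
    parser = OpenGraphTypeParser()
    parser.feed(html)
    return parser.og_type


def is_article(html):
    """True only when the page explicitly declares og:type = article."""
    return get_og_type(html) == "article"
```

A `None` result from `get_og_type` means the page does not declare a type at all, which is different from declaring a non-article type such as `website`, so you may want to handle those two cases separately.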
Maybe you need to add multiple checks to determine if the page is an article.
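As a rough idea of what combining multiple checks could look like, the Python sketch below collects two independent signals from the HTML: the og:type meta tag and, as an assumption on my side, schema.org JSON-LD data that declares an article type. The JSON-LD check, the list of types, and all function names are illustrative choices, not something confirmed to work for every site.

```python
import json
from html.parser import HTMLParser

# Assumption: these schema.org types are treated as "article-like".
ARTICLE_SCHEMA_TYPES = {"Article", "NewsArticle", "BlogPosting"}


class ArticleSignalParser(HTMLParser):
    """Collects the og:type value and the raw JSON-LD script bodies."""

    def __init__(self):
        super().__init__()
        self.og_type = None
        self.json_ld_blocks = []
        self._in_json_ld = False

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("property") or "").lower() == "og:type":
            self.og_type = (attributes.get("content") or "").strip().lower()
        elif tag == "script" and (attributes.get("type") or "").lower() == "application/ld+json":
            self._in_json_ld = True
            self.json_ld_blocks.append("")

    def handle_data(self, data):
        if self._in_json_ld:
            self.json_ld_blocks[-1] += data

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_json_ld = False


def _schema_types(node):
    """Yield every "@type" string found anywhere in a parsed JSON-LD value."""
    if isinstance(node, dict):
        declared = node.get("@type")
        if isinstance(declared, str):
            yield declared
        elif isinstance(declared, list):
            yield from (t for t in declared if isinstance(t, str))
        for value in node.values():
            yield from _schema_types(value)
    elif isinstance(node, list):
        for item in node:
            yield from _schema_types(item)


def article_signals(html):
    """Return a dict mapping each check to whether it points to an article."""
    parser = ArticleSignalParser()
    parser.feed(html)
    json_ld_hit = False
    for block in parser.json_ld_blocks:
        try:
            data = json.loads(block)
        except ValueError:
            continue  # Ignore malformed JSON-LD instead of failing.
        if any(t in ARTICLE_SCHEMA_TYPES for t in _schema_types(data)):
            json_ld_hit = True
    return {"og_type": parser.og_type == "article", "json_ld": json_ld_hit}


def looks_like_article(html):
    """Treat the page as an article if any single signal says so."""
    return any(article_signals(html).values())
```

Returning the individual signals makes it easy to add more checks later or to require two or more matching signals instead of just one.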
I have no experience with web scrapers, but I think you can use this package for that: https://github.com/dweidner/laravel-goutte