Need code suggestion to identify similar news articles

Published 3 weeks ago by mathewparet

I need to identify similar "news" articles from among, say, thousands. I have all the articles saved in a table with title, description, and date published (which might vary by plus or minus one day between sources).

Ideally, I am trying to identify whether a particular news story is reported by multiple news sources and, if so, group those articles together.

Is there a way I can accomplish this without using AI?

mattsplat

Can you explain in more detail what data you have and how you are able to verify if one article is the same as another?

mathewparet

I download and save normal RSS feeds into a table (I save the title and description). Based on the data stored in these fields, I need to identify whether there is a duplicate entry for the same news story (not an exact duplicate record).

For example, source A reports "Dog landed on moon for the first time". Source B reports "Crown, a dog, lands on the moon". I need to identify that both of these are the same news story. How do I do that?

tisuchi
3 weeks ago (265,395 XP)

@mathewparet

I can give you a rough idea of how to do that. If you need to identify similar text, you can use PHP's similar_text() function. Then set a threshold for the similarity percentage you will accept. If the similarity is more than 50%, you can treat the two as similar posts.

Ref: http://php.net/manual/en/function.similar-text.php
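The similar_text() idea described above could look roughly like the sketch below. The helper name `isSimilarHeadline` and its default threshold are assumptions made for illustration; only `similar_text()` itself is a built-in PHP function.

```php
<?php
// Hypothetical helper (not from any library): returns true when two
// headlines are at least $threshold percent similar per similar_text().
function isSimilarHeadline(string $a, string $b, float $threshold = 50.0): bool
{
    // Normalise case and surrounding whitespace before comparing.
    $a = strtolower(trim($a));
    $b = strtolower(trim($b));

    // similar_text() writes the similarity percentage (0 to 100)
    // into its third argument, which is passed by reference.
    $percent = 0.0;
    similar_text($a, $b, $percent);

    return $percent >= $threshold;
}
```

The threshold is a tuning knob: raising it reduces false positives but will also miss rewritten headlines, so it is worth testing against a sample of real titles.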

Of course, there are a few other ways to solve this.

mathewparet

I've tried that already, but it doesn't help. I get a lot of false positives even at a threshold of about 65%.

So I am looking for an alternative.
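One possible non-AI alternative, not suggested by anyone in this thread, is to compare the sets of words in two titles instead of their characters, using Jaccard similarity (shared words divided by all distinct words). The sketch below is only an illustration under assumptions: the function names and the tiny stop-word list are hypothetical, and exact word matching still treats "landed" and "lands" as different words, so some form of stemming would probably be needed in practice.

```php
<?php
// Hypothetical helper: turn a headline into a list of unique lowercase
// words, dropping a small illustrative stop-word list.
function headlineTokens(string $text): array
{
    $stopWords = ['a', 'an', 'the', 'on', 'in', 'of', 'for', 'to', 'and'];

    // Split on any run of characters that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

    return array_values(array_unique(array_diff($words, $stopWords)));
}

// Jaccard similarity: |A intersect B| / |A union B|, from 0.0 to 1.0.
function jaccardSimilarity(string $a, string $b): float
{
    $tokensA = headlineTokens($a);
    $tokensB = headlineTokens($b);

    $union = array_unique(array_merge($tokensA, $tokensB));
    if (count($union) === 0) {
        return 0.0; // Nothing to compare (both titles empty or all stop words).
    }

    $intersection = array_intersect($tokensA, $tokensB);

    return count($intersection) / count($union);
}
```

Because the published date may differ by a day between sources, it may also help to compare only articles whose dates are within one day of each other, which keeps the number of pairwise comparisons down.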

jlrdw
3 weeks ago (200,050 XP)

Have you looked at Elasticsearch? https://www.elastic.co/products/elasticsearch

tisuchi
3 weeks ago (265,395 XP)

I see.

Then check this out if you are interested in using a package: https://github.com/atomescrochus/laravel-string-similarities

mathewparet

Thanks @jlrdw & @tisuchi

They sound promising. Let me try them out.
