Need code suggestion to identify similar news articles

Published 11 months ago by mathewparet

I need to identify similar news articles from among, say, thousands. All the articles are saved in a table with title, description, and date published (the date might vary by +/- one day).

Ideally, I am trying to identify whether a particular news story is reported by multiple news sources and, if so, group those articles together.

Is there a way I can accomplish this without using AI?


Can you explain in more detail what data you have and how you are able to verify if one article is the same as another?


I download normal RSS feeds and save them into a table (I save the title and description). Based on the data stored in these fields, I need to identify whether there is a duplicate entry for a news story (not an exact duplicate record).

For example, source A reports "Dog landed on moon for the first time". Source B reports "Crown, a dog, lands on the moon". I need to identify that both of these are the same news. How do I do that?



I can give you a rough idea of how to do that. If you need to identify similar text, you can use PHP's similar_text() function, which can report the similarity as a percentage through its optional third argument. Then you set a threshold for the percentage you will accept. If the similarity is more than 50%, you can treat it as a similar post.


Of course, there are a few other ways to solve this one.
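As a rough sketch of that threshold idea: the snippet below is in Python and uses the standard library's difflib.SequenceMatcher as a stand-in for similar_text(). The two use different matching algorithms, so the scores are comparable but not identical. The function names and the 50% default are only illustrative assumptions.

```python
from difflib import SequenceMatcher


def similarity_percent(a: str, b: str) -> float:
    """Return a 0-100 character-level similarity score for two strings.

    Stand-in for PHP's similar_text() percentage; scores will differ somewhat.
    """
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() * 100


def is_similar(a: str, b: str, threshold: float = 50.0) -> bool:
    """Treat two titles as the same story if the score exceeds the threshold."""
    return similarity_percent(a, b) > threshold


if __name__ == "__main__":
    title_a = "Dog landed on moon for the first time"
    title_b = "Crown, a dog, lands on the moon"
    print(round(similarity_percent(title_a, title_b), 1), is_similar(title_a, title_b))
```

Because the comparison is character-based, unrelated headlines that happen to share common words and letter runs can still score fairly high, so the threshold usually needs tuning against real data.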


I've tried that already, but it doesn't help. I get a lot of false positives even with the threshold at about 65%.

So I am looking for an alternative.
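One non-AI alternative that might be worth trying, offered only as a sketch: compare articles by the words they share rather than by characters, and only compare articles published within a day of each other. The Python below scores titles with Jaccard overlap of their word sets and then groups matching articles. The stopword list, the 0.25 threshold, and all function names are assumptions made for illustration, not something tested on real feeds.

```python
import re
from datetime import date

# Tiny illustrative stopword list (an assumption); a real one would be longer.
STOPWORDS = {"a", "an", "the", "on", "in", "of", "for", "to", "and", "is"}


def tokens(text):
    """Lowercase the text and return its set of non-stopword words."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {w for w in words if w not in STOPWORDS}


def jaccard(a, b):
    """Jaccard similarity of two texts: shared words / all distinct words."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def group_articles(articles, threshold=0.25, max_days=1):
    """Group (title, published_date) pairs that look like the same story.

    Returns a list of groups, each a list of indexes into `articles`.
    """
    parent = list(range(len(articles)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(articles)):
        for j in range(i + 1, len(articles)):
            title_i, date_i = articles[i]
            title_j, date_j = articles[j]
            close_in_time = abs((date_i - date_j).days) <= max_days
            if close_in_time and jaccard(title_i, title_j) >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i in range(len(articles)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


if __name__ == "__main__":
    sample = [
        ("Dog landed on moon for the first time", date(2020, 1, 1)),
        ("Crown, a dog, lands on the moon", date(2020, 1, 2)),
        ("Stock markets fall sharply", date(2020, 1, 1)),
    ]
    print(group_articles(sample))
```

Every pair is compared, which is fine for a few thousand rows but grows quadratically. Including the description alongside the title, or weighting rare words more heavily (TF-IDF), would likely reduce false positives further.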

Have you looked at Elasticsearch?


I see.

Check this, then, if you are interested in using a package.


Thanks @jlrdw & @tisuchi

They sound promising. Let me try them out.
