sasafister

Ecommerce product matching from different websites

I have an interesting and complex problem.

I'm scraping a few websites for a price comparison/monitoring app. The problem is that N websites carry the same product but with different titles, SKUs, images, and categories, and I don't see how I could match product X from site A to the same (or at least 90% similar) product on site B.

What are your thoughts?

Dalma

Ideally you would look to map each of these back to a Manufacturer's SKU or model code. How and where you find it on the screens you are scraping will be the challenge.
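One place such a code sometimes hides is schema.org structured data embedded in the page. The sketch below assumes the shop publishes a JSON-LD `Product` block with fields such as `sku`, `mpn`, or `gtin13`; many shops do not, and `extractProductIdentifiers` is a hypothetical helper name, not part of any library. It also ignores nested `@graph` structures.

```php
<?php
// Hypothetical helper: collect product identifiers from schema.org JSON-LD
// blocks in a scraped page. Assumes the shop embeds such markup at all.
function extractProductIdentifiers(string $html): array
{
    if (trim($html) === '') {
        return [];
    }

    // Collect parser warnings internally; real-world HTML rarely validates.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $found = [];
    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//script[@type="application/ld+json"]') as $script) {
        $data = json_decode($script->textContent, true);
        if (!is_array($data)) {
            continue;
        }
        // A block may hold one object or a list of objects.
        $items = isset($data[0]) ? $data : [$data];
        foreach ($items as $item) {
            if (!is_array($item) || ($item['@type'] ?? null) !== 'Product') {
                continue;
            }
            foreach (['sku', 'mpn', 'gtin', 'gtin8', 'gtin12', 'gtin13', 'gtin14'] as $key) {
                if (isset($item[$key]) && is_scalar($item[$key])) {
                    $found[$key] = (string) $item[$key];
                }
            }
        }
    }

    return $found;
}
```

If this returns an `mpn` or a GTIN for the same product on two sites, that is a far stronger match signal than anything derived from the title.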

PatrickSJ

Use the https://www.upcitemdb.com/ API to look up scraped items.

Edit: When scraping, look for the UPC and Mfg Part#. You can also search by title, but the UPC and Mfg Part# are most likely to give you the matches you want.
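Scraped pages contain plenty of 12-digit numbers that are not UPCs, so it may help to filter candidates before looking them up. A minimal sketch, assuming you only care about 12-digit UPC-A codes; `isValidUpcA` is a name made up for this example, and it only applies the standard check-digit rule, it does not prove the code belongs to a real product.

```php
<?php
// Returns true when $code is a 12-digit UPC-A whose last digit matches
// the standard check-digit calculation.
function isValidUpcA(string $code): bool
{
    if (!preg_match('/^\d{12}\z/', $code)) {
        return false;
    }

    $sum = 0;
    for ($i = 0; $i < 11; $i++) {
        $digit = (int) $code[$i];
        // Positions 1, 3, 5, ... (indexes 0, 2, 4, ...) carry weight 3.
        $sum += ($i % 2 === 0) ? $digit * 3 : $digit;
    }

    // The check digit brings the weighted sum up to a multiple of 10.
    $check = (10 - $sum % 10) % 10;

    return $check === (int) $code[11];
}

var_dump(isValidUpcA('036000291452')); // bool(true)
var_dump(isValidUpcA('036000291453')); // bool(false)
```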

sikic

You need to calculate the Levenshtein distance between two strings. If you want to use a ready-made library and match by percentage, then https://github.com/wyndow/fuzzywuzzy is what you are looking for.

// calculates the similarity between two strings as a percentage
>>> $fuzz->ratio('this is a test', 'this is a test!')
=> 96
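If you would rather not pull in a library, PHP's built-in `levenshtein()` can be turned into a percentage. This is only a sketch: `levenshteinSimilarity` is a hypothetical helper, it normalizes by the longer string's length (one of several possible choices), and its scores will not necessarily match the library's `ratio()`. Note that `levenshtein()` works on bytes, so multibyte titles need extra care.

```php
<?php
// Hypothetical helper: similarity in percent, derived from the edit distance
// normalized by the length of the longer string.
function levenshteinSimilarity(string $a, string $b): float
{
    $maxLen = max(strlen($a), strlen($b));
    if ($maxLen === 0) {
        // Two empty strings are treated as identical.
        return 100.0;
    }

    return (1 - levenshtein($a, $b) / $maxLen) * 100;
}

// One insertion over a longest length of 15 characters.
printf("%.1f\n", levenshteinSimilarity('this is a test', 'this is a test!')); // 93.3
```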

jlrdw

Quote from an article:

Websites change their layouts often (breaking web scrapers) and APIs can change as well. The traditional process outlined here will require regular maintenance to ensure data integrity.

So it's not a program-it-and-forget-it thing.

sasafister

@dalma That would be very nice, but none of them have any kind of unique identifier. I couldn't even find the shop's SKU; the only identifier I have is the URL slug without an ID.

@patricksj Unfortunately, I don't have a UPC/MFG # or any such identifier.

@sikic This may work for a few sites since they have similar titles, but not all. Most of them use a title format such as "category name, size | brand" or add other random tags. This would result in many false positives.
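For anyone who wants to try anyway, one way to reduce the effect of word order and separators is to compare sets of tokens instead of raw strings. A rough sketch, assuming ASCII titles (non-ASCII characters are treated as separators here); `titleTokens` and `jaccardSimilarity` are names made up for this example.

```php
<?php
// Lowercase a title and split it into unique alphanumeric tokens.
function titleTokens(string $title): array
{
    $tokens = preg_split('/[^a-z0-9]+/', strtolower($title), -1, PREG_SPLIT_NO_EMPTY);

    return array_values(array_unique($tokens));
}

// Jaccard similarity: shared tokens divided by all distinct tokens (0.0 to 1.0).
function jaccardSimilarity(string $a, string $b): float
{
    $tokensA = titleTokens($a);
    $tokensB = titleTokens($b);

    $union = count(array_unique(array_merge($tokensA, $tokensB)));
    if ($union === 0) {
        return 0.0;
    }

    return count(array_intersect($tokensA, $tokensB)) / $union;
}

var_dump(jaccardSimilarity('Shoes, 42 | Acme', 'acme shoes 42')); // float(1)
var_dump(jaccardSimilarity('Acme Shoes 42', 'Acme Shoes 43'));    // float(0.5)
```

The second call shows the limit of this: two different sizes of the same shoe still share half of their tokens, so a threshold alone will not separate variants from true matches.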

@jlrdw Yeah, I've had such problems already. Within a month, one site pushed a redesign a few times. The responses are inconsistent: sometimes they contain an image, sometimes not, etc.

Thank you all for your help. I decided to take a different approach to this problem, so I won't use product matching. What I found, and what may be useful for some, is enterprise-grade software and services for this kind of product matching, which in my case is huge overkill and too expensive.
