bearbytestudio

Advice on best way to accurately scrape data from multiple PDF invoice templates

Hi all, I have a client project that needs to import a large batch of PDF invoices, extract the data from them, and accurately populate the Invoice model with fields such as Invoice Number, Supplier, and Due Date. It also needs to populate the LineItems model with quantity, item cost, net cost, taxes, etc. The tricky part is that the invoices come from many different suppliers, who all use different invoice templates. Some fields don't even match between suppliers.

How should I go about handling this? I'm currently extracting the text from each invoice, then sending it to the OpenAI API along with the JSON structure I want returned. This works well 80%-90% of the time, but it needs to be perfect.
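To make the 10%-20% of bad extractions visible, one option is to sanity-check the returned JSON before it reaches the models. The following is only a rough sketch in Python, not my actual code: the field names (`invoice_number`, `net_total`, `line_items`, and so on), the tolerance, and the `check_invoice` helper are all hypothetical, and it assumes the API response has already been reduced to a JSON string.

```python
import json
from decimal import Decimal, InvalidOperation

# Hypothetical field names; the real JSON structure may differ.
REQUIRED_INVOICE_FIELDS = ("invoice_number", "supplier", "due_date", "net_total", "line_items")
REQUIRED_LINE_FIELDS = ("quantity", "item_cost", "net_cost")
TOLERANCE = Decimal("0.01")  # assumed rounding allowance


def check_invoice(raw_json: str) -> list[str]:
    """Return a list of problems; an empty list means the basic checks passed."""
    try:
        data = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]

    problems = [
        f"missing field: {name}"
        for name in REQUIRED_INVOICE_FIELDS
        if data.get(name) in (None, "")
    ]

    items = data.get("line_items") or []
    if not isinstance(items, list):
        return problems + ["line_items is not a list"]

    line_sum = Decimal("0")
    for i, item in enumerate(items):
        if not isinstance(item, dict):
            problems.append(f"line {i}: not an object")
            continue
        missing = [f for f in REQUIRED_LINE_FIELDS if item.get(f) in (None, "")]
        if missing:
            problems.append(f"line {i}: missing {', '.join(missing)}")
            continue
        try:
            qty = Decimal(str(item["quantity"]))
            cost = Decimal(str(item["item_cost"]))
            net = Decimal(str(item["net_cost"]))
        except InvalidOperation:
            problems.append(f"line {i}: non-numeric value")
            continue
        # Each line should satisfy quantity * item_cost == net_cost.
        if abs(qty * cost - net) > TOLERANCE:
            problems.append(f"line {i}: quantity * item_cost does not match net_cost")
        line_sum += net

    # The line items should add up to the invoice net total.
    if data.get("net_total") not in (None, ""):
        try:
            net_total = Decimal(str(data["net_total"]))
        except InvalidOperation:
            problems.append("net_total is not numeric")
        else:
            if abs(line_sum - net_total) > TOLERANCE:
                problems.append("line items do not sum to net_total")

    return problems
```

Anything that comes back with a non-empty list could be held for manual review instead of being saved. This only catches structural and arithmetic mistakes; a wrong but internally consistent value would still pass.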

All advice welcome!

nexxai

There isn't going to be a "perfect" solution so you can basically stop looking. There are services that claim to be extremely accurate (Microsoft Azure Cognitive Services Document Intelligence for one), but at the end of the day, OCRing text from a PDF and then also understanding its context is not a solvable problem (in polynomial time, at least) because by definition, it's not structured data.