Published 7 months ago by niseku
I want to build a tool where users of my application can read a PDF, write a comment for each chapter, and answer some questions about that chapter.
Now my question is: is there a way to get all headings (the chapter structure) out of the uploaded PDF so that I can create a form for each chapter?
Thanks for your help!
A "basic" PDF file doesn't even know what a heading is. Simply put, the file contains "instructions" that say which text should be drawn at a certain position on a page in a given font and size. Based on the font size and other criteria, you might be able to conclude that a piece of text is a heading. But what looks like text when you view a file could also be a vector or raster image.
PDF 1.3 introduced features for incorporating structural information into the PDF file (called "logical structure"). PDF 1.4 introduced tagged PDF. Generating PDF files through a printer driver usually loses all the semantic information. A program that is able to create accessible PDF files will generate tagged PDF files.
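If you want a quick first guess at whether an uploaded file is tagged at all, a rough sketch could look like this. A tagged PDF has a `/StructTreeRoot` entry and a `/MarkInfo` dictionary in its document catalog, so the code simply searches the raw bytes for those two names. The function name `looks_tagged` and the sample byte strings are made up for this example. Note that this is a crude check under a strong assumption: if the catalog is stored in a compressed object stream (possible since PDF 1.5), the names are not visible in the raw bytes and the check reports a false negative. It also does not verify that `/Marked` is actually set to true.

```python
def looks_tagged(pdf_bytes: bytes) -> bool:
    """Crude check for the catalog entries a tagged PDF carries.

    Only finds the names when they are stored uncompressed in the file.
    """
    return b"/StructTreeRoot" in pdf_bytes and b"/MarkInfo" in pdf_bytes


# Hand-written fragments for illustration, not complete PDF files.
tagged = (
    b"%PDF-1.4\n1 0 obj\n"
    b"<< /Type /Catalog /MarkInfo << /Marked true >> /StructTreeRoot 5 0 R >>\n"
    b"endobj\n"
)
untagged = b"%PDF-1.4\n1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"

print(looks_tagged(tagged), looks_tagged(untagged))
```

For a real upload you would pass in the bytes of the file, read in binary mode. A positive result only tells you that structure information is probably present, not that the headings in it are usable.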
Assuming you have tagged PDF files, you'd still need a library that is able to extract that information. Before you ask, I'm not aware of any.
For more information, have a look at https://www.w3.org/TR/WCAG20-TECHS/pdf_notes.html
The ISO PDF standard isn't free but you might want to read up on these topics in the older PDF 1.6 reference: https://www.adobe.com/devnet/pdf/pdf_reference_archive.html