Get all headings (chapter structure) out of a PDF

Published 4 months ago by niseku

Hi

I want to build a tool where users of my application can read a PDF, write a comment for each chapter, and answer some questions about that chapter.

Now my question is: is there a way to get all headings (the chapter structure) out of the uploaded PDF so I can create a form for each chapter?

Thanks for your help!

Best Answer (As Selected By niseku)
skliche

A "basic" PDF file doesn't even know what a heading is. Simply put, the file contains "instructions" for what text should be drawn at a certain position on a page in a given font and size. Based on the font size and other criteria, you might be able to conclude that a piece of text is a heading. But what looks like text when you view a file could also be a vector or raster image.
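To illustrate the font-size idea, here is a minimal sketch. It assumes you already have text runs with their font sizes from some PDF text extractor; the `Span` shape, the `guess_headings` name, and the `ratio` threshold are all made up for this example and are not part of any library. It treats the font size that covers the most characters as body text and flags anything noticeably larger as a heading candidate.

```python
from collections import Counter
from typing import NamedTuple


class Span(NamedTuple):
    """One run of text as an extractor might report it (hypothetical shape)."""
    page: int
    text: str
    font_size: float


def guess_headings(spans, ratio=1.2):
    """Return spans whose font size is at least `ratio` times the body size.

    The body size is assumed to be the font size covering the most characters.
    """
    chars_per_size = Counter()
    for span in spans:
        chars_per_size[span.font_size] += len(span.text)
    if not chars_per_size:
        return []
    body_size = chars_per_size.most_common(1)[0][0]
    return [s for s in spans if s.text.strip() and s.font_size >= body_size * ratio]


# Hand-written sample data standing in for real extractor output.
sample = [
    Span(1, "1 Introduction", 18.0),
    Span(1, "This document describes the problem in some detail.", 11.0),
    Span(2, "2 Method", 18.0),
    Span(2, "We proceed as follows, step by step, over several lines.", 11.0),
    Span(2, "Figure 1: a caption", 9.0),
]

for heading in guess_headings(sample):
    print(heading.page, heading.text)
```

This is only a heuristic: large text on a title page, pull quotes, or headings set in bold at body size will be misclassified, so you would probably want to let users correct the detected chapter list.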

PDF 1.3 introduced features for incorporating structural information into a PDF file (called "logical structure"), and PDF 1.4 introduced tagged PDF. Generating PDF files through a printer driver usually loses all the semantic information. A program that is able to create accessible PDF files will generate tagged PDF files.

Even if you have tagged PDF files, you'd still need a library that is able to extract that information. Before you ask: I'm not aware of any.
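If you do get access to the structure tree of a tagged PDF, pulling out the headings is a plain tree walk. The sketch below runs on a hand-built nested dict, not on a real file: the keys `S` (structure type) and `K` (kids) mirror the entries used in PDF structure elements, and `H`, `H1` to `H6` are the standard heading types, but the `text` key is a simplification (in a real file the text lives in marked content on the page) and `collect_headings` is a hypothetical helper.

```python
# Standard heading structure types in tagged PDF.
HEADING_TAGS = {"H", "H1", "H2", "H3", "H4", "H5", "H6"}


def collect_headings(element, found=None):
    """Depth-first walk of a structure tree, collecting headings in document order."""
    if found is None:
        found = []
    tag = element.get("S", "")
    if tag in HEADING_TAGS:
        found.append((tag, element.get("text", "")))
    for kid in element.get("K", []):
        collect_headings(kid, found)
    return found


# Mock structure tree; a real one would have to come from a PDF library.
tree = {
    "S": "Document",
    "K": [
        {"S": "H1", "text": "Introduction"},
        {"S": "P", "text": "Some body text."},
        {
            "S": "Sect",
            "K": [
                {"S": "H2", "text": "Background"},
                {"S": "P", "text": "More body text."},
            ],
        },
        {"S": "H1", "text": "Method"},
    ],
}

for tag, text in collect_headings(tree):
    print(tag, text)
```

The heading level in each tag would let you nest sub-chapters under chapters when building one form per chapter.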

For more information have a look at https://www.w3.org/TR/WCAG20-TECHS/pdf_notes.html

The ISO PDF standard isn't free but you might want to read up on these topics in the older PDF 1.6 reference: https://www.adobe.com/devnet/pdf/pdf_reference_archive.html

mvd
4 months ago (22,440 XP)

Never used this library but you can try https://github.com/smalot/pdfparser

Features included:

    Load/parse objects and headers

niseku

@mvd I tested PDFParser, but I can't find a way to get all headings from the PDF. I think the "headers" mentioned in the feature list are more like HTTP headers than the headings in the document itself...

but thanks for your help

niseku

@skliche thanks for your very understandable answer
