Area of coding: PDF table of contents in Python 3 using PyPDF2
Problem: I need a program that can iterate through a union-typed variable: a list containing dictionaries interleaved with lists, where each of those lists in turn contains multiple dictionaries.
[
{},
[{}, {}, {}],
{},
[{}, {}, {}],
{},
[{}, {}, {}]
]
This pattern repeats multiple times.
Expected output: The output should look like this:
1 Title Goes Here
1.1 Title Goes Here
1.1.1 Title Goes Here
1.1.2 Title Goes Here
1.1.3 Title Goes Here
1.2 Title Goes Here
1.2.1 Title Goes Here
1.2.2 Title Goes Here
1.2.3 Title Goes Here
1.3 Title Goes Here
1.3.1 Title Goes Here
1.3.2 Title Goes Here
1.3.3 Title Goes Here
2 Title Goes Here
2.1 Title Goes Here
2.1.1 Title Goes Here
2.1.2 Title Goes Here
2.1.3 Title Goes Here
2.2 Title Goes Here
2.2.1 Title Goes Here
2.2.2 Title Goes Here
2.2.3 Title Goes Here
2.3 Title Goes Here
2.3.1 Title Goes Here
2.3.2 Title Goes Here
2.3.3 Title Goes Here
Program:
import argparse as arp
from PyPDF2 import PdfFileReader

parser = arp.ArgumentParser()
parser.add_argument("-f", "--file", help="File to analyse")
arg = parser.parse_args()
filename = arg.file

def fileread():
    doc = PdfFileReader(filename)
    ToC = doc.getOutlines()
    # ToC: Union[List[Union[Destination, list]], {__eq__}] = doc.getOutlines()
    for elements in ToC:
        #print(elements)
        #print("\n")
        try:
            if elements is {}:  # If the element is a dictionary just find the Title
                print(elements['/Title'])  # TODO: This is just skipped
            else:  # If the element is a list go through and print out the titles
                for nest_dict in elements:
                    try:
                        print(nest_dict["/Title"])
                    except:
                        continue
        except:
            continue

fileread()
I'm testing this program on: Compilers - Principles, Techniques, and Tools-Pearson_Addison Wesley (2006).pdf
Any help is much appreciated.
CodePudding user response:
This line is not right:
if elements is {}: # If the element is a dictionary just find the Title
It should instead read:
if isinstance(elements, dict):
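The reason the original check never matches is that `is` compares object identity, and `{}` builds a brand-new dictionary every time it is evaluated, so no existing object can ever be "the same object" as it. A minimal sketch of the difference, using made-up `entry` and `children` values as stand-ins for real outline items (which are assumed here to be dict-like):

```python
# Hypothetical stand-ins for outline items; not taken from a real PDF.
entry = {"/Title": "1 Introduction"}
children = [{"/Title": "1.1 Language Processors"}]

def is_identical_to_empty(obj):
    # Identity comparison against a freshly created dict: always False,
    # because the new dict is a different object from anything passed in.
    empty = {}
    return obj is empty

print(is_identical_to_empty(entry))  # False
print(is_identical_to_empty({}))     # False, even for another empty dict
print(isinstance(entry, dict))       # True: type check, what the code needs
print(isinstance(children, dict))    # False: lists take the else branch
```

With the identity check, every element falls into the else branch, which is why the top-level titles were skipped.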
CodePudding user response:
With the code below, I was able to get the following output from your PDF file:
Output:
1 Introduction
1.1 Language Processors
1.1.1 Exercises for Section 1.1
1.2 The Structure of a Compiler
...
2 A Simple Syntax-Directed Translator
2.1 Introduction
2.2 Syntax Definition
2.2.1 Definition of Grammars
...
Python code:
import argparse as arp
from PyPDF2 import PdfFileReader

parser = arp.ArgumentParser()
parser.add_argument("-f", "--file", help="File to analyse")
arg = parser.parse_args()
filename = arg.file

def fileread():
    doc = PdfFileReader(filename)
    ToC = doc.getOutlines()
    for elements in ToC:
        try:
            # Recursively print titles: a dict is a single entry,
            # anything else is treated as a list of nested entries.
            def print_title(input_data):
                if isinstance(input_data, dict):
                    print(input_data['/Title'])
                else:
                    for nest_dict in input_data:
                        try:
                            print_title(nest_dict)
                        except:
                            continue
            print_title(elements)
        except:
            continue

fileread()
I'm not an expert in Python, but I hope this helps. By the way, it may be worth reading up on recursion in Python.
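The code above prints numbered lines only because the titles in that particular PDF already include their section numbers. If the titles in another file do not, the same recursion can build the numbers itself. This is only a rough sketch: `flatten_outline` and the `sample` data are invented for illustration, plain dicts and lists stand in for the real outline objects, and it assumes that a list following a dictionary holds that dictionary's children, as in the pattern from the question.

```python
def flatten_outline(items, prefix=""):
    """Return (number, title) pairs for a nested outline.

    Assumes a list that follows a dict holds that dict's children.
    """
    rows = []
    counter = 0
    last_number = None  # number of the most recent dict at this level
    for item in items:
        if isinstance(item, dict):
            counter += 1
            last_number = f"{prefix}{counter}"
            rows.append((last_number, item["/Title"]))
        else:
            # Nested list: recurse, numbering children under the last entry.
            # If no entry came before it, stay at the current prefix.
            child_prefix = prefix if last_number is None else f"{last_number}."
            rows.extend(flatten_outline(item, prefix=child_prefix))
    return rows

# Made-up sample data in the same shape as the question's pattern.
sample = [
    {"/Title": "Intro"},
    [{"/Title": "Scope"}, [{"/Title": "Details"}], {"/Title": "Terms"}],
    {"/Title": "Basics"},
]

for number, title in flatten_outline(sample):
    print(number, title)
```

Returning a list of pairs instead of printing inside the recursion keeps the traversal separate from the formatting, so the same result could be indented or written to a file later.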
