I have multiple PDF files, and I want to extract a specific set of pages from each one; the set of pages differs from file to file. I have created a dictionary whose keys are the PDF file names and whose values are the lists of pages to extract from the corresponding file. I intend to extract the given pages from each PDF and write them all to one new PDF file, so that I can do data extraction on this final file. I have tried PyPDF4 as well as FPDF, but no joy as yet: I get either a large PDF with blank pages, a PDF with just 1 or 2 pages extracted, or an error that the PDF object cannot be found. I am hoping to get some guidance on where I am going wrong with my approach. Below is my code:
import PyPDF4
from PyPDF4 import PdfFileReader, PdfFileWriter
for pdf,pgs in dic_11_1.items():
    pdf=list(dic_11_1.keys())
    pgs=list(dic_11_1.values())
    for i in range(0,len(pdf)):
        pages = pgs[i]
        object = open(pdf[i],'rb')
        pdfinput=PyPDF4.PdfFileReader(object,'rb')
        if pdfinput.isEncrypted:
            pdfinput.decrypt('')
        else:
            pdfinput
        for p in pages:
            page=pdfinput.getPage(p)
            pdf_writer=PyPDF4.PdfFileWriter()
            pdf_writer.addPage(page)
            with open('F111.pdf',mode='wb') as output:
                pdf_writer.write(output)
The error that I get is 'PdfReadError: Could not find object.'
When I try FPDF with the following code, it runs a long time and gives me a large empty pdf file:
from fpdf import FPDF
import os
for pdf,pgs in dic_11_1.items():
    pdf_in=open(pdf,'rb')
    inputpdf=PdfFileReader(pdf_in,'rb')
    if inputpdf.isEncrypted:
        inputpdf.decrypt('')
    else:
        inputpdf
    for p in pgs:
        content=inputpdf.getPage(p).extractText()
        pdf = FPDF('P','mm','A4')
        pdf.add_page()
        pdf.set_font("arial", size = 10)
        for text in content:
            text2=text.encode('latin-1', 'replace').decode('latin-1')
            pdf.write(10,text2)
            pdf.ln(8)
        pdf.close()
        return_byte_string=pdf.output('F_11_1.pdf','S').encode('latin-1')
        pdf_file=open('F_11_1.pdf','wb')
        pdf_file.write(return_byte_string)
        pdf_file.close()
Any guidance would be greatly appreciated. Thank you in advance
CodePudding user response:
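Well, the problem is that you're not iterating properly: inside the loop over dic_11_1.items() you rebuild lists of all the keys and values and loop over them a second time, and you create a new PdfFileWriter inside a loop instead of using a single writer for all pages. See the comments in the code for a better understanding.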
UPD. PyPDF4 seems to add only page references to PdfFileWriter until the output is actually written to a file, so the input files can be closed only at the end. As a result, this method won't work for a large number of files. On Linux the number of open files is restricted by ulimit, which defaults to 1024, so at most about 1020 input files can be open at once, because the output file and the three system streams (stdin, stdout, stderr) also count toward the limit. The limit can be raised with ulimit -n <count>.
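To check the descriptor limit from Python before opening many inputs, something like the following minimal sketch may help. It uses the standard `resource` module, which is available on Unix-like systems only. The helper names `input_file_budget` and `current_input_file_budget` are made up for illustration, and the default of four reserved descriptors (one output file plus stdin, stdout, stderr) is an assumption; a real process may hold more.

```python
import resource  # Unix-only standard library module


def input_file_budget(soft_limit, reserved=4):
    """Estimate how many input files fit under `soft_limit`.

    `reserved` is an assumed count of descriptors already in use:
    one output file plus stdin, stdout and stderr.
    """
    return max(soft_limit - reserved, 0)


def current_input_file_budget(reserved=4):
    """Return the estimate for this process, or None if the limit is unlimited."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return None
    return input_file_budget(soft, reserved)


if __name__ == "__main__":
    print("Approximate number of input files that can be open at once:",
          current_input_file_budget())
```

If the dictionary holds more files than this estimate, the limit would need to be raised, or the work split into several smaller output files that are merged afterwards.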
from PyPDF4 import PdfFileReader, PdfFileWriter

sect_11_1 = [
    ('filename1.pdf', 0, 1),
    ('filename2.pdf', 0, 1),
]
# note pages are zero-numbered
dic_11_1 = {}
# if sect_11_1 is of another format
# [('filename', [0, 1]), ...]
# remove star below
for filename, *pages in sect_11_1:
    dic_11_1.setdefault(filename, [])
    dic_11_1[filename].extend(pages)
# if filenames are not repeated and sect_11_1 has the
# [('filename', [0, 1]), ...] format, you can do just
# dic_11_1 = dict(sect_11_1)

# you need single writer for all files, don't declare it in a loop
pdf_writer = PdfFileWriter()
open_files = []
try:
    for filename, pages in dic_11_1.items():
        # now you have filename and pages set to 'filename1.pdf' and [0, 1]
        # on second iteration they'll be set to 'filename2.pdf' and [0, 1]
        # ...
        # don't use `object` as variable name: it's valid, but bad style
        # (it shadows builtin `object`)
        src = open(filename, 'rb')
        open_files.append(src)
        # the second positional argument of PdfFileReader is `strict`,
        # not a file mode, so 'rb' is not passed here
        pdfinput = PdfFileReader(src)
        if pdfinput.isEncrypted:
            pdfinput.decrypt('')
        # you don't need empty `else`
        for p in pages:
            # you might want to use `p - 1` instead if your input was 1-numbered
            page = pdfinput.getPage(p)
            pdf_writer.addPage(page)
    # when all pages are added, write to output
    with open('F111.pdf', mode='wb') as output:
        pdf_writer.write(output)
finally:  # if something was wrong, do it anyway
    for f in open_files:
        f.close()  # we shouldn't keep files open after program run
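
The manual `open_files` list with `try`/`finally` above could also be expressed with `contextlib.ExitStack` from the standard library. The following is only a sketch of that pattern, not part of PyPDF4: the helper name `with_all_open` and its `use_files` callback are hypothetical. The callback would be the place to build the readers, add the pages and write the output, so that everything happens while the inputs are still open.

```python
from contextlib import ExitStack


def with_all_open(filenames, use_files):
    """Open every file in `filenames` for binary reading, pass the list of
    open file objects to `use_files`, and return its result.

    All files are closed when the `with` block exits, even if
    `use_files` raises an exception.
    """
    with ExitStack() as stack:
        handles = [stack.enter_context(open(name, 'rb')) for name in filenames]
        return use_files(handles)


# Hypothetical usage with the names from the code above:
#
# def merge(handles):
#     for src, pages in zip(handles, dic_11_1.values()):
#         ...  # create a reader for src, add the selected pages to pdf_writer
#     ...      # write pdf_writer to the output file before returning
#
# with_all_open(list(dic_11_1.keys()), merge)
```

The open-file limit still applies here, since all inputs stay open until the callback returns; the helper only takes over the bookkeeping for closing them.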
