Home > OS >  Python code to extract txt from PDF document
Python code to extract txt from PDF document

Time:01-25

I have been trying to convert some PDFs into .txt, but most sample codes I found online have the same issue: They only convert one page at a time. I am kinda new to python, and I am not finding how to write a substitute for the .GetPage() method to convert the entire document at once. All help is welcomed.

import PyPDF2
 
pdfFileObject = open(r"F:\pdf.pdf", 'rb')
 
pdfReader = PyPDF2.PdfFileReader(pdfFileObject)
 
print(" No. Of Pages :", pdfReader.numPages)
 
pageObject = pdfReader.getPage(0)
 
print(pageObject.extractText())
 
pdfFileObject.close()

CodePudding user response:

You could do this with a for loop. Extract the text from the pages in the loop and append them to a list.

import PyPDF2

pages_text=[]
with open(r"F:\pdf.pdf", 'rb') as pdfFileObject:
    pdfReader = PyPDF2.PdfFileReader(pdfFileObject)

    print(" No. Of Pages :", pdfReader.numPages)
    for page in range(pdfReader.numPages):
        pageObject = pdfReader.getPage(page)
        pages_text.append(pageObject.extractText())

print(pages_text)
  •  Tags:  
  • Related