How to read multiple PDFs from a folder one by one

I'm trying to extract data from a PDF file and convert it into a pandas dataframe. I used 'fitz' from the PyMuPDF module to extract the text, and then I convert it into a dataframe with pandas.

import fitz  # PyMuPDF
import pandas as pd

path2 = r"D:\Eversana-CVs//Harshitha R Putane.pdf"
with fitz.open(path2) as doc:
    pymupdf_text = ""
    for page in doc:
        # Append each page's text; newer PyMuPDF releases name this method get_text()
        pymupdf_text += page.getText()
print(pymupdf_text)

with open('pymupdf_text.txt', 'w', encoding='utf-8') as f:  # Writing the text to a file
    f.write(pymupdf_text)

df = pd.read_table('pymupdf_text.txt', sep='\n')  # Converting text file to dataframe

Similarly, I have a folder that contains many PDF documents. My goal is to read each PDF file from the folder one by one, extract its text, and convert it into a dataframe. How can I do that in Python?

CodePudding user response：

try this:

import PyPDF2
import re

# assumes the files are named file1.pdf ... file99.pdf
for k in range(1, 100):
    # open the pdf file (PdfFileReader is the PyPDF2 1.x/2.x API)
    reader = PyPDF2.PdfFileReader("C:/my_path/file%s.pdf" % k)

    # get number of pages
    num_pages = reader.getNumPages()

    # extract text and do the search
    for i in range(num_pages):
        page_obj = reader.getPage(i)
        print("this is page " + str(i))
        text = page_obj.extractText()
        # print(text)

or this:

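import io
import os

from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

fold = 'F:/technophile/Proj/SOURCE/'  # folder that holds the pdfs
allyourfiles = os.listdir(fold)
allyourpdf = []

# pick the first pdf in the folder
firstpdf = ""
for i in allyourfiles:
    if '.pdf' in i:
        firstpdf = i
        break

# pdfminer writes the extracted text into this in-memory buffer
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)

with open(fold + firstpdf, 'rb') as fh:
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)

    text = fake_file_handle.getvalue()
    allyourpdf.append(text)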
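
A substring test such as `'.pdf' in name` also matches names like `report.pdf.bak`, and it stops being useful once you want every PDF rather than the first one. A minimal sketch of a stricter filter, assuming only the standard library; `list_pdfs` is a hypothetical helper name, not part of pdfminer:

```python
import os


def list_pdfs(folder):
    """Return full paths of the files in `folder` whose name ends in .pdf (any case)."""
    return sorted(
        os.path.join(folder, name)
        for name in os.listdir(folder)
        if name.lower().endswith(".pdf")
        and os.path.isfile(os.path.join(folder, name))
    )
```

Each returned path can then be opened in turn inside a `for` loop instead of stopping at the first match.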

CodePudding user response：

You can use the standard-library pathlib module to list all the PDFs in your directory:

from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("<path>/<to>/<pdfs>/").glob("*.pdf")
# convert the glob generator output to a list of full paths
# (file.name alone would drop the directory, so fitz.open could not find the file)
# skip this if you are comfortable with generators and pathlib
pdf_files = [str(file.absolute()) for file in pdf_search]

Now you can simply run your block of code in a loop to iterate over the pdfs.

for example:

for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        ...
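
Putting the pieces together, here is a rough sketch of the whole loop: one row per PDF, ready to hand to pandas. The names `extract_with_pymupdf` and `read_pdf_folder` are made up for this example, and it assumes a recent PyMuPDF where the page method is called `get_text()` (older releases use `getText()`, as in the question).

```python
from pathlib import Path


def extract_with_pymupdf(pdf_path):
    """Return the text of every page in one PDF, joined into a single string."""
    import fitz  # PyMuPDF; imported here so the folder walk below works without it

    with fitz.open(pdf_path) as doc:
        # Assumption: get_text() is available (older PyMuPDF releases call it getText())
        return "".join(page.get_text() for page in doc)


def read_pdf_folder(folder, extract=extract_with_pymupdf):
    """Build one row per PDF in `folder`: the file name and its extracted text."""
    rows = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        rows.append({"file": pdf_path.name, "text": extract(str(pdf_path))})
    return rows


# Usage (folder path taken from the question):
# import pandas as pd
# df = pd.DataFrame(read_pdf_folder(r"D:\Eversana-CVs"))
```

Passing the extractor in as a parameter is only a convenience: it lets you swap in PyPDF2 or pdfminer without touching the folder loop.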