I'm trying to extract data from a PDF file and convert it into a pandas DataFrame. I used `fitz` from the PyMuPDF module to extract the text, and then I convert it into a DataFrame with pandas:
import fitz  # PyMuPDF
import pandas as pd

path2 = r"D:\Eversana-CVs//Harshitha R Putane.pdf"
with fitz.open(path2) as doc:
    pymupdf_text = ""
    for page in doc:
        # append each page's text (get_text() is named getText() in older PyMuPDF releases)
        pymupdf_text += page.get_text()
    print(pymupdf_text)
with open('pymupdf_text.txt', 'w', encoding='utf-8') as f:  # Converting to text file
    f.write(pymupdf_text)
df = pd.read_table('pymupdf_text.txt', sep='\n')  # Converting text file to dataframe
I also have a folder that contains many PDF documents. My goal is to read each PDF file from the folder one by one, extract its text, and convert it into a DataFrame. How can I do that in Python?
CodePudding user response:
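Try this:
import PyPDF2

# Note: PyPDF2 3.x renamed these calls to PdfReader, len(reader.pages),
# reader.pages[i] and extract_text(); the names below are the older API.
for k in range(1, 100):
    # open the PDF file (assumes the files are named file1.pdf ... file99.pdf)
    reader = PyPDF2.PdfFileReader("C:/my_path/file%s.pdf" % k)
    # get number of pages
    num_pages = reader.getNumPages()
    # extract the text of each page
    for i in range(num_pages):
        page_obj = reader.getPage(i)
        print("this is page " + str(i))
        text = page_obj.extractText()
        # print(text)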
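Or this:
import io
import os
from pdfminer.converter import TextConverter
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

fold = 'F:/technophile/Proj/SOURCE/'  # assumed to be the same folder that is opened below
allyourfiles = os.listdir(fold)
allyourpdf = []
firstpdf = ""
for i in allyourfiles:  # pick the first PDF in the folder
    if '.pdf' in i:
        firstpdf = i
        break
# pdfminer setup that the original snippet left out
resource_manager = PDFResourceManager()
fake_file_handle = io.StringIO()
converter = TextConverter(resource_manager, fake_file_handle)
page_interpreter = PDFPageInterpreter(resource_manager, converter)
with open(fold + firstpdf, 'rb') as fh:
    for page in PDFPage.get_pages(fh, caching=True, check_extractable=True):
        page_interpreter.process_page(page)
text = fake_file_handle.getvalue()
allyourpdf.append(text)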
CodePudding user response:
You can use the built-in pathlib module to list all the PDFs in your directory:
from pathlib import Path
# yields every file path in the specified directory that has a .pdf extension
pdf_search = Path("<path>/<to>/<pdfs>/").glob("*.pdf")
# convert the glob generator output to a list of full paths
# (file.name alone would drop the directory, so fitz.open could not find the file)
# skip this if you are comfortable with generators and pathlib
pdf_files = [str(file) for file in pdf_search]
Now you can simply run your block of code in a loop to iterate over the PDFs.
For example:
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        ...
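Putting the loop together with the extraction code from the question, a minimal sketch could look like the following. The function names (`extract_text_pymupdf`, `collect_pdf_lines`, `pdf_folder_to_dataframe`) and the column names (`file`, `line_no`, `text`) are made up for illustration, and the one-row-per-line layout is only an assumption about what the DataFrame should contain. It also skips the intermediate text file and builds the DataFrame directly from the extracted lines.

```python
from pathlib import Path


def extract_text_pymupdf(pdf_path):
    """Return the text of every page of one PDF as a single string."""
    import fitz  # PyMuPDF, imported lazily so the other helpers work without it

    with fitz.open(pdf_path) as doc:
        # get_text() is named getText() in older PyMuPDF releases
        return "".join(page.get_text() for page in doc)


def collect_pdf_lines(folder, extract_text=extract_text_pymupdf):
    """Build one record per non-empty text line of every PDF in `folder`."""
    records = []
    for pdf_path in sorted(Path(folder).glob("*.pdf")):
        text = extract_text(str(pdf_path))
        for line_no, line in enumerate(text.splitlines(), start=1):
            if line.strip():  # skip blank lines
                records.append(
                    {"file": pdf_path.name, "line_no": line_no, "text": line}
                )
    return records


def pdf_folder_to_dataframe(folder):
    """Turn the collected records into a single pandas DataFrame."""
    import pandas as pd

    return pd.DataFrame(collect_pdf_lines(folder), columns=["file", "line_no", "text"])
```

With the folder from the question this would be called as `df = pdf_folder_to_dataframe(r"D:\Eversana-CVs")`. Keeping the file name in its own column means you can still get a per-document view later with `df.groupby("file")`.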
