I'm trying to read PDFs one by one and then convert them into a dataframe

I've used 'fitz' from the PyMuPDF module to extract the data, and then pandas to convert the extracted data to a dataframe.

#Code to read multiple pdfs from the folder:

import fitz  # PyMuPDF
from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# convert the glob generator output to a list

pdf_files = [str(file.absolute()) for file in pdf_search]

#Code to extract the data:

for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        pypdf_text = ""
        for page in doc:
            pypdf_text += page.getText()

However, the above code only extracts the data from the last PDF in the folder, and so gives the result for that PDF only. The desired goal is to extract the data from all the PDFs in the folder, one by one.

Please help me understand why this is happening and how to resolve it.

CodePudding user response：

Change the code below: Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf") to files_pdf = [file for file in glob.glob(path + "*.pdf", recursive=True)] and give path as a variable.
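
A minimal sketch of that suggestion using the standard-library `glob` module. The temporary folder and the file names are placeholders made up for illustration, and `os.path.join` is used here instead of string concatenation so the separator is handled automatically. Note that `recursive=True` only has an effect when the pattern contains `**`, so it is left out of this sketch.

```python
import glob
import os
import tempfile

# Create a throwaway folder with placeholder files so the sketch runs anywhere.
folder = tempfile.mkdtemp()
for name in ("a.pdf", "b.pdf", "notes.txt"):
    open(os.path.join(folder, name), "w").close()

# `path` holds the folder as a variable; os.path.join adds the separator.
path = folder
files_pdf = sorted(glob.glob(os.path.join(path, "*.pdf")))

# Only the .pdf files are matched; notes.txt is ignored.
print([os.path.basename(f) for f in files_pdf])
```

Swapping `pathlib` for `glob` changes how the files are listed, but not how the extracted text is accumulated afterwards.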

CodePudding user response：

The following code worked for me:

import fitz  # PyMuPDF
import pandas as pd
from pathlib import Path
# returns all file paths that have .pdf as extension in the specified directory
pdf_search = Path("C:/Users/Ayesha.Gondekar/Eversana-CVs/").glob("*.pdf")
# convert the glob generator output to a list

pdf_files = [str(file.absolute()) for file in pdf_search]

#Code to extract the data:

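pdf_txt = ""  # created once, before the loop, so text from every pdf is kept
for pdf in pdf_files:
    with fitz.open(pdf) as doc:
        for page in doc:
            # append each page's text (newer PyMuPDF versions name this get_text())
            pdf_txt += page.getText()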
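
The key point is where the accumulator string is created. If it is created inside the loop over files, it is wiped each time a new file is opened, so only the last file's text survives. A minimal sketch of the difference, using stand-in strings instead of real PDFs and assuming text is appended with `+=` (the names `fake_pdfs`, `text_reset` and `text_all` are made up for illustration):

```python
# Stand-in data: each "PDF" is just a list of page strings (no real files).
fake_pdfs = {"a.pdf": ["A1 ", "A2 "], "b.pdf": ["B1 "]}

# Accumulator created inside the file loop: it is wiped for every file.
for name, pages in fake_pdfs.items():
    text_reset = ""
    for page in pages:
        text_reset += page

# Accumulator created once, before the file loop: text from all files is kept.
text_all = ""
for name, pages in fake_pdfs.items():
    for page in pages:
        text_all += page

print(repr(text_reset))  # only the last file's pages
print(repr(text_all))    # pages from every file
```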

#Converting the extracted data to a dataframe:

with open('pdf_txt.txt', 'w', encoding='utf-8') as f:  # write the text to a file
    f.write(pdf_txt)

data = pd.read_table('pdf_txt.txt', sep='\n')  # read the text file into a dataframe
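
The intermediate text file is not strictly required, and newer pandas releases may reject `sep='\n'`. As an alternative sketch, assuming one row per non-empty line of text is the desired layout, the lines can be handed to pandas directly. The helper name `lines_to_frame`, the column name `text` and the sample string are hypothetical and not part of the original answer.

```python
import pandas as pd


def lines_to_frame(pdf_txt: str) -> pd.DataFrame:
    """Build a one-column DataFrame with one row per non-empty line."""
    lines = [line for line in pdf_txt.splitlines() if line.strip()]
    return pd.DataFrame({"text": lines})


# Stand-in for the text accumulated from the PDFs.
sample = "Jane Doe\nData Analyst\n\nPython, SQL\n"
data = lines_to_frame(sample)
print(data)
```

Unlike `pd.read_table`, which by default treats the first line as the header, this keeps every line as a data row.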

Thank you @Yevhen Kuzmovych for your help!