I am trying to extract text from some PDF files, but PyPDF2 does not return anything. It can read the number of pages, but it cannot read or extract the text.
Can anyone help me resolve this problem?
This is my code:
from PyPDF2 import PdfFileReader
from pathlib import Path
openedFile = open("Desktop/sa.pdf" , "rb")
pdf = PdfFileReader(openedFile)
page_nums = pdf.getNumPages()
print(page_nums)
page = pdf.getPage(0)
text = page.extractText()
print(text)
And the result is:
kernel@kernel-IT:~$ /bin/python3 /home/kernel/Desktop/pdf.py
1
It runs without any error, but no text is printed.
CodePudding user response:
If you need to extract text from the file, try using pdfbox. Its interface isn't perfect (I didn't find a way to get the text without saving a .txt file), but you can use it like so:
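import pdfbox
import os

def extract_text(doc_path):
    # pdfbox writes the extracted text to a .txt file next to the PDF
    p.extract_text(doc_path)
    # swap the "pdf" extension for "txt" to find that file
    txt_path = doc_path[:-3] + 'txt'
    with open(txt_path, 'r', encoding="utf_8_sig") as f:
        text = f.read()
    # clean up the intermediate .txt file
    os.remove(txt_path)
    return text

p = pdfbox.PDFBox()
print(extract_text("path/to/pdf"))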
Run pip install python-pdfbox to install it
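One caveat about the snippet: the slice doc_path[:-3] only works when the path ends in a three-letter extension. A minimal sketch of a more forgiving way to build the .txt path with pathlib from the standard library. The helper name txt_path_for is hypothetical, and it assumes (as the snippet does) that the text file is written next to the PDF under the same base name:

```python
from pathlib import Path

def txt_path_for(doc_path):
    """Return the path of the .txt file expected next to the given PDF."""
    # with_suffix replaces only the final extension, whatever its length or case
    return Path(doc_path).with_suffix(".txt")

print(txt_path_for("path/to/file.pdf"))
```

The returned Path can be passed to open() and os.remove() directly in place of the sliced string.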
CodePudding user response:
Can nobody help me? It's very important to me.
