Extract text with PyPDF2 in Python 3

I am trying to extract some text from some PDF files, but PyPDF2 doesn't return anything. It can read the number of pages, but it can't read or extract the text.

Can anyone help me resolve this problem?

This is my code:

from PyPDF2 import PdfFileReader
from pathlib import Path

openedFile = open("Desktop/sa.pdf" , "rb")
pdf = PdfFileReader(openedFile)

page_nums = pdf.getNumPages()
print(page_nums)

page = pdf.getPage(0)
text = page.extractText()

print(text)

And the result is:

kernel@kernel-IT:~$ /bin/python3 /home/kernel/Desktop/pdf.py
1

It runs without any error, but the extracted text is empty.

CodePudding user response：

If you need to extract text from the file, try using pdfbox. Its interface isn't perfect (I didn't find a way to get the text without saving a .txt file), but you can use it like so:

import pdfbox
import os

def extract_text(doc_path):
    # pdfbox saves the extracted text to a .txt file, which is read back below
    p.extract_text(doc_path)
    txt_path = doc_path[:-3] + 'txt'
    with open(txt_path, 'r', encoding="utf_8_sig") as f:
        text = f.read()
    os.remove(txt_path)
    return text

p = pdfbox.PDFBox()

print(extract_text("path/to/pdf"))

Run pip install python-pdfbox to install it.
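
One side note: slicing off the last three characters of the path only works when the file name ends in a three-letter extension. Here is a minimal sketch of building the .txt path with pathlib instead. txt_path_for is a hypothetical helper name, not part of pdfbox, and it assumes (as the code above does) that the text file is written next to the PDF with the same base name.

```python
from pathlib import Path


def txt_path_for(doc_path):
    """Return the path of the .txt file expected next to doc_path.

    Hypothetical helper: assumes the text file shares the PDF's
    directory and base name, as the slicing approach above does.
    """
    # with_suffix swaps only the final extension, whatever its length
    return str(Path(doc_path).with_suffix(".txt"))


print(txt_path_for("sa.pdf"))
```

Inside extract_text, the slicing line could then become txt_path = txt_path_for(doc_path), if that assumption holds for your files.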

CodePudding user response：

Can nobody help me? It's very important for me.