I want to convert web PDFs such as https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf (and many more) into text without saving them on my PC. Thousands of such announcements come up daily, so storing every file locally is not practical. Is there a Python solution for this? Thanks
CodePudding user response:
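There are different ways to do this, but the simplest is to download the PDF locally and then use a Python module to extract the text. Note that pdfplumber reads the text layer of the PDF rather than performing OCR, so it will not recover text from scanned images.
Here is a simple code example for that (using pdfplumber):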
from urllib.request import urlopen
import pdfplumber

url = 'https://archives.nseindia.com/corporate/ICRA_26012022091856_BSER3026012022.pdf'

# Download the PDF and save it locally (the file is overwritten on every run)
response = urlopen(url)
with open("img.pdf", 'wb') as file:
    file.write(response.read())

text = None
try:
    pdf = pdfplumber.open('img.pdf')
except Exception:
    # Some files are not PDFs (annexes we don't want), or the PDF is damaged and cannot be read
    print('Error. Are you sure this is a PDF?')
else:
    # pdfplumber text extraction (first page only)
    with pdf:
        page = pdf.pages[0]
        text = page.extract_text()
EDIT: My bad, I just realised you asked for a solution "without saving them on my PC". That being said, I also scrape a lot of PDFs (thousands as well), but I save them all as "img.pdf", so each download overwrites the previous one and I end up with only one PDF file on disk. The example above does not extract the text without saving the file. Sorry about that :'(
