I have a dataset of images and I want to filter out all images that contain text (ASCII chars). For example, I have the following cute image of a dog:
As you can see, on right bottom corner there is a text "MAY 18 2003" so it should be filtered out.
After some research, I came across with tesseract OCR. In python I have the following code:
# Attempt 1
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
print(text)
# Attempt 2
import unidecode
img = Image.open('n02086240_1681.jpg')
text = pytesseract.image_to_string(img)
text = unidecode.unidecode(text)
print(text)
# Attempt 3
import string
char_whitelist = string.digits
char_whitelist = string.ascii_lowercase
char_whitelist = string.ascii_uppercase
text = pytesseract.image_to_string(img,lang='eng',
config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
print(text)
None of them detected the string (prints whitespaces). How can I detect it?
CodePudding user response:
you should prepare the image for the OCR.
for example, for this image I would do the following:
convert it to Black & White image with threshold that make the text visible (for this image it is 130)

now try tesseract OCR
CodePudding user response:
You can use Easy-OCR instead of pytesseract to get directly this output
Kay 10 2003
and as your goal is just to detect ASCII, you don't care about the accurate characters because you just want to filter the images which contain them.
#!/usr/bin/python3
# -*- coding: utf-8 -*-
import cv2
import easyocr
path = ""
img = cv2.imread(path "input.jpg")
# Now apply the Easy-OCR
reader = easyocr.Reader(['en'])
output = reader.readtext(img)
for i in range(len(output)):
print(output[i][-2])


