I have several types of images that I need to extract text from. I can manually classify the images into 3 categories based on the amount of noise in the background:
- Images with no noise.
- Images with some light noise in the background.
- Images with heavy noise in the background.

For category 1 images, OCR works without problems (the basic case).
For category 2 images and some category 3 images, I managed to extract the text by applying the following steps:
- Grayscale, Gaussian blur, Otsu’s threshold
- Morph open to remove noise and invert the image → then perform text extraction.
A single noise-removal method clearly does not work for all images. So, is there a method for classifying the level of background noise in an image?
All suggestions are welcome. Thanks in advance.
Update (2022 Jan 21):
With the answer that I got from @B200011011 in 
From Category 2 and 3 images. Examples:
Here is the code which I am using:
import math

import cv2
import numpy as np
from google.colab.patches import cv2_imshow  # Colab-only display helper
from imutils import paths
from skimage import exposure

for imagePath in paths.list_images("/content"):
    txt_block_img = cv2.imread(imagePath)  # colour copy (unused in this snippet)
    img = cv2.imread(imagePath, cv2.IMREAD_GRAYSCALE)
    cv2_imshow(img)
    img_pixel_count = img.shape[0] * img.shape[1]
    # h[0] holds the per-intensity pixel counts, h[1] the bin centres
    h = np.array(exposure.histogram(img, nbins=256))
    # Only continue when the histogram covers all 256 intensity levels
    if len(h[0]) == 256:
        # Pixels that are pure black (0) or pure white (255)
        bw_count = h[0][0] + h[0][255]
        other_count = img_pixel_count - bw_count
        bw_percentage = (bw_count * 100.0) / img_pixel_count
        other_percentage = (other_count * 100.0) / img_pixel_count
        # print('BW PIXEL PERCENTAGE: ', bw_percentage)
        print('OTHER PIXEL PERCENTAGE: ', math.ceil(other_percentage))
        # Above this share of non-black/white pixels, treat the image as noisy
        differentiate_threshold = 30.0
        if other_percentage > differentiate_threshold:
            print('TYPE 2 or TYPE 3')
        else:
            print('TYPE 1: BLACK AND WHITE')
    else:
        print("the image has no black color")
I could get the following results:
However, this solution is incomplete, since I cannot find a good threshold to separate Type 2 from Type 3 images. So, are there any image processing methods that I could try to separate Type 2 and Type 3 images?
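One direction that might be worth trying is to estimate the background noise level directly instead of counting non-black/white pixels. The sketch below is only an illustration under stated assumptions: it assumes the noise is roughly pixel-wise random (grain or speckle) rather than a structured pattern such as lines or watermarks, and the names `estimate_noise_sigma` and `noise_category` and the two thresholds are hypothetical placeholders that would have to be tuned on manually labelled samples.

```python
import numpy as np


def estimate_noise_sigma(gray):
    """Robust estimate of the noise standard deviation of a grayscale image."""
    g = np.asarray(gray, dtype=np.float64)
    # 4-neighbour Laplacian on interior pixels: flat regions give ~0, so
    # what remains is mostly noise (text edges show up as outliers).
    lap = (g[1:-1, :-2] + g[1:-1, 2:] + g[:-2, 1:-1] + g[2:, 1:-1]
           - 4.0 * g[1:-1, 1:-1])
    # The median absolute deviation is largely insensitive to the
    # outliers produced by text strokes.
    mad = np.median(np.abs(lap - np.median(lap)))
    # 1.4826 * MAD approximates the std-dev for Gaussian data; the kernel
    # (1, 1, 1, 1, -4) scales i.i.d. noise std-dev by sqrt(20).
    return 1.4826 * mad / np.sqrt(20.0)


def noise_category(gray, light_min=1.0, heavy_min=10.0):
    """Map the estimate to categories 1..3 (thresholds are placeholders)."""
    sigma = estimate_noise_sigma(gray)
    if sigma < light_min:
        return 1  # no noticeable noise
    if sigma < heavy_min:
        return 2  # light noise
    return 3      # heavy noise
```

The category could then decide which cleaning steps to run before OCR. For textured or patterned backgrounds this estimate is likely to be unreliable, so a different feature would be needed there.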