I'm trying to convert the data in an image to a data frame in R using tesseract, but I have run into a problem, perhaps due to my use of regular expressions.
library(magick)
library(tesseract)
# Read the image, preprocess it, and run OCR on the result
team_img <- image_read("measuring.png")
team_mgk <- team_img %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr()
cat(team_mgk)
# Strip punctuation (note: text1_a is not used in the call below)
text1_a <- gsub('[[:punct:]]', '', team_mgk)
read.table(text = team_mgk,
           col.names = c('Time_factor', 'Tree_#', 'Species',
                         'Fragment', 'Linear_Extension', 'Colour'))
# Error
# Error in scan(file = file, what = what, sep = sep, quote = quote,
# dec = dec, :
# line 1 did not have 6 elements
The idea is to learn how to use OCR to read in a data frame. The image is as follows:
Answer:
You were very close. The OCR did a good job: only a single underscore is missing, in the last row ('Time 2' instead of 'Time_2'), so that line splits into one field too many and read.table throws an error. This can be fixed with a simple replacement using sub.
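One way to locate such a line before calling read.table is to count the whitespace-separated fields on each line. The following is only a minimal sketch: the three sample lines are made up for illustration (they imitate the OCR text, with the underscore missing in the last one) and are not the actual OCR output.

```r
# Made-up sample lines imitating the OCR text (assumption, not real output);
# the last line reads "Time 2" instead of "Time_2".
ocr_lines <- c(
  "Time_0 31 A.tenius 12A 49.50 Brown",
  "Time_1 31 A.tenius 12A 56.72 Brown",
  "Time 2 31 A.tenius 12A 74.38 Brown"
)

# count.fields() splits on whitespace, like read.table() does by default
con <- textConnection(ocr_lines)
n_fields <- count.fields(con)
close(con)

# Lines whose field count differs from the expected 6 columns
bad_lines <- which(n_fields != 6)
bad_lines
```

Any index reported in bad_lines points at a line that needs cleaning before read.table can parse the text.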
Since the image is now online at https://i.stack.imgur.com/V9lWV.png (it was uploaded with your question), we can create a fully reproducible example.
library(dplyr)
library(magick)
library(tesseract)
df <- "https://i.stack.imgur.com/V9lWV.png" %>%
  image_read() %>%
  image_resize('2000x') %>%
  image_convert(type = 'Grayscale') %>%
  image_trim(fuzz = 40) %>%
  image_write(format = 'png', density = '300x300') %>%
  tesseract::ocr() %>%
  strsplit('\n') %>%                 # split the OCR text into lines
  getElement(1) %>%
  `[`(-1) %>%                        # drop the first line (header)
  {sub('Time 2', 'Time_2', .)} %>%   # restore the missing underscore
  {read.table(text = .)} %>%
  setNames(c('Time_factor', 'Tree #', 'Species', 'Fragment',
             'Linear Extension(mm)', 'Colour'))
Resulting in
df
#>    Time_factor Tree #  Species Fragment Linear Extension(mm) Colour
#> 1       Time_O     31 A.tenius      12A                49.50  Brown
#> 2       Time_1     31 A.tenius      12A                56.72  Brown
#> 3       Time_2     31 A.tenius      12A                74.38  Brown
#> 4       Time_O     31 A.tenius      12B                58.66  Brown
#> 5       Time_1     31 A.tenius      12B                78.45  Brown
#> 6       Time_2     31 A.tenius      12B                94.37  Brown
#> 7       Time_O     31 A.tenius      12C                55.97  Brown
#> 8       Time_1     31 A.tenius      12C                90.12  Brown
#> 9       Time_2     31 A.tenius      12C               121.61  Brown
#> 10      Time_O     31 A.tenius      12D                70.19  Brown
#> 11      Time_1     31 A.tenius      12D                91.82  Brown
#> 12      Time_2     31 A.tenius      12D               115.57  Brown
#> 13      Time_O     34 A.tenius       3B                60.10 Yellow
#> 14      Time_1     34 A.tenius       3B                79.00 Yellow
#> 15      Time_2     34 A.tenius       3B               103.82 Yellow
#> 16      Time_O     34 A.tenius       3C                48.18 Yellow
#> 17      Time_1     34 A.tenius       3C                58.70 Yellow
#> 18      Time_2     34 A.tenius       3C                99.03 Yellow
#> 19      Time_O     34 A.tenius       3D                66.12 Yellow
#> 20      Time_1     34 A.tenius       3D                84.05 Yellow
#> 21      Time_2     34 A.tenius       3D               114.38 Yellow
#> 22      Time_O     34 A.tenius       3E                68.94 Yellow
#> 23      Time_1     34 A.tenius       3E                92.30 Yellow
#> 24      Time_2     34 A.tenius       3E               109.05 Yellow
#> 25      Time_O     34 A.tenius       4A                46.20   Blue
#> 26      Time_1     34 A.tenius       4A                67.00   Blue
#> 27      Time_2     34 A.tenius       4A               127.48   Blue
#> 28      Time_O     34 A.tenius       4B                87.19   Blue
#> 29      Time_1     34 A.tenius       4B               109.18   Blue
#> 30      Time_2     34 A.tenius       4B               109.71   Blue
#> 31      Time_O     34 A.tenius       4C                77.26   Blue
#> 32      Time_1     34 A.tenius       4C               123.57   Blue
#> 33      Time_2     34 A.tenius       4C               135.59   Blue
#> 34      Time_O     34 A.tenius       4D                60.01   Blue
#> 35      Time_1     34 A.tenius       4D                80.32   Blue
#> 36      Time_2     34 A.tenius       4D               101.75   Blue
Created on 2022-08-30 with reprex v2.0.2
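Note that the output above still contains Time_O with a capital letter O. If the image actually shows the digit 0 there (an assumption, since only the OCR text is visible here), a slightly more general cleanup could look like the sketch below. The helper fix_time_factor, the sample lines, and the syntactic column names are all hypothetical and not part of the original answer.

```r
# Made-up sample lines imitating the OCR text (assumption, not real output)
ocr_lines <- c(
  "Time_O 31 A.tenius 12A 49.50 Brown",
  "Time_1 31 A.tenius 12A 56.72 Brown",
  "Time 2 31 A.tenius 12A 74.38 Brown"
)

# Hypothetical helper: normalise the first field of every line
fix_time_factor <- function(lines) {
  # "Time 2" or "Time2" -> "Time_2" (only at the start of a line)
  lines <- sub("^Time[ _]?([0-9O]) ", "Time_\\1 ", lines)
  # Letter "O" -> digit "0"; assumes the image shows Time_0 there
  sub("^Time_O ", "Time_0 ", lines)
}

fixed <- fix_time_factor(ocr_lines)

# Column names chosen here are illustrative, syntactically valid variants
df_fixed <- read.table(
  text = fixed,
  col.names = c("Time_factor", "Tree", "Species", "Fragment",
                "Linear_Extension_mm", "Colour"),
  stringsAsFactors = FALSE
)
str(df_fixed)

# For contrast: stripping all punctuation also removes "_" and "."
gsub("[[:punct:]]", "", ocr_lines[2])
#> [1] "Time1 31 Atenius 12A 5672 Brown"
```

A targeted replacement anchored at the start of the line leaves the decimal points and species names untouched, which a blanket [[:punct:]] removal would not.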

