I am writing some simple JavaScript code to extract text from an image. For that I am using Tesseract. However, I have found that Tesseract is not 100% accurate (or maybe I don't know how to use it correctly).
For example, after converting the image text to an array of strings and scanning each string one by one, I get the following strings, which are not formatted consistently:
Age + 67 Gender : Female Age : 45 Gender : Female Age + 45 Gender : Male
Age : 44 Gender : Male Age 36 Gender : Female Age : 56 Gender : Male
Age +63 Gender : Male Age : 62 Gender : Female Age : 37 Gender : Male
I tried to split each string on + and space like this:
const ageAndGenderArray = line.split(" ") || line.split("+");
and I got the following output:
['Age', '+', '67', 'Gender', ':', 'Female', 'Age', ':', '45', 'Gender', ':', 'Female', 'Age', '+', '45', 'Gender', ':', 'Male']
['Age', ':', '44', 'Gender', ':', 'Male', 'Age', '36', 'Gender', ':', 'Female', 'Age', ':', '56', 'Gender', ':', 'Male']
['Age', '+63', 'Gender', ':', 'Male', 'Age', ':', '62', 'Gender', ':', 'Female', 'Age', ':', '37', 'Gender', ':', 'Male']
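One likely reason `+63` stays glued together in the output: `line.split(" ")` always returns an array, and an array is truthy, so the `||` never evaluates the second split. The sketch below illustrates this on one of the sample lines and shows a single split on a regular expression instead. The character class `[\s+:]` is an assumption based only on the separators visible in the samples.

```javascript
// Sample line copied from the OCR output in this post.
const line = "Age +63 Gender : Male Age : 62 Gender : Female";

// An array is always truthy, so line.split("+") is never evaluated here.
const spaceOnly = line.split(" ") || line.split("+");

// Split on runs of whitespace, "+" or ":" so the separators disappear;
// filter(Boolean) drops empty strings from a leading or trailing separator.
const tokens = line.split(/[\s+:]+/).filter(Boolean);

console.log(spaceOnly);
console.log(tokens);
```

With this split the tokens alternate as label, value, label, value, regardless of whether the OCR produced `+`, `:` or nothing after `Age`.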
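As you can see, the input strings are not formatted consistently. Some contain Age + 67 (with a space after the +) and some contain Age +63 (without one). In some places the separator is +, in others it is :, and in one case (Age 36) there is no separator at all. Because of this I am not able to extract the values reliably.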
I am expecting output like this (shown for the last line):
63 Male
62 Female
37 Male
So how can I parse such inconsistent strings?
My code:
const processImage = () => {
  Tesseract.recognize(file, "eng", { logger: (m) => console.log(m) }).then(
    ({ data: { text } }) => {
      console.log(text);
      const parsedCandidates = parseOCRResult(text);
      setCandidates(parsedCandidates);
    }
  );
  console.log(file);
};

const parseOCRResult = (text) => {
  // parsing logic of strings
};
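A minimal sketch of what the parsing step might look like, assuming the OCR text only ever varies in the ways shown in the three sample lines (an optional `+` or `:` after the label, optional spaces, and sometimes a missing space before the next `Age`). The function name `parseAgeGender` and the `{ age, gender }` result shape are hypothetical; the idea is to match each Age/Gender pair with one tolerant regular expression rather than splitting into tokens.

```javascript
// Hypothetical helper; the pattern is an assumption derived only from
// the sample OCR lines in this post, not a general-purpose parser.
const parseAgeGender = (text) => {
  // "Age", any mix of spaces, "+" or ":", then the digits;
  // "Gender", the same kind of separator, then Male or Female.
  const pattern = /Age[\s+:]*(\d+)\s*Gender[\s+:]*(Male|Female)/gi;

  // matchAll yields one match per Age/Gender pair; the two capture
  // groups hold the age digits and the gender word.
  return Array.from(text.matchAll(pattern), ([, age, gender]) => ({
    age: Number(age),
    gender,
  }));
};

const sample = [
  "Age + 67 Gender : Female Age : 45 Gender : Female Age + 45 Gender : MaleAge : 44 Gender : Male",
  "Age +63 Gender : Male Age : 62 Gender : Female Age : 37 Gender : Male",
].join("\n");

const candidates = parseAgeGender(sample);
console.log(candidates.map((c) => `${c.age} ${c.gender}`).join("\n"));
```

Because the pattern does not depend on whitespace around the separators, it tolerates `Age +63`, `Age : 62`, `Age 36` and a run-together `MaleAge` alike. If the OCR misreads the words `Age` or `Gender` themselves, this sketch will skip that pair, so it may need loosening for real scans.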