I’ve been using Tesseract (through tesseract.js) to read various numbers (up to 99,999.9) in the format below:
It seems to get a proper read about 85% of the time, but I need 100% accuracy.
async function runOCR(url) {
  const worker = await Tesseract.createWorker('eng', 1, {
    tessedit_pageseg_mode: 13,
    config: '--psm 13'
  });
  (async () => {
    await worker.load();
    await worker.loadLanguage('eng');
    await worker.initialize('eng');
    await worker.setParameters({
      tessedit_ocr_engine_mode: Tesseract.OEM_TESSERACT_ONLY,
      tessedit_char_whitelist: '0123456789,.',
      preserve_interword_spaces: '0',
      SINGLE_WORD: true,
      tessedit_pageseg_mode: Tesseract.SINGLE_WORD,
    });
    const {
      data: { text },
    } = await worker.recognize(url);
    doSomething(text);
    await worker.terminate();
  })();
}
The main issue is that I don’t know where to set the page segmentation mode (PSM, pageseg). The examples I’m finding are either out of date or written for a different programming language.
Here’s the list of pageseg options that I found in a Tesseract header file (https://github.com/tesseract-ocr/tesseract/blob/4.0.0/src/ccstruct/publictypes.h#L163):
PSM_OSD_ONLY, ///< Orientation and script detection only.
PSM_AUTO_OSD, ///< Automatic page segmentation with orientation and script detection. (OSD)
PSM_AUTO_ONLY, ///< Automatic page segmentation, but no OSD, or OCR.
PSM_AUTO, ///< Fully automatic page segmentation, but no OSD.
PSM_SINGLE_COLUMN, ///< Assume a single column of text of variable sizes.
PSM_SINGLE_BLOCK_VERT_TEXT, ///< Assume a single uniform block of vertically aligned text.
PSM_SINGLE_BLOCK, ///< Assume a single uniform block of text. (Default.)
PSM_SINGLE_LINE, ///< Treat the image as a single text line.
PSM_SINGLE_WORD, ///< Treat the image as a single word.
PSM_CIRCLE_WORD, ///< Treat the image as a single word in a circle.
PSM_SINGLE_CHAR, ///< Treat the image as a single character.
PSM_SPARSE_TEXT, ///< Find as much text as possible in no particular order.
PSM_SPARSE_TEXT_OSD, ///< Sparse text with orientation and script det.
PSM_RAW_LINE, ///< Treat the image as a single text line, bypassing hacks that are Tesseract-specific.
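For reference, the enum above is positional, so each name's numeric mode is its index in the list (PSM_OSD_ONLY is 0, PSM_SINGLE_WORD is 8, PSM_RAW_LINE is 13, which matches the `--psm 13` flag). Here is a minimal sketch of that mapping in JavaScript. The names `PSM_NAMES`, `PSM` and `psmParam` are hypothetical helpers for illustration, not part of tesseract.js, and passing the mode as a string under `tessedit_pageseg_mode` is an assumption about what `worker.setParameters` expects.

```javascript
// Names in the same order as the enum quoted above; the array index is the numeric mode.
const PSM_NAMES = [
  'PSM_OSD_ONLY',
  'PSM_AUTO_OSD',
  'PSM_AUTO_ONLY',
  'PSM_AUTO',
  'PSM_SINGLE_COLUMN',
  'PSM_SINGLE_BLOCK_VERT_TEXT',
  'PSM_SINGLE_BLOCK',
  'PSM_SINGLE_LINE',
  'PSM_SINGLE_WORD',
  'PSM_CIRCLE_WORD',
  'PSM_SINGLE_CHAR',
  'PSM_SPARSE_TEXT',
  'PSM_SPARSE_TEXT_OSD',
  'PSM_RAW_LINE',
];

// Lookup table: mode name -> numeric value.
const PSM = Object.fromEntries(PSM_NAMES.map((name, index) => [name, index]));

// Hypothetical helper: builds a parameter object for a given mode name.
// Assumption: the value is passed as a string, e.g. '8' for PSM_SINGLE_WORD.
function psmParam(name) {
  if (!(name in PSM)) {
    throw new Error(`Unknown page segmentation mode: ${name}`);
  }
  return { tessedit_pageseg_mode: String(PSM[name]) };
}
```

Under those assumptions, the result of `psmParam('PSM_SINGLE_WORD')` could be merged into the object passed to `worker.setParameters(...)` in place of the `SINGLE_WORD` entries.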
How can I detect the numbers in the image more reliably, or how do I set the page segmentation mode / config correctly? (The config changes I’ve been making don’t seem to affect my hit rate.)
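Independently of the segmentation mode, one way to get closer to the 100% requirement is to reject any read that does not have the expected shape, so a bad result can be retried or flagged instead of being used. The sketch below will not improve recognition itself. It assumes the readings look like `12,345.6` (comma as thousands separator, exactly one decimal digit, maximum 99,999.9), which is a guess based on the upper bound mentioned above; `READING_PATTERN` and `parseReading` are hypothetical names.

```javascript
// Assumed format: 0.0 to 99,999.9 with a comma thousands separator
// and exactly one digit after the decimal point.
const READING_PATTERN = /^(?:\d{1,2},\d{3}|\d{1,3})\.\d$/;

// Returns the reading as a number, or null if the OCR text
// does not match the assumed format.
function parseReading(text) {
  const cleaned = text.trim(); // OCR output usually ends with a newline
  if (!READING_PATTERN.test(cleaned)) {
    return null;
  }
  return Number(cleaned.replace(/,/g, ''));
}
```

In `runOCR`, the text returned by `worker.recognize(url)` could be passed through `parseReading` before calling `doSomething`, treating a `null` result as a failed read.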