SCAN STORE RETRIEVE INDEX INTEGRATE ARCHIVE
Preparing for OCR
Optical Character Recognition – Per Wikipedia, OCR is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. Applied to the appropriate document type and format, OCR processing is extremely useful and can save both internal resources and CAPEX along with producing a higher quality product than if done by hand-key entry.
Unfortunately, OCR is not for every project. Skewed text, rough text, heavy noise, lines and other foreign data interfering with a clear and uninterrupted view and scan of text will reduce accuracy.
OCR engines are very linear processes – they look horizontally and perpendicularly across digital images. Any skewing from a 90 degree orientation will negatively affect any OCR engine. Additionally, OCR engines are not magic but very pragmatic. Images must contain familiar text resembling existing alphabetical characters. Anything that distorts standard text will reduce accuracy.
The following are industry accepted steps used to increase OCR accuracy:
Deskew – Software process, using various advanced algorithms, will identify the text orientation and attempt to align the image to a perfect 90 degree.
Noise Reduction – Also known as despeckling – software process that will remove small imperfections, spots, scratches, blotches and random marks from within the white area in a digital image. Removing these imperfections will reduce OCR engine interference and reduce “false positive” reads.
Dilation/Erosion – Text quality is the key to OCR accuracy. These filters can smooth the edges of text by removing pixels that represent rough edges or add pixels to fill missing data with a character.
Line Removal – Speciality software can provide the functionality to remove lines from an image. Removing lines reduces OCR interference.
Red/Blue/Green Dropout – Using the proper settings, color scanners will “not capture” red, blue and green data within an image. Many times, pre-printed forms have the boxes and response areas printed in red, blue or green. This is purposefully done, so that during the scanning process, the box lines and response areas are not captured and thus is less interference with the OCR engine.