SCAN STORE RETRIEVE INDEX INTEGRATE ARCHIVE
Monthly Archives: February 2011
Have you been to a “Scanning Seminar” recently? You probably walked away believing that document scanning was the most important part of any “conversion project”.
But then you visited with a consultant who greatly undervalued the importance of scanning with a dismissive statement such as, “anybody can scan paper (or microfilm)”… He or she then explained that the crucial element of a document scanning project is the consulting and professional services needed to implement it.
But wait; now you visit with a software salesperson. You are informed that buying the proper software will ensure a successful project no matter what type of scanning and/or professional services you employ.
Referencing the “Three Legged Stool” analogy, we can see that if any of these three elements fails to deliver, you will fall right upon your _ _ _. Experience tells us that each of these elements is equally important. Each depends on the others to ensure a successful project:
Scanning Service: Proven Quality Control and Project Tracking methodologies, along with proper hardware and software, are crucial to the success of your project. The right document capture configuration depends entirely on the type and volume of your source documents. When selecting that configuration, the image clean-up functionality of the software is the most important consideration. If you intend to use OCR or automated forms processing, image quality is key to success.
Professional Services: These services should set the table for the project. Elements such as a pre-scan inventory, importing scanned images into your new software, pilot projects, project milestones, indexing nomenclature, network requirements, training and the other parts of the overall project implementation are the nexus between the software and the scanning.
Enterprise Document/Content Management Software: Of course software is always important. Your software selection must meet and exceed your current needs and provide scalability for the future. Very cliché, but truthful: by working with both a good consultant and a good software vendor, you will get more of a 360-degree view of what you will get out of your new software. Initially, you should access the scanned images in your new software system in much the same way you would look for these records in a standard file cabinet. Moving too many steps past this may lead to user confusion, a feeling of intimidation and a lack of user buy-in.
Optical Character Recognition – Per Wikipedia, OCR is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. Applied to the appropriate document type and format, OCR processing is extremely useful: it can save both internal resources and CAPEX, and it can produce a higher quality product than manual key entry.
Unfortunately, OCR is not for every project. Skewed text, rough text, heavy noise, lines and other foreign data interfering with a clear and uninterrupted view and scan of text will reduce accuracy.
OCR engines are very linear processes – they look horizontally and vertically across digital images. Any skew away from a square, 90-degree orientation will negatively affect any OCR engine. Additionally, OCR engines are not magic but very pragmatic. Images must contain familiar text resembling existing alphabetical characters. Anything that distorts standard text will reduce accuracy.
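To make the skew point concrete, here is a minimal sketch in Python of a row projection profile, one common way to measure how well text lines up with the horizontal. It is a toy illustration, not how any particular OCR engine works: the function names (`make_line_image`, `profile_sharpness`, `estimate_skew`) are hypothetical, the "text line" is a single one-pixel stroke, and tilt is approximated by a vertical shear, which is only reasonable for small angles.

```python
import math

def make_line_image(width, height, angle_deg):
    """Draw one thin stroke (a stand-in for a line of text) tilted by angle_deg."""
    slope = math.tan(math.radians(angle_deg))
    image = [[0] * width for _ in range(height)]
    for x in range(width):
        y = round(height // 2 + (x - width // 2) * slope)
        if 0 <= y < height:
            image[y][x] = 1
    return image

def row_profile(image):
    """Count the ink pixels in every row."""
    return [sum(row) for row in image]

def profile_sharpness(image, angle_deg=0.0):
    """Sum of squared row counts after undoing a trial tilt of angle_deg.

    The score is high when the ink is concentrated in a few rows, which is
    what a row-by-row reader wants to see.
    """
    width = len(image[0])
    slope = math.tan(math.radians(angle_deg))
    counts = {}
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if pixel:
                corrected = round(y - (x - width // 2) * slope)
                counts[corrected] = counts.get(corrected, 0) + 1
    return sum(count * count for count in counts.values())

def estimate_skew(image, candidate_angles):
    """Brute force: return the trial angle that gives the sharpest profile."""
    return max(candidate_angles, key=lambda a: profile_sharpness(image, a))

straight = make_line_image(60, 21, 0)
skewed = make_line_image(60, 21, 5)
print(profile_sharpness(straight), profile_sharpness(skewed))
print(estimate_skew(skewed, range(-10, 11)))
```

The straight stroke lands in a single row, while the same stroke tilted by 5 degrees is smeared over several rows and scores lower. Searching trial angles for the sharpest profile is the idea behind one family of deskew algorithms; production software works on real page images and usually rotates rather than shears.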
The following are industry accepted steps used to increase OCR accuracy:
Deskew – A software process that uses various algorithms to identify the text orientation and attempt to align the image squarely, at a perfect 90 degrees.
Noise Reduction – Also known as despeckling. A software process that removes small imperfections, spots, scratches, blotches and random marks from within the white area of a digital image. Removing these imperfections will reduce OCR engine interference and reduce “false positive” reads.
Dilation/Erosion – Text quality is the key to OCR accuracy. These filters can smooth the edges of text by removing pixels that represent rough edges, or add pixels to fill in missing data within a character.
Line Removal – Speciality software can provide the functionality to remove lines from an image. Removing lines reduces OCR interference.
Red/Blue/Green Dropout – Using the proper settings, color scanners will “not capture” red, blue or green data within an image. Pre-printed forms often have their boxes and response areas printed in red, blue or green. This is done on purpose so that the box lines and response-area outlines are not captured during scanning, and there is thus less interference with the OCR engine.
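Three of the steps above (noise reduction, dilation/erosion and color dropout) can be sketched on tiny images in plain Python. This is a simplified illustration under stated assumptions, not any vendor's implementation: binary images are lists of 0/1 rows with 1 meaning ink, the function names and the `min_size` and `margin` parameters are hypothetical, and the dropout is done in software after capture, whereas the text above describes it happening in the scanner itself. Deskew and line removal are not covered here.

```python
def despeckle(image, min_size=3):
    """Erase 4-connected groups of ink pixels smaller than min_size."""
    height, width = len(image), len(image[0])
    cleaned = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            if not image[y][x] or seen[y][x]:
                continue
            # Flood-fill to collect one connected component.
            seen[y][x] = True
            stack, component = [(y, x)], []
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            if len(component) < min_size:
                for cy, cx in component:
                    cleaned[cy][cx] = 0
    return cleaned

def _window(image, y, x):
    """Values in the 3x3 window around (y, x); off-image cells count as 0."""
    height, width = len(image), len(image[0])
    return [image[r][c] if 0 <= r < height and 0 <= c < width else 0
            for r in (y - 1, y, y + 1) for c in (x - 1, x, x + 1)]

def dilate(image):
    """Add ink wherever any pixel in the 3x3 window is ink."""
    return [[1 if any(_window(image, y, x)) else 0
             for x in range(len(image[0]))] for y in range(len(image))]

def erode(image):
    """Keep ink only where the whole 3x3 window is ink."""
    return [[1 if all(_window(image, y, x)) else 0
             for x in range(len(image[0]))] for y in range(len(image))]

def drop_color(rgb_image, channel=0, margin=60):
    """Whiten pixels dominated by one channel (0 = red, 1 = green, 2 = blue)."""
    white = (255, 255, 255)
    result = []
    for row in rgb_image:
        new_row = []
        for pixel in row:
            others = [v for i, v in enumerate(pixel) if i != channel]
            dominated = pixel[channel] - max(others) >= margin
            new_row.append(white if dominated else pixel)
        result.append(new_row)
    return result

# A stroke with a one-pixel gap: dilating and then eroding closes the gap.
broken_stroke = [[0, 0, 0, 0, 0, 0, 0],
                 [0, 1, 1, 0, 1, 1, 0],
                 [0, 0, 0, 0, 0, 0, 0]]
print(erode(dilate(broken_stroke))[1])
```

Dilating and then eroding fills small gaps inside characters; running the two filters in the opposite order trims small bumps from rough edges instead. The thresholds here are arbitrary, and on real scans they would need tuning so that periods, commas and the dots on letters such as "i" are not erased along with the specks.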