SRIIA TECHNOLOGIES
SCAN STORE RETRIEVE INDEX INTEGRATE ARCHIVE
Seven Steps Towards A Successful Document Conversion Project
The service of converting paper or microfilm documents to digital format is a commodity in the document conversion world. It seems that anyone can become a service bureau with an inexpensive scanner and rudimentary capture software. The problem is there is really so much more to scanning than meets the eye – and this doesn’t become apparent until you have paid someone to scan a million of your documents just to discover you can only access about 750,000 of them within your document management software. Oh, and this realization happens about a year after you have signed off on the project.
How will you ever know if the bureau actually scanned 100% of your images? How will you know if they delivered 100% of them to you? I was once part of a project where we had partnered with a service bureau to scan land records books from a major US county. During the process, the partner delivered 50,000 medical records images by mistake. Talk about a disaster – this is one of the worst I have ever experienced. Billing throughout the remainder of the project was a constant struggle. Delivery details, such as which images had been delivered and how many, were never accurate. To help ensure a successful backfile project, include some type of pre-project checklist.
The following is a suggested minimum for this checklist:
- Pre-scan inventory
- Pilot process to establish image quality standards
- Indexing nomenclature and detail
- Error rating process (by image, record, index…)
- Batch delivery schedule including durations and volumes
- Reconciliation methodology to original inventory
- Review and error reporting process
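As an illustration of the reconciliation step, here is a minimal sketch in Python. It assumes you have a pre-scan inventory of expected page counts per record and a delivery manifest listing each delivered image; the record IDs, file names and the `reconcile` function are hypothetical, not part of any vendor's tooling.

```python
from collections import Counter

def reconcile(inventory, delivered):
    """Compare a pre-scan inventory against a delivery manifest.

    inventory: dict mapping record ID -> expected page count
    delivered: iterable of (record ID, image file name) pairs
    """
    delivered_counts = Counter(record_id for record_id, _ in delivered)
    # Records we expected but never received
    missing = sorted(r for r in inventory if r not in delivered_counts)
    # Records we received but never sent out (e.g. another client's images)
    unexpected = sorted(r for r in delivered_counts if r not in inventory)
    # Records where the image count differs: (expected, delivered)
    count_mismatch = {
        r: (inventory[r], delivered_counts[r])
        for r in inventory
        if r in delivered_counts and delivered_counts[r] != inventory[r]
    }
    return {"missing": missing, "unexpected": unexpected,
            "count_mismatch": count_mismatch}

# Hypothetical sample data
inventory = {"BOOK-001": 3, "BOOK-002": 2, "BOOK-003": 1}
delivered = [
    ("BOOK-001", "BOOK-001_0001.tif"),
    ("BOOK-001", "BOOK-001_0002.tif"),
    ("BOOK-001", "BOOK-001_0003.tif"),
    ("BOOK-002", "BOOK-002_0001.tif"),
    ("MED-777", "MED-777_0001.tif"),
]
report = reconcile(inventory, delivered)
print(report)
```

Running a check like this on every batch, rather than once at the end, makes it far easier to catch a wrong or short delivery while the originals are still at hand.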
10 Questions to Ask Before Starting Your Backfile Scanning Project
Backfile Scanning: Prior to beginning any type of backfile scanning project, you must answer one question: “Ultimately, how do I need to use these digital images?” If you only need copies of your images on USB drives or DVDs, then that is all you need to ask for. But if you actually intend to access these documents in their digital format within your document or content management system, then you will need to establish a way to ensure you are receiving what you are expecting.
Good questions to ask your conversion vendor:
- How are we going to look-up these records digitally?
- Image quality is important to us. Must we pay extra for image enhancement?
- Will we be able to do a full-text search within these records?
- How much space will these images take up on our servers?
- What image format do we need?
- Do we need additional software to search for and view the digital records or can we use what we have?
- How will the accuracy be calculated for this project?
- How long will the project take?
- How will we be allowed to review the work?
- What is your guarantee policy?
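For the storage question, a rough upper bound can be worked out before talking to the vendor. The sketch below computes the uncompressed size of a page from its dimensions, resolution and bit depth, then divides by a compression ratio. The ratio of 10 used here is purely an assumption for illustration; real ratios depend on the image format and the documents themselves and should be measured during the pilot.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw size of one scanned page, before any compression."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8

def estimate_storage_gb(pages, width_in, height_in, dpi,
                        bits_per_pixel, compression_ratio):
    """Estimated total storage in gigabytes (1 GB = 1e9 bytes).

    compression_ratio is an assumption supplied by the caller.
    """
    raw = uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel)
    return pages * raw / compression_ratio / 1e9

# One letter-size page at 300 dpi, black and white (1 bit per pixel)
page_bytes = uncompressed_bytes(8.5, 11, 300, 1)

# One million such pages with an assumed 10:1 compression ratio
total_gb = estimate_storage_gb(1_000_000, 8.5, 11, 300, 1, 10)
print(page_bytes, round(total_gb, 1))
```

The same page scanned in 24-bit color is 24 times larger before compression, which is why the image format question and the storage question belong together.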
Preparing for OCR
Optical Character Recognition – Per Wikipedia, OCR is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. Applied to the appropriate document type and format, OCR processing is extremely useful and can save both internal resources and capital expenditure (CAPEX) while producing a higher-quality product than manual key entry.
Unfortunately, OCR is not for every project. Skewed text, rough text, heavy noise, lines and other foreign data interfering with a clear and uninterrupted view and scan of text will reduce accuracy.
OCR engines are very linear processes – they look horizontally and vertically across digital images. Any skew away from a square, 90-degree orientation will negatively affect any OCR engine. Additionally, OCR engines are not magic but very pragmatic. Images must contain familiar text resembling existing alphabetical characters. Anything that distorts standard text will reduce accuracy.
The following are industry accepted steps used to increase OCR accuracy:
Deskew – A software process that uses various algorithms to identify the text orientation and attempt to align the image to a perfect 90-degree orientation.
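To give a feel for how skew detection can work, here is a minimal sketch of one simple and widely used idea, the projection profile. It is not necessarily what any particular capture product does. The sketch tries a range of candidate angles, counts how many black pixels land on each row after rotating by that angle, and keeps the angle that produces the sharpest rows. The function name, the angle range and the synthetic test page are all assumptions for illustration.

```python
import math

def estimate_skew_degrees(black_pixels, max_angle=5.0, step=0.5):
    """Estimate the slope of text lines, in degrees, in image coordinates.

    black_pixels: list of (x, y) coordinates of black pixels.
    """
    best_angle, best_score = 0.0, -1
    steps = int(round(max_angle / step))
    for i in range(-steps, steps + 1):
        angle = i * step
        t = math.radians(angle)
        cos_t, sin_t = math.cos(t), math.sin(t)
        profile = {}
        for x, y in black_pixels:
            # Row this pixel would land on if the page were rotated back
            row = round(y * cos_t - x * sin_t)
            profile[row] = profile.get(row, 0) + 1
        # Straight text lines concentrate pixels in few rows,
        # which makes the sum of squared row counts large
        score = sum(count * count for count in profile.values())
        if score > best_score:
            best_angle, best_score = angle, score
    return best_angle

# Synthetic page: five thin text lines, each 400 pixels wide, skewed 2 degrees
slope = math.tan(math.radians(2.0))
pixels = [(x, round(y0 + x * slope))
          for y0 in range(20, 120, 20) for x in range(400)]
estimated = estimate_skew_degrees(pixels)
print(estimated)
```

Once the angle is known, the image is rotated by the opposite amount before being handed to the OCR engine.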
Noise Reduction – Also known as despeckling, this is a software process that removes small imperfections, spots, scratches, blotches and random marks from the white areas of a digital image. Removing these imperfections will reduce OCR engine interference and reduce “false positive” reads.
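One simple way to despeckle a black-and-white image is to find each connected group of black pixels and erase the groups that are too small to be part of a character. The sketch below shows that idea on a tiny image stored as lists of 0 (white) and 1 (black). The `min_size` threshold is an assumption and would need tuning against real pages, since a value set too high will also erase periods and the dots on letters such as i.

```python
def despeckle(image, min_size=3):
    """Return a copy of image with small black specks removed.

    image: list of rows, each a list of 0 (white) or 1 (black).
    Groups of touching black pixels smaller than min_size are erased.
    """
    height, width = len(image), len(image[0])
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]
    for start_y in range(height):
        for start_x in range(width):
            if image[start_y][start_x] != 1 or seen[start_y][start_x]:
                continue
            # Flood fill to collect one group of touching black pixels
            stack = [(start_x, start_y)]
            seen[start_y][start_x] = True
            component = []
            while stack:
                x, y = stack.pop()
                component.append((x, y))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and image[ny][nx] == 1 and not seen[ny][nx]):
                            seen[ny][nx] = True
                            stack.append((nx, ny))
            if len(component) < min_size:
                for x, y in component:
                    result[y][x] = 0
    return result

page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0, 0],
    [0, 1, 1, 1, 0, 1],   # the lone 1 at the right edge is a speck
    [0, 0, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],   # so is this one
]
cleaned = despeckle(page)
for row in cleaned:
    print(row)
```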
Dilation/Erosion – Text quality is the key to OCR accuracy. These filters can smooth the edges of text by removing pixels that represent rough edges (erosion) or adding pixels to fill in missing data within a character (dilation).
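The two filters are usually applied as a pair. The sketch below implements both with a 3x3 neighborhood on a small image of 0 (white) and 1 (black), then shows the two common combinations: dilating and then eroding fills a small hole inside a character, while eroding and then dilating removes a stray bump on its edge. The function names and the sample shapes are illustrative assumptions; pixels outside the image are simply ignored.

```python
def _morph(image, require_all):
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            # The pixel itself plus its in-bounds neighbors (3x3 window)
            window = [image[ny][nx]
                      for ny in range(y - 1, y + 2)
                      for nx in range(x - 1, x + 2)
                      if 0 <= ny < height and 0 <= nx < width]
            out[y][x] = int(all(window)) if require_all else int(any(window))
    return out

def dilate(image):
    """A pixel becomes black if any pixel in its window is black."""
    return _morph(image, require_all=False)

def erode(image):
    """A pixel stays black only if every pixel in its window is black."""
    return _morph(image, require_all=True)

def close_gaps(image):
    """Dilate, then erode: fills small holes."""
    return erode(dilate(image))

def open_smooth(image):
    """Erode, then dilate: removes small bumps and spurs."""
    return dilate(erode(image))

def make_block():
    """A solid 5x5 black square centered in a 9x9 white image."""
    return [[1 if 2 <= x <= 6 and 2 <= y <= 6 else 0 for x in range(9)]
            for y in range(9)]

solid = make_block()
with_hole = make_block()
with_hole[4][4] = 0          # missing pixel inside the shape
with_spur = make_block()
with_spur[4][7] = 1          # rough bump on the right edge

print(close_gaps(with_hole) == solid, open_smooth(with_spur) == solid)
```

On real text these filters are a trade-off: too much dilation merges neighboring characters and too much erosion breaks thin strokes.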
Line Removal – Specialty software can remove lines from an image. Removing lines reduces OCR interference.
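A very basic version of line removal looks for unbroken horizontal runs of black pixels that are far longer than any character is wide and erases them. The sketch below does only that. The `min_length` value is an assumption, and a run this simple will also cut through any text that touches the line, which dedicated line-removal software typically repairs afterwards.

```python
def remove_horizontal_lines(image, min_length):
    """Erase horizontal runs of black pixels at least min_length long.

    image: list of rows, each a list of 0 (white) or 1 (black).
    """
    result = [row[:] for row in image]
    for y, row in enumerate(image):
        x = 0
        while x < len(row):
            if row[x] == 1:
                start = x
                # Walk to the end of this run of black pixels
                while x < len(row) and row[x] == 1:
                    x += 1
                if x - start >= min_length:
                    for i in range(start, x):
                        result[y][i] = 0
            else:
                x += 1
    return result

form = [
    [0, 1, 1, 0, 0, 1, 0, 0, 0, 0],   # short runs: text strokes
    [1, 1, 1, 1, 1, 1, 1, 1, 1, 1],   # long run: a ruled line
    [0, 0, 1, 1, 1, 0, 0, 1, 0, 0],
]
without_lines = remove_horizontal_lines(form, min_length=8)
for row in without_lines:
    print(row)
```

Vertical lines can be handled the same way by running the function on the transposed image.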
Red/Blue/Green Dropout – Using the proper settings, color scanners will “not capture” red, blue or green data within an image. Many times, pre-printed forms have the boxes and response areas printed in red, blue or green. This is done purposefully so that, during the scanning process, the box lines and response areas are not captured and there is thus less interference with the OCR engine.
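Dropout as described above happens in the scanner. When a form has already been scanned in full color, a similar effect can be approximated in software; the sketch below is that approximation, not the scanner feature itself. It turns a color image into black and white, treating any pixel where the chosen color clearly dominates as background. The `dominance` and `ink_threshold` values are assumptions that would need tuning for a real form.

```python
def drop_out_color(pixels, channel, dominance=60, ink_threshold=128):
    """Convert an RGB image to black and white, dropping one form color.

    pixels: list of rows, each a list of (r, g, b) tuples, 0 to 255.
    channel: "red", "green" or "blue", the color the form is printed in.
    Returns rows of 0 (white) or 1 (black).
    """
    index = {"red": 0, "green": 1, "blue": 2}[channel]
    output = []
    for row in pixels:
        out_row = []
        for rgb in row:
            others = [value for i, value in enumerate(rgb) if i != index]
            # The form color clearly dominates: treat as background
            is_dropout = rgb[index] - max(others) >= dominance
            brightness = sum(rgb) / 3
            is_ink = brightness < ink_threshold and not is_dropout
            out_row.append(1 if is_ink else 0)
        output.append(out_row)
    return output

# White paper, red box line, black handwriting, blue ink
scan = [[(255, 255, 255), (220, 30, 30), (20, 20, 20), (30, 30, 200)]]
print(drop_out_color(scan, "red"))
print(drop_out_color(scan, "blue"))
```

Note that dropping blue would also erase anything written in blue ink, which is one reason forms designed for dropout usually ask respondents to write in black.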