Document Scanning and the Three Legged Stool

Have you been to a “Scanning Seminar” recently? You probably walked away believing that document scanning was the most important part of any “conversion project”.

But then you visited with a consultant who greatly undervalued the importance of the scanning with a dismissive statement such as, “anybody can scan paper (or microfilm)”. He or she then explained that the crucial element of a document scanning project is the consulting and professional services to implement your project.

But wait; now you visit with a software salesperson. You are informed that buying the proper software will ensure a successful project no matter what type of scanning and/or professional services you employ.

Referencing the “Three Legged Stool” analogy, we can see that if any one of these three elements fails to deliver, you will fall right upon your _ _ _. Experience tells us that each of these elements is equally important. Each depends upon the others to ensure a successful project:

Scanning Service: Proven Quality Control and Project Tracking methodologies, along with proper hardware and software, are crucial to the success of your project. The document capture configuration depends entirely upon the type and volume of your source documents. The image clean-up functionality of the software is the most important factor in this selection. If you intend to use OCR or automated forms processing, image quality is key to success.

Professional Services: These services should set the table for the project. Elements such as a pre-scan inventory, importing scanned images into your new software, pilot projects, project milestones, determining indexing nomenclature, network requirements, training and the other pieces of the overall project implementation are the nexus between the software and the scanning.

Enterprise Document/Content Management Software: Of course software is always important. Your software selection must meet and exceed your current needs and provide scalability for the future. It is a cliché, but true: by working with both a good consultant and a good software vendor, you will get more of a 360-degree view of what you will get out of your new software. Initially, you should access the scanned images in your new software system in much the same way you would look for these records in a standard file cabinet. Moving too many steps past this may lead to user confusion, a feeling of intimidation and a lack of user buy-in.

Preparing for OCR

Optical Character Recognition – Per Wikipedia, OCR is the mechanical or electronic translation of scanned images of handwritten, typewritten or printed text into machine-encoded text. Applied to the appropriate document type and format, OCR processing is extremely useful: it can save both internal resources and CAPEX while producing a higher quality product than hand-key entry.

Unfortunately, OCR is not for every project. Skewed text, rough text, heavy noise, lines and other foreign data that interfere with a clear, uninterrupted view of the text will reduce accuracy.

OCR engines are very linear processes: they look horizontally and vertically across digital images. Any skew away from a 90 degree orientation will negatively affect any OCR engine. Additionally, OCR engines are not magic; they are very pragmatic. Images must contain familiar text resembling existing alphabetical characters. Anything that distorts standard text will reduce accuracy.

The following are industry-accepted steps used to increase OCR accuracy:

Deskew – A software process that uses various algorithms to identify the text orientation and then attempts to align the image to a perfect 90 degrees.

Noise Reduction – Also known as despeckling. A software process that removes small imperfections, spots, scratches, blotches and random marks from the white areas of a digital image. Removing these imperfections reduces OCR engine interference and reduces “false positive” reads.

Dilation/Erosion – Text quality is the key to OCR accuracy. These filters smooth the edges of text: erosion removes pixels that represent rough edges, and dilation adds pixels to fill in missing data within a character.

Line Removal – Specialty software can provide the functionality to remove lines from an image. Removing lines reduces OCR interference.

Red/Blue/Green Dropout – Using the proper settings, color scanners will “not capture” red, blue or green data within an image. Pre-printed forms often have the boxes and response areas printed in red, blue or green. This is done on purpose, so that during the scanning process the box lines and response areas are not captured and there is less interference with the OCR engine.
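To make the clean-up steps above more concrete, here is a minimal sketch of four of them in plain Python. It is an illustration under stated assumptions, not how any particular scanning product works: the page is assumed to be a small grid where 1 is ink and 0 is background, the function names are hypothetical, and the thresholds (margin, min_size, min_run) are arbitrary values chosen for the example. Color dropout is shown here as a software filter on RGB pixels, whereas the text describes it as a scanner setting. Deskew is left out because rotating an image needs resampling that is better left to an imaging library.

```python
from collections import deque


def dropout(rgb, margin=60):
    """Color dropout: strongly tinted pixels become background (0), dark neutral pixels become ink (1)."""
    out = []
    for row in rgb:
        out_row = []
        for r, g, b in row:
            tinted = max(r, g, b) - min(r, g, b) > margin  # e.g. a red form line
            dark = (r + g + b) / 3 < 128
            out_row.append(1 if dark and not tinted else 0)
        out.append(out_row)
    return out


def despeckle(img, min_size=3):
    """Noise reduction: erase 8-connected ink blobs smaller than min_size pixels."""
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    seen = [[False] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if img[y][x] != 1 or seen[y][x]:
                continue
            blob, queue = [], deque([(y, x)])
            seen[y][x] = True
            while queue:  # breadth-first flood fill collects one blob
                cy, cx = queue.popleft()
                blob.append((cy, cx))
                for ny in range(max(cy - 1, 0), min(cy + 2, h)):
                    for nx in range(max(cx - 1, 0), min(cx + 2, w)):
                        if img[ny][nx] == 1 and not seen[ny][nx]:
                            seen[ny][nx] = True
                            queue.append((ny, nx))
            if len(blob) < min_size:
                for by, bx in blob:
                    out[by][bx] = 0
    return out


def remove_horizontal_lines(img, min_run=8):
    """Line removal: erase horizontal ink runs that are at least min_run pixels long."""
    out = [row[:] for row in img]
    for y, row in enumerate(img):
        x = 0
        while x < len(row):
            if row[x] == 1:
                start = x
                while x < len(row) and row[x] == 1:
                    x += 1
                if x - start >= min_run:
                    for i in range(start, x):
                        out[y][i] = 0
            else:
                x += 1
    return out


def _morph(img, combine):
    """Apply combine (max or min) over each pixel's 3x3 neighbourhood, clipped at the page edge."""
    h, w = len(img), len(img[0])
    return [[combine(img[ny][nx]
                     for ny in range(max(y - 1, 0), min(y + 2, h))
                     for nx in range(max(x - 1, 0), min(x + 2, w)))
             for x in range(w)]
            for y in range(h)]


def dilate(img):
    """Dilation: a pixel becomes ink if any neighbour is ink (fills small gaps)."""
    return _morph(img, max)


def erode(img):
    """Erosion: a pixel stays ink only if every neighbour is ink (trims rough edges)."""
    return _morph(img, min)
```

A typical use would chain the steps, for example erode(dilate(despeckle(page))) to drop specks and then close small holes inside characters. Note that this simple line removal also erases any character strokes that sit on the line, which is one reason the text refers to specialty software for that step.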

Microfilm vs. Digital for Archival Storage

The discussion on the viability of using microfilm or digital for long-term archiving rears its ugly head on a regular basis in courtrooms, boardrooms and offices of government and private institutions alike.

From a legal perspective, microfilm is a supported format. The statutes relevant to the Best Evidence Rule (Federal Business Records Act, Uniform Photographic Copies of Business and Public Records as Evidence Act) permit the admissibility of any record which has been “kept in the regular course of business and copied or reproduced by … any photographic, photostatic, microfilm, microcard, miniature photographic or other process which accurately reproduces or forms a durable medium for reproducing the original.” Accordingly, the reproduction is as admissible as the original. The process of recording information optically clearly falls within the law’s language of “other process which accurately reproduces or forms a durable medium for reproducing the original.”

All US states have published document retention and library standards, and micrographics adheres to just about every state’s standards for long-term document archiving. The state archives of New York, California, Texas, Indiana, Arizona, Louisiana and Florida, just to name a few, promote microfilm as a viable and practical medium for preserving the state’s history.

Now, ask yourself this question. Let’s say you have been left a trust valued at $100,000,000.00. You must wait 20 years to have access to these funds. The money will be accessed via a 1,000-character code found on 200 separate documents (files). You will be provided with these documents on a USB drive, a CD, a DVD or a roll of microfilm. Which media would you choose? Backwards compatibility will always be a serious concern. What platform created the electronic copy? Was it Windows? Will this format be supported in 20 years? Will you need to do some type of conversion to your 20-year-old data to have access to your code? If you lose one single image, you will not be able to access these funds. Now, instead of a code for a trust, look at those documents as proof of purchase/ownership, human resource records, certification documents, medical research history, etc.

Advantages of Microfilm for Long Term Archiving:

  1. Properly filmed and processed microfilm on a polyester base has an anticipated life expectancy (LE 500) of 500 years.
  2. All you need to view film is a light source.
  3. Individual pages cannot be pulled or lost.
  4. Original rolls cannot be edited.
  5. A single roll of 16mm microfilm can hold over 8,000 images – that is almost an entire 3-drawer file cabinet of documents.
