OCR (optical character recognition) software “reads” the text in a digital image of a document and produces a searchable text file, which makes the document easier to navigate. In the case of the National Digital Newspaper Program (NDNP), OCR software has to read text from microfilm images of newspapers that are 100 or more years old, which is far from ideal source material. The software lets researchers search millions of pages of digital newspapers, but its limitations mean that researchers need to combine digital searching with careful manual browsing to get the most out of their work.
Some OCR Limitations
- Inconsistent/mixed fonts
- Fading text
- Missing characters
- Page bleed-through (in some cases, ink from the reverse side of the page “bleeds” through, making it difficult for software to distinguish text)
- Column confusion
- Poor-quality originals — stains, tears, or faded print
- Tight gutters/missing text
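To illustrate why these limitations matter for searching, here is a minimal sketch in Python. It assumes a made-up line of OCR output in which two characters were misread, and it uses the standard-library `difflib` module to look for approximate matches. The helper name `fuzzy_find`, the sample text, and the 0.75 cutoff are all hypothetical choices for illustration; this is not how any NDNP search tool is implemented.

```python
import difflib


def fuzzy_find(term, ocr_text, cutoff=0.75):
    """Return words in ocr_text that approximately match term.

    Hypothetical helper for illustration only. The cutoff is an
    arbitrary similarity threshold between 0 and 1.
    """
    # Split on whitespace, strip surrounding punctuation, lowercase.
    words = [w.strip(".,;:!?\"'()").lower() for w in ocr_text.split()]
    words = [w for w in words if w]
    return difflib.get_close_matches(term.lower(), words, n=5, cutoff=cutoff)


# Invented OCR output: "e" misread as "3", and "m" misread as "rn".
ocr_text = "The Gov3rnor arrived by rnorning train and addressed the crowd."

# An exact search misses the word because of the misread character.
exact_hit = "governor" in ocr_text.lower()

# An approximate search still finds the garbled word.
fuzzy_hits = fuzzy_find("governor", ocr_text)

print(exact_hit)   # False
print(fuzzy_hits)  # ['gov3rnor']
```

Approximate matching like this recovers some misread words, but it cannot help when text is missing entirely or when columns have been merged, which is why manual browsing of the page images remains part of the process.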
NDNP — Digital Microfilm and OCR
To keep text searching as consistent as possible across the program, the NDNP has strict guidelines for member institutions to follow. Because many of the limitations are outside an institution’s control, the Library of Congress (LC) and the National Endowment for the Humanities (NEH) are flexible and allow exceptions so that the program is as inclusive as possible. As frustrating as bad OCR text can be, it is still better than nothing at all!
Visit the NDNP website for more information about its guidelines for digital microfilm and OCR.