National Library of Medicine, HTTP://www.nlm.nih.gov Communications Engineering Branch Title Lister Hill National Center for Biomedical Communications, HTTP://www.lhncbc.nlm.nih.gov/
 



Automating the production of bibliographic records for MEDLINE

4.3 OCR system evaluation and selection

A key step in the design of MARS is the selection of the OCR system. The selection was based on a performance comparison of six commercial packages: four single-engine systems and two multiple-engine voting systems, Maxsoft-Ocron and Prime Recognition. Testing was done on about 20,000 characters from 15 page images (scanned at 300 dpi) drawn from five biomedical journals indexed in MEDLINE. All five journals were selected because they appeared likely to cause conversion problems, e.g., because of tightly packed text and small fonts.

The testing focused primarily on the number of error blocks (a "block" may be either a character or a word, depending on the OCR package), counted against three error criteria: (1) highlighted correct blocks, i.e., blocks that the OCR system highlights but whose contents are correct (false alarms); (2) highlighted error blocks, i.e., blocks that are highlighted and contain incorrect characters requiring correction (correctly detected errors); and (3) undetected error blocks, i.e., blocks that contain incorrect characters but are not highlighted (undetected errors). The data below show Prime Recognition superior with respect to all three error criteria.

OCR system          False alarms   Correctly detected errors   Undetected errors
Prime Recognition   168            42                          6
Wordscan            339            72                          24
TextBridge          70             70                          37
Omnipage            285            66                          23
Cuneiform           259            53                          35
Maxsoft-Ocron       n/a            n/a                         33
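
The three error criteria can be sketched as a small classification routine. This is only an illustration: the function names, the (highlighted, correct) tuple representation of a block, and the sample data are assumptions, not part of MARS or of any OCR package. The `flagged_share` helper is likewise our own derived figure (the share of erroneous blocks that the OCR system highlighted), not a metric reported in the evaluation.

```python
from collections import Counter
from typing import Iterable, Optional, Tuple


def classify_block(highlighted: bool, correct: bool) -> Optional[str]:
    """Map one OCR block to one of the three error criteria, or None."""
    if highlighted and correct:
        return "false_alarm"        # criterion 1: highlighted, but correct
    if highlighted and not correct:
        return "detected_error"     # criterion 2: highlighted and wrong
    if not highlighted and not correct:
        return "undetected_error"   # criterion 3: wrong, but not highlighted
    return None                     # correct and not highlighted: not counted


def tally(blocks: Iterable[Tuple[bool, bool]]) -> Counter:
    """Count blocks per criterion; blocks are (highlighted, correct) pairs."""
    counts: Counter = Counter()
    for highlighted, correct in blocks:
        label = classify_block(highlighted, correct)
        if label is not None:
            counts[label] += 1
    return counts


def flagged_share(detected: int, undetected: int) -> float:
    """Share of erroneous blocks that were highlighted (illustrative only)."""
    return detected / (detected + undetected)


# Made-up sample blocks, for illustration only.
sample = [(True, True), (True, False), (False, False), (False, True), (True, False)]
counts = tally(sample)
print(dict(counts))

# Using the Prime Recognition row above: 42 detected, 6 undetected.
print(flagged_share(42, 6))
```

With the Prime Recognition figures from the table, 42 of the 48 erroneous blocks were highlighted, so `flagged_share(42, 6)` evaluates to 0.875.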

Other evaluation factors important in a practical production system included: (1) the capability to proofread against a displayed bitmap, (2) a medical dictionary interface, and (3) accessibility from our application software. Only Prime Recognition fully met all the criteria; the limitations of the other packages are noted below.

OCR system          Proofing with bitmap   Medical dictionary interface   Application accessibility
Prime Recognition   Yes                    Yes                            Yes
Wordscan            small bitmap           13 char. limit                 Yes
TextBridge          need Word/WP           10K word limit                 Yes
Omnipage            Yes                    5K word limit                  23
Cuneiform           Yes                    24 char. limit                 No
Maxsoft-Ocron       No                     Yes                            Yes

In addition to character codes, the Prime Recognition OCR output provides rich secondary data, e.g., character coordinates, confidence levels, font size and font attributes, among others. Much of this data is exploited by downstream processes, as described later in this report.
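
As a rough picture of how such secondary data might be carried to downstream processes, the sketch below defines a per-character record and a filter for low-confidence characters. The record layout, the field names, and the integer confidence scale are all assumptions made for illustration; they do not reproduce the Prime Recognition output format.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class OcrChar:
    """Hypothetical per-character record; not the vendor's actual format."""
    code: str                  # recognized character
    left: int                  # bounding box, in pixels (assumed units)
    top: int
    right: int
    bottom: int
    confidence: int            # assumed scale: higher means more confident
    font_size: float           # assumed to be in points
    attributes: FrozenSet[str] = frozenset()   # e.g. {"bold", "italic"}


def needs_review(chars: List[OcrChar], min_confidence: int) -> List[OcrChar]:
    """Return the characters whose confidence falls below the threshold."""
    return [c for c in chars if c.confidence < min_confidence]


# Made-up characters, for illustration only.
chars = [
    OcrChar("M", 10, 10, 22, 28, confidence=9, font_size=10.0,
            attributes=frozenset({"bold"})),
    OcrChar("l", 24, 10, 28, 28, confidence=3, font_size=10.0),
]
print([c.code for c in needs_review(chars, min_confidence=5)])
```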

To incorporate the OCR software into the MARS system, we developed a module, the Prime Recognition OCR Daemon (PROD), which consists of a C++ class that wraps the Prime Recognition C API and also communicates with the MARS database. In addition, it incorporates two other modules: (a) the Bounding Box Corrector, a set of software library routines developed in cooperation with scientists at MathSoft, Inc.; and (b) an independent OCR package from ScanSoft. The first is needed to correct the character coordinates from the Prime Recognition OCR system, which improves the reliability of our in-house zone correction algorithm (Section 5). The second provides more reliable initial segmentation than the engines in the Prime Recognition package do for certain journal layout types. For these journals, identified by ISSN in the database, the zones from ScanSoft are used as the starting point for our zone correction algorithm.
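
The flow described above can be outlined as follows. PROD itself is a C++ class; this Python sketch only illustrates the control flow, and every name in it (`process_page`, the callables passed in, the placeholder ISSN) is hypothetical. The OCR engines and the Bounding Box Corrector are represented by stand-in functions, not by their real interfaces.

```python
from typing import Callable, List, Set, Tuple


def process_page(
    image: str,
    issn: str,
    scansoft_issns: Set[str],
    run_prime: Callable[[str], Tuple[List[str], List[str]]],
    run_scansoft: Callable[[str], List[str]],
    correct_boxes: Callable[[List[str]], List[str]],
) -> Tuple[List[str], List[str]]:
    """Return (characters, initial zones) for one page image."""
    # Primary OCR pass: characters plus the engine's own segmentation.
    chars, prime_zones = run_prime(image)
    # Correct character coordinates before zone correction uses them.
    chars = correct_boxes(chars)
    # For journals flagged by ISSN, start from the ScanSoft zones instead.
    zones = run_scansoft(image) if issn in scansoft_issns else prime_zones
    return chars, zones


# Stand-ins for the real components, for illustration only.
def fake_prime(image: str) -> Tuple[List[str], List[str]]:
    return ["a", "b"], ["prime-zone"]


def fake_scansoft(image: str) -> List[str]:
    return ["scansoft-zone"]


def fake_corrector(chars: List[str]) -> List[str]:
    return [c.upper() for c in chars]


flagged = {"0000-0000"}  # placeholder ISSN, not a real journal
print(process_page("page.tif", "0000-0000", flagged,
                   fake_prime, fake_scansoft, fake_corrector))
print(process_page("page.tif", "1111-1111", flagged,
                   fake_prime, fake_scansoft, fake_corrector))
```

Passing the components in as callables keeps the zone-source decision separate from the engines themselves, which makes the per-ISSN rule easy to check in isolation.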

PROD is designed for flexibility. For example, it can be configured for the number of CPUs active in the OCR server, the number of OCR engines, whether character coordinates are corrected, and whether the duration of OCR processing is recorded. PROD may also be used to poll the database for journal issues that are ready for OCR, and to start or stop processing.
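
A minimal sketch of these settings and of one polling step is given below. The option names, the in-memory queue standing in for the MARS database, and the `poll_once` function are assumptions made for illustration; the actual PROD configuration mechanism is not described in this section.

```python
import time
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, Optional, Tuple


@dataclass
class ProdConfig:
    """Hypothetical settings mirroring the options named in the text."""
    cpu_count: int = 1                 # CPUs active in the OCR server
    engine_count: int = 1              # number of OCR engines to use
    correct_coordinates: bool = True   # apply character coordinate correction
    record_timing: bool = False        # record OCR processing duration


def poll_once(
    ready: Deque[str],
    process: Callable[[str, ProdConfig], None],
    config: ProdConfig,
    running: bool = True,
) -> Optional[Tuple[str, Optional[float]]]:
    """Process at most one ready journal issue; return (issue, seconds)."""
    if not running or not ready:
        return None                    # stopped, or nothing ready for OCR
    issue = ready.popleft()
    start = time.perf_counter()
    process(issue, config)
    elapsed = time.perf_counter() - start if config.record_timing else None
    return issue, elapsed


# Made-up queue contents standing in for database rows.
done = []
queue = deque(["issue-1", "issue-2"])
cfg = ProdConfig(cpu_count=2, engine_count=3, record_timing=True)
result = poll_once(queue, lambda issue, c: done.append(issue), cfg)
print(result[0], done, list(queue))
```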




U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health | U.S. Dept. of Health and Human Services

URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_4.php
Last updated December 06, 2001
