Automating the production of bibliographic records for MEDLINE

4.3 OCR system evaluation and selection

A key step in the design of MARS is the selection of the OCR system. This selection was based on a performance comparison of six commercial packages: four single-engine systems and two multiple-engine voting systems, Maxsoft-Ocron and Prime Recognition. Testing was done on about 20,000 characters from 15 page images (scanned at 300 dpi) from five biomedical journals indexed in MEDLINE. All five journals were selected because they appeared likely to cause conversion problems, e.g., because of tightly packed text and small fonts. The testing focused primarily on the number of error blocks (a "block" may be either a character or a word, depending on the OCR package), counted under three error criteria: (1) highlighted correct blocks, i.e., blocks that the OCR system highlights but whose contents are correct (false alarms); (2) highlighted error blocks, i.e., blocks that are highlighted and whose contents include incorrect characters requiring correction (correctly detected errors); and (3) undetected blocks, i.e., blocks that contain incorrect characters but are not highlighted (undetected errors). The data below show Prime Recognition to be superior with respect to all three error criteria.
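The three error criteria can be made concrete with a small sketch. This is only an illustration, not the evaluation code used in the study: the `Block` record, its field names, and the `classify` and `tally` helpers are all hypothetical, and a block is assumed to count as incorrect whenever its recognized text differs from the ground truth.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """One OCR output block (a character or a word, depending on the package)."""
    recognized: str    # text produced by the OCR system
    truth: str         # ground-truth text for the same block (assumed available)
    highlighted: bool  # True if the OCR system flagged the block for review


def classify(block: Block) -> str:
    """Assign a block to one of the three error criteria, or 'ok'."""
    correct = block.recognized == block.truth
    if block.highlighted and correct:
        return "highlighted_correct"   # criterion 1: false alarm
    if block.highlighted:
        return "highlighted_error"     # criterion 2: correctly detected error
    if not correct:
        return "undetected_error"      # criterion 3: undetected error
    return "ok"                        # correct and not highlighted


def tally(blocks) -> Counter:
    """Count how many blocks fall under each category."""
    return Counter(classify(b) for b in blocks)


sample = [
    Block("protein", "protein", highlighted=True),   # false alarm
    Block("pr0tein", "protein", highlighted=True),   # detected error
    Block("protcin", "protein", highlighted=False),  # undetected error
    Block("kinase", "kinase", highlighted=False),    # no problem
]
counts = tally(sample)
```

Under this reading, a lower count is better for every criterion: false alarms waste operator time, detected errors still require correction, and undetected errors pass silently into the output.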
Other evaluation factors, important in a practical production system, included: (1) the capability to proofread against a displayed bitmap, (2) a medical dictionary interface, and (3) accessibility from our application software. Only Prime Recognition fully met all the criteria; the limitations of the other packages are noted below.
In addition to character codes, the Prime Recognition OCR provides rich secondary data in its output, e.g., character coordinates, confidence levels, font size, font attributes and many others. Much of this data is exploited by downstream processes, as described later in this report.

To incorporate the OCR software into the MARS system, we developed a module, the Prime Recognition OCR Daemon (PROD). It consists of a C++ class that acts as a wrapper for the Prime Recognition C API and also communicates with the MARS database. In addition, it incorporates two other modules: (a) the Bounding Box Corrector, a set of software library routines developed in cooperation with scientists at MathSoft, Inc.; and (b) an independent OCR package from ScanSoft. The first is needed to correct the character coordinates from the Prime Recognition OCR system, which improves the reliability of our in-house zone correction algorithm (Section 5). The second provides more reliable initial segmentation than the engines in the Prime Recognition package do for certain journal layout types. For these journals, identified by ISSN in the database, the zones from ScanSoft are used as the starting point for our zone correction algorithm.

PROD is designed for flexibility. For example, it can be configured for the number of CPUs active in the OCR server, the number of OCR engines, whether character coordinates are corrected, and whether the duration of OCR processing is recorded. PROD may also be used to poll the database for journal issues that are ready for OCR, and to begin or stop processing.
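The configurable behavior described for PROD can be sketched as follows. PROD itself is a C++ wrapper around the Prime Recognition C API; this Python sketch does not reproduce that API or the MARS database. Every name here (`ProdConfig`, `initial_zone_source`, `process_ready_issues`, the `run_ocr` callback) is hypothetical, the ISSNs are placeholders, and an in-memory list stands in for the database poll.

```python
import time
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Set


@dataclass
class ProdConfig:
    """Hypothetical settings mirroring the options listed in the text."""
    active_cpus: int = 1               # CPUs active in the OCR server
    ocr_engines: int = 1               # number of OCR engines to use
    correct_coordinates: bool = True   # apply the bounding-box correction step
    record_duration: bool = False      # record how long OCR processing takes


def initial_zone_source(issn: str, scansoft_issns: Set[str]) -> str:
    """Pick which package supplies the starting zones for zone correction."""
    return "ScanSoft" if issn in scansoft_issns else "PrimeRecognition"


def process_ready_issues(
    ready_issns: Iterable[str],
    config: ProdConfig,
    scansoft_issns: Set[str],
    run_ocr: Callable[[str, ProdConfig], None],
) -> List[dict]:
    """Run OCR on each ready issue; stands in for one database poll."""
    results = []
    for issn in ready_issns:
        start = time.perf_counter()
        run_ocr(issn, config)  # placeholder for the real OCR call
        elapsed: Optional[float] = None
        if config.record_duration:
            elapsed = time.perf_counter() - start
        results.append({
            "issn": issn,
            "zone_source": initial_zone_source(issn, scansoft_issns),
            "seconds": elapsed,
        })
    return results


# Usage with placeholder ISSNs and a no-op OCR callback.
config = ProdConfig(active_cpus=2, ocr_engines=3, record_duration=True)
special_layouts = {"0000-0001"}
report = process_ready_issues(
    ["0000-0001", "0000-0002"], config, special_layouts, lambda issn, cfg: None
)
```

Keeping the ISSN lookup in a separate function reflects the description that the choice of initial segmentation is driven by data in the database, not by logic hard-coded into the OCR wrapper.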
URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_4.php