National Library of Medicine (HTTP://www.nlm.nih.gov), Lister Hill National Center for Biomedical Communications (HTTP://www.lhncbc.nlm.nih.gov/), Communications Engineering Branch
 



Automating the production of bibliographic records for MEDLINE

12.3 Alternative method for text verification

The conventional approach to verifying the text output of an OCR system, or, as in the case of MARS, the output of a succession of automated processes, is to present the text in the same sequence as it appears on the printed page and to highlight the low-confidence characters in color within the words. Then, as in our reconcile workstation, the operator can "tab" quickly from one suspected character to the next and make the necessary corrections. This approach has some drawbacks. First, the operator must pick out each suspected character from a mass of surrounding correct text. Second, each error must be corrected as it is encountered, which breaks the rhythm of identifying incorrect characters.

We propose an alternative method that may improve operator productivity. Called Carpet Character Certification and Correction, or the "Carpet" method, it groups like characters (drawn from a number of pages or journal issues at the same time) and displays them together in a single window, as shown in Figure 12.3.1. Each character appears in its own "edit box." Only low-confidence A-Z and 0-9 characters of the same type would be displayed in a group. The example shows a set of characters in the edit boxes, mostly e's, some of which are misreadings of an s or an E, as the corresponding bitmapped images directly above the edit boxes show. Since context is important for recognizing poorly captured character shapes, the system must also display the image fragment (a word or phrase) in which the (presumably) incorrect character appears. Such context is particularly helpful in distinguishing letters and numbers that look alike, e.g., 1 and I, or 0 and O.
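The grouping step described above can be sketched in a few lines of Python. This is only an illustration, not the MARS implementation: the `OcrChar` record, the `group_low_confidence` function, and the 0.8 confidence threshold are all hypothetical, and lowercase letters are admitted on the assumption that the e's in the example imply them. The batch size of 9 matches the number of edit boxes the operator pages through.

```python
import string
from collections import defaultdict
from dataclasses import dataclass

# Assumption: "A-Z and 0-9" is taken to include lowercase letters,
# since the example in the text shows groups of e's.
ALLOWED = set(string.ascii_letters + string.digits)


@dataclass(frozen=True)
class OcrChar:
    """Hypothetical record for one recognized character."""
    page: str          # page or issue identifier
    index: int         # position of the character in that page's text
    char: str          # character as output by the OCR engine
    confidence: float  # assumed to lie in [0, 1]


def group_low_confidence(chars, threshold=0.8, batch_size=9):
    """Group low-confidence alphanumerics by recognized character.

    Returns a dict mapping each character to a list of batches,
    each batch holding at most `batch_size` OcrChar records.
    The threshold value is an assumption made for illustration.
    """
    groups = defaultdict(list)
    for c in chars:
        if c.char in ALLOWED and c.confidence < threshold:
            groups[c.char].append(c)

    return {
        label: [members[i:i + batch_size]
                for i in range(0, len(members), batch_size)]
        for label, members in groups.items()
    }
```

Because characters are pooled across pages before batching, each record keeps its page and index so that the context image fragment can still be located for display.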

The GUI for the Carpet system must possess the following functions:

  1. An automatic context display of the TIFF image must be available whenever any of the edit boxes has focus.
  2. With a mouse click on an edit box, or by using the F1 through F9 keys, the operator should be able to correct the character. In Figure 12.3.1, the bitmap of each low-confidence character appears above its edit box. If the operator is unsure what the character is, typing a '?' invokes the image fragment that provides the necessary context.
  3. To continue, the operator should be able to select Next>> to load the next batch of 9 characters, or select <<Previous for a second look at the previous 9. Selecting Next>> sets all displayed characters to high confidence.
  4. The operator should be able to reset all displayed characters to low confidence, in case the Next>> button is clicked accidentally.
  5. Clicking OK should stop processing.
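The paging and confidence behavior in functions 2 through 4 can be sketched as a small state object, independent of any GUI toolkit. This is a minimal illustration under stated assumptions, not the planned Visual C++ design: the names `Slot` and `CarpetSession` are hypothetical, keys 1 to 9 stand in for F1 through F9, and the reset is assumed to act on the batch currently displayed, so that after an accidental Next>> the operator would first return with <<Previous.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    """One edit box: the current character and its confidence flag."""
    char: str
    low_confidence: bool = True


class CarpetSession:
    """Hypothetical model of the Carpet window state (no GUI code)."""

    BATCH = 9  # one edit box per function key F1..F9

    def __init__(self, chars):
        self.slots = [Slot(c) for c in chars]
        self.start = 0  # index of the first slot currently displayed

    def batch(self):
        """Slots currently displayed in the window."""
        return self.slots[self.start:self.start + self.BATCH]

    def correct(self, key, typed):
        """Apply a keystroke to edit box `key` (1..9).

        Returns True when the operator typed '?', signalling that the
        caller should display the context image fragment; otherwise the
        character is replaced and False is returned.
        """
        shown = self.batch()
        if not 1 <= key <= len(shown):
            raise IndexError("no edit box for key %d" % key)
        if typed == "?":
            return True
        shown[key - 1].char = typed
        return False

    def next(self):
        """Accept the displayed batch as high confidence and advance.

        Returns False when there is no further batch to show.
        """
        for slot in self.batch():
            slot.low_confidence = False
        if self.start + self.BATCH < len(self.slots):
            self.start += self.BATCH
            return True
        return False

    def previous(self):
        """Step back one batch for a second look."""
        self.start = max(0, self.start - self.BATCH)

    def reset(self):
        """Mark every displayed character as low confidence again."""
        for slot in self.batch():
            slot.low_confidence = True
```

Keeping the state separate from the display code in this way would let the accept-and-advance logic be tested without rendering any bitmaps.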

The Carpet system is to be implemented in Visual C++ with the Kodak Image libraries. Once the software is developed, we intend to conduct a performance study using this system for reconciling, measuring the residual error rate and the time taken to correct and verify all the characters from a complete journal issue at a time. Should accuracy and speed prove to be an improvement over the current verification method, this module will be incorporated into the reconcile workstation software.
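One way the two study measures might be computed is sketched below. The text does not define the residual error rate, so this assumes it is the fraction of characters still wrong after verification, judged against a ground-truth transcription of equal length and aligned position by position. The function names are hypothetical.

```python
def residual_error_rate(verified, truth):
    """Fraction of characters that remain wrong after verification.

    Assumes `verified` and `truth` are the same length and aligned
    character by character; a real study would likely need an
    alignment step to handle inserted or dropped characters.
    """
    if len(verified) != len(truth):
        raise ValueError("texts must be aligned and of equal length")
    if not truth:
        return 0.0
    errors = sum(v != t for v, t in zip(verified, truth))
    return errors / len(truth)


def seconds_per_character(total_seconds, n_characters):
    """Average operator time spent per verified character."""
    if n_characters <= 0:
        raise ValueError("n_characters must be positive")
    return total_seconds / n_characters
```

Computing both figures for the same journal issue under the current method and under the Carpet method would give the comparison the study calls for.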


Figure 12.3.1 GUI for Carpet Character Certification and Correction





URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_18.php
Last updated December 06, 2001
