National Library of Medicine, HTTP://www.nlm.nih.gov Communications Engineering Branch Title Lister Hill National Center for Biomedical Communications, HTTP://www.lhncbc.nlm.nih.gov/
 



Automating the production of bibliographic records for MEDLINE

8.2.5 Implementation

Word matching for low-confidence words in the affiliation field was implemented through two software development efforts.42 A separate console program called PatternMatch was developed to automatically parse low-confidence OCR words from the identified affiliation field, submit them to the cascade matching algorithm to find possible correct words, and, if any are returned, insert those words into the affiliation text immediately after the original OCR word, specially tagged for processing by the Reconcile workstation. PatternMatch was developed in C++ in the Microsoft Visual Studio development environment. In addition to the MARS database for input and output records, PatternMatch requires three files: the large and small lexicons and the character substitution frequency matrix used by Probability Matching.
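The tagging step described above can be sketched as follows. This is only an illustration under stated assumptions: the lexicon contents, the `<match>...</match>` tag format, and the function name are hypothetical, and Python's generic `difflib` fuzzy matching stands in for the cascade matching algorithm, which is not reproduced here.

```python
import difflib

# Hypothetical lexicon; the real lexicons are compiled from verified MARS records.
LEXICON = ["University", "Universidad", "Universitat", "Department", "Medicine"]


def tag_low_confidence(words, low_conf_indices, lexicon=LEXICON, max_candidates=3):
    """Insert tagged candidate words after each low-confidence OCR word.

    words: OCR words of the affiliation field, in order.
    low_conf_indices: positions of words flagged as low confidence.
    The <match>...</match> tag format is invented for this sketch.
    """
    out = []
    for i, word in enumerate(words):
        out.append(word)
        if i in low_conf_indices:
            # Stand-in for the cascade matching algorithm (assumption).
            candidates = difflib.get_close_matches(
                word, lexicon, n=max_candidates, cutoff=0.6
            )
            # Tags are inserted only when at least one candidate is returned.
            if candidates:
                out.append("<match>" + "|".join(candidates) + "</match>")
    return " ".join(out)
```

Calling `tag_low_confidence(["UniversiO", "of", "Utah"], {0})` leaves the original OCR word in place and follows it with a tagged candidate list, so a downstream program can offer the candidates to an operator without losing the OCR output.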

Additional software was developed for Reconcile to interpret the special tags placed in the affiliation text by PatternMatch and to display the word choices to the Reconcile operator. An example of the Reconcile application screen is shown in Figure 8.2.1. In this instance, the original OCR output word containing low-confidence characters, seen in the lower half of the figure, is UniversiO. The upper half of the screen displays the scanned image with a red box around the image region corresponding to the word highlighted in the lower half of the screen. In this example the word matching process found 10 words that could possibly match UniversiO. These are presented to the operator in a drop-down list. The first word in the list (UniversiO) is the original OCR word, the default highlighted word (University) is the highest ranked word match, the third word in the list (Universi) is the second highest ranked word match, and so on. The Reconcile operator has the option to hit Escape and leave the original word highlighted, hit Return to replace the original OCR word with the highest ranked word, or select any of the words in the list using the mouse or keyboard and hit Return. The ease of selection is proportional to the need for correction: selecting the original OCR word or the first candidate word in the match list is accomplished with a single keystroke. These cases account for approximately 90% of the words containing low-confidence characters.
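The list ordering and default selection described above can be sketched as follows. The function names and the key labels are hypothetical, and treating Escape as "text left unchanged" is an assumption about the Reconcile behavior rather than a description of its actual code.

```python
def build_choice_list(ocr_word, ranked_matches):
    """Original OCR word first, followed by the matches in rank order."""
    return [ocr_word] + list(ranked_matches)


def resolve_selection(choices, key, highlighted=1):
    """Return the word that remains in the affiliation text.

    highlighted defaults to index 1, the highest ranked match.
    'escape' leaves the text unchanged (assumed); 'return' accepts
    whichever entry is currently highlighted.
    """
    if len(choices) < 2:
        raise ValueError("a choice list needs the OCR word and at least one match")
    if key == "escape":
        return choices[0]
    if key == "return":
        return choices[highlighted]
    raise ValueError("unsupported key: " + repr(key))
```

With `choices = build_choice_list("UniversiO", ["University", "Universi"])`, a bare Return yields the highest ranked match, while moving the highlight to another index before Return selects that entry instead.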

PatternMatch was placed in operation in December 2000, and the Reconcile software to support word selection was introduced in February 2001. In the first four months of operation, operators selected the first match choice 80.8% of the time, selected one of the other match choices 8.8% of the time, selected no word from the list 6.7% of the time, and selected the original OCR word 3.7% of the time.
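As a quick arithmetic check on the figures above, the four outcomes can be added up. The dictionary keys are labels chosen for this sketch; the percentages are the ones reported in the text.

```python
# Operator outcomes during the first four months, in percent (from the text).
shares = {
    "first_match": 80.8,
    "other_match": 8.8,
    "no_word": 6.7,
    "original_ocr": 3.7,
}

# The four outcomes should cover every low-confidence word presented.
total = round(sum(shares.values()), 1)

# Share of cases in which the operator accepted some word from the match list.
match_accepted = round(shares["first_match"] + shares["other_match"], 1)
```

The four outcomes add up to 100%, and a word proposed by the matching process was accepted in 89.6% of cases.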

Screen capture of the split screen of the PatternMatch reconcile software recognizing the word University.
Figure 8.2.1 PatternMatch output for reconcile operator

Two modifications to the subsystem for affiliation word matching were implemented in the summer of 2001. A new compilation of affiliation words was drawn from a larger set of verified MARS records to produce more complete lexicons for the PatternMatch program. In addition, PatternMatch was modified to handle words containing diacritics more accurately.

9 Operator workstation design

In this section we describe the three principal types of operator workstations, for scanning, editing and reconciling. In all cases, off-the-shelf hardware is used.

9.1 Scan workstation

The scanners in production are mid-range devices manufactured by Fujitsu or Ricoh. These devices are controlled by an in-house application called Scan. The primary task of this software is to enable an operator to scan a page of an article to produce a TIFF image, and to insert, delete or replace a page image. The software also allows the operator to initiate the workflow for a journal issue (i.e., by entering the MRI number that identifies the journal issue to be processed) in case the first stage in MARS, CheckIn, fails or a supervisor is unable to initiate that process for any reason. This feature contributes to overall system reliability. In addition, Scan requires the operator to check the quality of the image resulting from scanning; the operator may zoom into any part of the image to ensure that the page has been scanned correctly. Finally, when a journal is sent back from downstream processes for rescanning, the Scan software identifies for the operator the pages to be rescanned.

The Scan software is written in C++ and compiled with Microsoft Visual C++ 6.0. The GUI design is based on the AppWizard in MFC, using the Single Document Interface (SDI) option. Parallel processing is not required, since the operator works on only one page in a journal issue at a time.

The software uses ActiveX to control the scanner through a Kofax controller. The TIFF images are displayed, magnified, rotated and scaled through another ActiveX control provided by Eastman Kodak Image software. Communications with the MARS database are accomplished through RogueWave functions. These are shown in the schematic below.

Figure 9.1.1 Scan software schematic

The Scan program implements real-time communications between the scanner and the MARS database. Should this communication fail, Scan alerts the operator immediately. Scan creates records in the WIP, Page and ProcessTime tables in the database. In the WIP table, it records the journal issue identification (the MRI number barcode scanned in by the operator), time the record is archived, time the journal issue is scanned, location of the TIFF images, total number of images in the issue, the operator ID, the type of scanner, and other data. In the Page table, it records a unique number identifying each page (PageID), scan density (dpi), the height and width of the page in pixels, and whether the scanned page is the first or second page of the article (the second page is scanned only if the abstract continues on to this page). In the ProcessTime table, it records the start and end time for scanning a page, and whether scanning is mouse/keyboard driven or speech controlled.
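The three record types described above can be sketched as simple data structures. All class and field names below are hypothetical, chosen to mirror the description; they are not the actual MARS column names, and the unspecified "other data" fields are omitted.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class WipRecord:
    mri_number: str                 # journal issue identification (barcode)
    scanned_at: datetime            # time the journal issue is scanned
    archived_at: Optional[datetime] # time the record is archived, if known
    tiff_location: str              # location of the TIFF images
    image_count: int                # total number of images in the issue
    operator_id: str
    scanner_type: str


@dataclass
class PageRecord:
    page_id: int         # unique number identifying the page (PageID)
    dpi: int             # scan density
    height_px: int
    width_px: int
    is_first_page: bool  # False for a second page carrying a continued abstract


@dataclass
class ProcessTimeRecord:
    page_id: int
    started_at: datetime
    ended_at: datetime
    input_mode: str  # "mouse/keyboard" or "speech"

    def __post_init__(self):
        if self.input_mode not in ("mouse/keyboard", "speech"):
            raise ValueError("unknown input mode: " + self.input_mode)

    def duration_seconds(self) -> float:
        """Elapsed scanning time for this page."""
        return (self.ended_at - self.started_at).total_seconds()
```

Keeping the timing data in its own record, separate from the page geometry, makes it straightforward to compare scanning times between mouse/keyboard and speech-controlled operation.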

The Scan software has a quality control (QC) function requiring the operator to view the image before exiting the application. To give the operator a quick indication of quality, we have provided a skew detection capability, skew being a factor in poor image quality: the operator is alerted if the skew exceeds a preset threshold, as shown in Figure 9.1.2.

Figure 9.1.2 An alert to the operator (left); the actual image (right).
Screen shot of the message box that alerts the operator to scanned image skew. The incorrectly scanned article image.
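The threshold test behind such an alert can be sketched as follows. The method Scan uses to detect skew is not described here, so this example assumes one simple approach: fit a least-squares line through points sampled along a text baseline and compare its angle with a preset threshold. The function names and the 1-degree default threshold are hypothetical.

```python
import math


def estimate_skew_degrees(baseline_points):
    """Angle of the least-squares line through (x, y) baseline points, in degrees."""
    n = len(baseline_points)
    if n < 2:
        raise ValueError("need at least two baseline points")
    mean_x = sum(x for x, _ in baseline_points) / n
    mean_y = sum(y for _, y in baseline_points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in baseline_points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in baseline_points)
    if sxx == 0:
        raise ValueError("baseline points must span more than one x position")
    # The slope is sxy / sxx; atan2 converts it to an angle.
    return math.degrees(math.atan2(sxy, sxx))


def skew_alert(baseline_points, threshold_degrees=1.0):
    """True when the estimated skew exceeds the preset threshold (assumed value)."""
    return abs(estimate_skew_degrees(baseline_points)) > threshold_degrees
```

A baseline that rises 5 pixels over 100 pixels corresponds to a skew of roughly 2.9 degrees and would trigger the alert at this threshold, while a level baseline would not.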

The workstation using the Scan software may be controlled conventionally by mouse/keyboard as well as by speech recognition. The design of the speech-enabled workstation, and the tradeoffs considered among different speech recognition approaches, are reported in the literature.58








URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_11.php
Last updated December 06, 2001
