Automating the production of bibliographic records for MEDLINE

2. Project objectives

The objectives of this project are to:
3. Project significance

There are two principal reasons why automated data entry is of interest: first, the gradual rise of labor costs; and second, the unrelenting increase in the amount of data that must be entered into databases from paper-based information. The vast majority of the hundreds of databases produced in every discipline rely on laborious keyboard entry of bibliographic information from journal articles, e.g., article title, author names, institutions, abstract, dates, and page numbers. Image analysis and understanding techniques provide the basis for automated systems that promise a cheaper alternative to keyboarding and more timely availability of bibliographic data to the public.

4. System description

To provide context for the subsequent discussions of image analysis research, this section first describes the overall system (MARS-2), which serves as both an experimental testbed and a practical production tool. Second, since the system is database-centered and database-driven, we outline the design of the central database that controls the workflow and serves as the repository for all data flowing in and out of the many processes. Third, since the optical character recognition system is key to extracting text from the document images, we describe the evaluation criteria and test results that led to the package selected.

4.1 Overview

The MARS-2 system consists of both automated and operator-controlled subsystems, as shown in Figure 4.1. The schematic shows automated processes as boxes with thin boundaries and manual workstations as boxes with thick boundaries. The workflow is initiated at the CheckIn stage, where a supervisor scans the barcode on a journal issue arriving at the production facility. As mentioned earlier, this barcode number, called the "MRI", is routinely affixed to every journal issue by NLM staff.
It therefore serves as a unique key to identify the issue, all the pages scanned in that issue, and indeed the outputs of all processes performed on those page images. The scanning operator captures the first page of every article in the issue, since this page contains the fields we seek to extract automatically. The resulting TIFF images are stored on a file server, and the associated data in the MARS database, whose underlying DBMS is Microsoft SQL Server.

The OCR system accesses the TIFF images and produces the corresponding text along with other data describing the text characters, such as bounding boxes, attributes (bold, italic, underlined), confidence level, and font style and size. The automatic zoning (Autozone) module then blocks out the contiguous text using features derived from the OCR output data. It is followed by the automated labeling (Autolabel) module, which identifies the zones as the fields of interest (article title, author names, affiliations, abstract). The Autoreformat module then organizes the syntax of the zone contents to adhere to MEDLINE conventions (e.g., the author name John A. Smith becomes Smith JA).

At this point, two lexicon-enabled modules operate on the data to reduce the burden on the operator performing the final checking and verification: ConfidenceEdit adjusts incorrect confidence levels assigned to the characters by the OCR system, and PatternMatch corrects institutional affiliations whose text is frequently recognized incorrectly by the OCR system.

Some data cannot be extracted automatically, mainly because they appear on pages other than the scanned first page. Such data are entered manually by a pair of edit operators, a double-key process that ensures a high degree of accuracy. An EditDiff module then compares the two entries and notes the differences.
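
The double-key comparison attributed to EditDiff can be illustrated with a minimal sketch. The source does not describe how EditDiff is implemented, so everything here is an assumption made for illustration: the function name edit_diff, the representation of each operator's entry as a dictionary of field names to strings, and the use of Python's standard difflib module to locate the differing character spans.

```python
from difflib import SequenceMatcher


def edit_diff(entry_a: dict, entry_b: dict) -> dict:
    """Compare two independently keyed entries field by field.

    Hypothetical sketch: returns, for every field on which the two
    operators disagree, a list of (operation, text_a, text_b) tuples
    describing the character spans that differ.
    """
    differences = {}
    for field in sorted(set(entry_a) | set(entry_b)):
        a = entry_a.get(field, "")
        b = entry_b.get(field, "")
        if a == b:
            continue  # both operators keyed the same text
        matcher = SequenceMatcher(None, a, b, autojunk=False)
        differences[field] = [
            (tag, a[i1:i2], b[j1:j2])
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"
        ]
    return differences


# Illustrative field names and values, not taken from the MARS database.
operator_1 = {"pages": "123-130", "date": "2001 Jan"}
operator_2 = {"pages": "123-139", "date": "2001 Jan"}
print(edit_diff(operator_1, operator_2))
```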
The output of the automated processes and the edit operators is then presented to the reconcile operator, who verifies and corrects the text. The Upload module then sends the verified data to the NLM's DCMS (Data Creation Maintenance System), which NLM indexers access to add MeSH terms and keywords, thereby completing the MEDLINE record. The Admin workstation shown is used by the production supervisor to send a journal issue back to an earlier processing stage in case of errors.

Figure 4.1 MARS-2 general schematic
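
The Autoreformat example given in the overview (John A. Smith becomes Smith JA) can be sketched as a small function. This is only an illustration under stated assumptions: the name to_medline_author is hypothetical, the input is assumed to be in "given names, then surname" order, and suffixes, compound surnames, and name particles are not handled. The actual Autoreformat rules are not described in this section.

```python
def to_medline_author(name: str) -> str:
    """Rewrite 'John A. Smith' as 'Smith JA'.

    Hypothetical sketch: treats the last token as the surname and
    reduces every preceding token to its uppercase initial.
    """
    parts = name.replace(".", " ").split()
    if len(parts) < 2:
        # A single token has no given names to abbreviate.
        return name.strip()
    *given, surname = parts
    initials = "".join(token[0].upper() for token in given)
    return f"{surname} {initials}"


print(to_medline_author("John A. Smith"))  # Smith JA
```

Taking the last token as the surname keeps the sketch short, but it would mis-handle names such as "Ludwig van Beethoven"; a production module would need rules for those cases.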
URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_2.php