Automating data entry into MEDLINE

George R. Thoma
National Library of Medicine
Bethesda, Maryland 20894


Abstract

Data entry for the thousands of bibliographic databases around the world from information in journal articles continues to be heavily manual. At the National Library of Medicine (NLM) we are automating the production of bibliographic records for MEDLINE®, NLM's premier database used by clinicians and researchers worldwide. As a first step, the Lister Hill National Center for Biomedical Communications, an R&D division of the library, has developed a system called MARS (for Medical Article Record System) that involves scanning and converting (by OCR) the abstracts that appear in journal articles, while keyboarding the remaining fields (e.g., article title, authors, affiliations). We focus on the abstract first because it is the largest field in a typical record, amounting to a maximum of 4000 characters. While this system is in production, we are designing a second generation system to automatically extract the other fields as well. This future system will also employ scanning and OCR, in addition to modules that automatically zone the scanned pages, identify the zones as particular fields, and reformat the field syntax to adhere to MEDLINE conventions. This talk describes the first generation system currently used for production, and the ongoing work toward the design of the second generation system.

The initial system consists of multiple workstations of three types: scan workstations, workstations for manual entry (keyboarding) and workstations for reconciling (proofing and correcting). In addition, there are three unattended servers: a network file server, an OCR server and a server that performs various file matches.

At each scan workstation, the operator first barcode-scans an ID number that appears uniquely on each issue. The operator then scans the first page of each article on which the abstract appears, and manually zones the title and abstract, whose bitmapped TIFF files are sent to the network server. The OCR server retrieves these TIFF files from the network server, and produces text files of the abstract and title. The network server maintains directories in which the scanned TIFF images, the abstract text files and the citation files are all kept until they are acted upon. The barcoded ID number scanned initially serves as a directory name, and all TIFF images and OCR data for all the articles in that issue are linked to that number.
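The directory scheme described above can be illustrated with a minimal sketch. The file naming pattern, the helper names (issue_dir, article_paths, pending_ocr) and the idea of detecting pending OCR work by a missing text file are assumptions made for illustration; they are not taken from the MARS implementation.

```python
from pathlib import Path


def issue_dir(root: Path, issue_id: str) -> Path:
    """Return (and create if needed) the directory named after the barcoded issue ID."""
    directory = root / issue_id
    directory.mkdir(parents=True, exist_ok=True)
    return directory


def article_paths(root: Path, issue_id: str, article_no: int) -> dict:
    """Hypothetical file names for one article's zoned images and OCR output."""
    base = issue_dir(root, issue_id)
    stem = f"art{article_no:03d}"
    return {
        "title_tiff": base / f"{stem}_title.tif",
        "abstract_tiff": base / f"{stem}_abstract.tif",
        "title_text": base / f"{stem}_title.txt",
        "abstract_text": base / f"{stem}_abstract.txt",
    }


def pending_ocr(root: Path, issue_id: str) -> list:
    """List TIFF files that have no matching text file yet (assumed OCR work queue)."""
    base = issue_dir(root, issue_id)
    return sorted(
        tiff for tiff in base.glob("*.tif")
        if not tiff.with_suffix(".txt").exists()
    )
```

In this sketch, an OCR server process would poll pending_ocr for each issue directory and write a text file next to each TIFF it converts, so that everything belonging to one issue stays linked through the directory name.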

Concurrently, or at any other time, the keyboarder keys in the fields (other than the abstract) for each article, and a second operator repeats this process for the same articles. Double keying is found to substantially improve accuracy, thereby reducing the burden on the reconcile operators. The two manual entries are compared automatically to produce a "citation difference" file highlighting inconsistencies. Then the title field from this citation difference file is automatically matched with the OCR'ed title, thereby linking the keyed data with the scanned abstract.
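The comparison of the two keyed entries and the title matching can be sketched as follows. This is only an illustration under stated assumptions: the field names, the use of Python's difflib similarity ratio, and the 0.8 threshold are hypothetical choices, not the matching method used in MARS.

```python
from difflib import SequenceMatcher

# Hypothetical field names for a keyed citation.
FIELDS = ("title", "authors", "affiliation")


def citation_difference(entry_a: dict, entry_b: dict) -> dict:
    """Return the fields on which two independently keyed entries disagree."""
    diffs = {}
    for field in FIELDS:
        a, b = entry_a.get(field, ""), entry_b.get(field, "")
        if a != b:
            diffs[field] = (a, b)
    return diffs


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so that layout differences do not matter."""
    return " ".join(text.lower().split())


def match_title(keyed_title: str, ocr_titles: dict, threshold: float = 0.8):
    """Return the article ID whose OCR'ed title best matches the keyed title.

    Returns None if no candidate reaches the (assumed) similarity threshold.
    """
    best_id, best_score = None, 0.0
    for article_id, ocr_title in ocr_titles.items():
        score = SequenceMatcher(
            None, normalize(keyed_title), normalize(ocr_title)
        ).ratio()
        if score > best_score:
            best_id, best_score = article_id, score
    return best_id if best_score >= threshold else None
```

An approximate match is used here because the OCR'ed title may itself contain recognition errors, so an exact string comparison against the keyed title would often fail.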

Meanwhile, the abstract text from the OCR is checked by a spellcheck module. This module uses medical lexicons and heuristic rules to reduce the number of correct words that are highlighted as suspect, which in turn reduces the burden on the reconcile operator. At this point, all the fields entered by keyboard and output by the OCR system are available for validation and proofing by the reconcile operators. Following this step, the completed record file is FTP'ed to the NLM mainframe computer, and later accessed by indexers who add appropriate descriptive information such as Medical Subject Headings, thereby completing the bibliographic record to be added to MEDLINE.
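A minimal sketch of such a spellcheck step is given below. The tiny word lists and the two heuristics (skip tokens containing digits, skip all-uppercase abbreviations) are assumptions chosen for illustration; the actual lexicons and rules of the MARS module are not described in this abstract.

```python
import re

# Toy stand-ins for a general dictionary and a medical lexicon (assumed contents).
GENERAL_WORDS = {"the", "patient", "was", "treated", "with", "and", "of"}
MEDICAL_LEXICON = {"hepatocellular", "carcinoma", "interferon"}


def exempt_by_heuristic(token: str) -> bool:
    """Assumed heuristics: tokens with digits or all-caps abbreviations are not flagged."""
    has_digit = any(ch.isdigit() for ch in token)
    is_abbreviation = token.isupper() and len(token) > 1
    return has_digit or is_abbreviation


def suspect_words(text: str) -> list:
    """Return the tokens that would be highlighted for the reconcile operator."""
    suspects = []
    for token in re.findall(r"[A-Za-z0-9][A-Za-z0-9\-]*", text):
        if exempt_by_heuristic(token):
            continue
        word = token.lower()
        if word in GENERAL_WORDS or word in MEDICAL_LEXICON:
            continue
        suspects.append(token)
    return suspects
```

Without the medical lexicon and the heuristics, correct terms such as "hepatocellular" or "IL-2" would be highlighted along with genuine OCR errors, which is the extra work the module is meant to spare the operator.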

The ongoing work on the second generation system consists of developing algorithms to detect page zones (page segmentation), automatically label these zones by field name (article title, author, affiliation, abstract), and then automatically reformat the zone text syntax. The system relies on a database both to keep track of the workflow and to serve as a repository for data extracted from the scanned page, to be used by subsequent processes.
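The dual role of the database, tracking workflow and holding extracted data, can be sketched with a small SQLite example. The table layout, the stage names and the function names are hypothetical and serve only to illustrate the idea; the abstract does not specify the actual database design.

```python
import sqlite3

# Assumed processing stages for a scanned page.
STAGES = ("scanned", "ocr_done", "zoned", "labeled", "reformatted")


def open_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the two illustrative tables: page status and extracted zones."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS page ("
        "page_id TEXT PRIMARY KEY, stage TEXT NOT NULL)"
    )
    conn.execute(
        "CREATE TABLE IF NOT EXISTS zone ("
        "page_id TEXT NOT NULL, zone_no INTEGER NOT NULL, "
        "label TEXT, text TEXT, PRIMARY KEY (page_id, zone_no))"
    )
    return conn


def advance(conn: sqlite3.Connection, page_id: str, stage: str) -> None:
    """Record that a page has reached the given stage."""
    if stage not in STAGES:
        raise ValueError(f"unknown stage: {stage}")
    conn.execute(
        "INSERT OR REPLACE INTO page (page_id, stage) VALUES (?, ?)",
        (page_id, stage),
    )


def pages_at(conn: sqlite3.Connection, stage: str) -> list:
    """Pages waiting at a stage, i.e. the work queue for the next module."""
    rows = conn.execute(
        "SELECT page_id FROM page WHERE stage = ? ORDER BY page_id", (stage,)
    )
    return [row[0] for row in rows]


def store_zone(conn, page_id: str, zone_no: int, label: str, text: str) -> None:
    """Save the label and text extracted for one zone of a page."""
    conn.execute(
        "INSERT OR REPLACE INTO zone (page_id, zone_no, label, text) "
        "VALUES (?, ?, ?, ?)",
        (page_id, zone_no, label, text),
    )


def zones_for(conn, page_id: str) -> list:
    """Return (zone_no, label, text) tuples for a page, in zone order."""
    rows = conn.execute(
        "SELECT zone_no, label, text FROM zone WHERE page_id = ? ORDER BY zone_no",
        (page_id,),
    )
    return list(rows)
```

In such a design, each module (segmentation, labeling, reformatting) would query pages_at for its input stage, store its results, and then call advance, so that the modules need not communicate with each other directly.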