National Library of Medicine (http://www.nlm.nih.gov), Lister Hill National Center for Biomedical Communications (http://www.lhncbc.nlm.nih.gov/), Communications Engineering Branch
 


Medical Article Record System (MARS)

Entering information from journal articles into the thousands of bibliographic databases around the world remains a heavily manual process. At the National Library of Medicine (NLM) we are automating the production of bibliographic records for MEDLINE, NLM's premier database used by clinicians and researchers worldwide. As a first step we have developed a system called MARS (Medical Article Record System), in which the abstracts that appear in journal articles are scanned and converted by optical character recognition (OCR), while the remaining fields (e.g., article title, authors, and affiliations) are keyboarded. This system has been in production since 1996 and employs a team of professionals to process 600 articles daily.

[Figure: An example of 3 different journal article layout types.]

A second-generation system is now being designed to extract the remaining fields automatically. It also relies on scanning and OCR, and adds modules that automatically zone the scanned pages, identify each zone as a particular field, and reformat the field syntax to adhere to MEDLINE conventions. Developing this system therefore involves three kinds of algorithms: detecting page zones (page segmentation), labeling those zones by field name (article title, author, affiliation, abstract), and reformatting the syntax of the zone text. The system relies on a database both to keep track of the workflow and to serve as a repository for data extracted from the scanned page, which subsequent processes then use.
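The zone, label, and reformat stages described above can be illustrated with a small Python sketch. This is not the MARS implementation: the `Zone` class, the layout heuristic (the largest font marks the title, and the zones below it are author, affiliation, and abstract in reading order), and the author reformatting rule are all hypothetical simplifications chosen for illustration. Page segmentation, OCR, and the workflow database are assumed to have run already or are left out.

```python
import re
from dataclasses import dataclass


@dataclass
class Zone:
    """A rectangular text region, as a page segmentation step might emit it."""
    text: str         # OCR text of the zone
    top: int          # y-coordinate of the zone's top edge (hypothetical units)
    font_size: float  # dominant font size reported by OCR (assumed available)


def label_zones(zones):
    """Toy labeling rule (an assumption, not the MARS algorithm).

    The zone with the largest font is taken as the title; the zones
    below it, in reading order, are taken as author, affiliation and
    abstract. Zones above the title (e.g. a running header) are ignored.
    """
    ordered = sorted(zones, key=lambda z: z.top)
    title_idx = max(range(len(ordered)), key=lambda i: ordered[i].font_size)
    names = ["title", "author", "affiliation", "abstract"]
    return dict(zip(names, ordered[title_idx:]))


def format_author(name):
    """Rewrite 'John A. Smith' as 'Smith JA' (surname, then initials)."""
    parts = name.replace(".", " ").split()
    if len(parts) < 2:
        return " ".join(parts)
    initials = "".join(p[0].upper() for p in parts[:-1])
    return f"{parts[-1]} {initials}"


def format_authors(text):
    """Split an author line on commas and the word 'and', then reformat."""
    names = [n.strip() for n in re.split(r",|\band\b", text)]
    return [format_author(n) for n in names if n]


def build_record(zones):
    """Run labeling and reformatting, returning a simple field dictionary."""
    labeled = label_zones(zones)
    return {
        "title": labeled["title"].text if "title" in labeled else "",
        "authors": format_authors(labeled["author"].text) if "author" in labeled else [],
        "affiliation": labeled["affiliation"].text if "affiliation" in labeled else "",
        "abstract": labeled["abstract"].text if "abstract" in labeled else "",
    }


# Made-up zones standing in for the output of segmentation and OCR.
page = [
    Zone("J Example Med 2001", top=10, font_size=8.0),
    Zone("A Study of Things", top=50, font_size=18.0),
    Zone("John A. Smith, Mary Jones and Li Wei", top=90, font_size=11.0),
    Zone("Example University", top=120, font_size=9.0),
    Zone("We studied things.", top=160, font_size=10.0),
]
record = build_record(page)
print(record["authors"])
```

A production system would need far more robust labeling, since layouts differ widely from journal to journal. The sketch only shows how the three stages hand data to one another.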

Thoma GR. Automating the production of bibliographic records for MEDLINE. Internal R&D report, CEB, LHNCBC, NLM; September 2001; 92.






URL: http://archive.nlm.nih.gov/proj/mars/mars.php
Last updated August 02, 2006
