
Last updated: June 18, 2008

CEB Projects






Automating the production of bibliographic records for MEDLINE


4.2 Database considerations

The MARS-2 system is database-centered: all data pertaining to its operation are stored in a relational database management system (RDBMS), where some processes enter data and others retrieve it. The RDBMS chosen for the MARS database is Microsoft SQL Server 6.5 running under the Windows NT 4.0 Server operating system. This RDBMS is scheduled for a major upgrade to MS SQL Server 2000 Enterprise Edition under Windows 2000 Advanced Server with the Clustering / High Availability functionality. The upgrade will provide the following advantages: a row-level locking scheme that minimizes deadlocks, cascading referential integrity constraints that simplify the implementation of business logic in applications, and functionality that enables XML to be used to retrieve and modify values in the database. Other features of the upgraded system include data types that handle Unicode (should this prove useful in the future) and GUI-based tools that make the database system easier to maintain.

The database-driven workflow in MARS is controlled by a module called the Work Distribution Manager (WDM), which uses a set of database tables to determine when processes (e.g., automated zoning, labeling, and reformatting) should act on image and text data, when those processes are completed, and which process must occur next in the workflow. WDM resides within the database system and relies on stored procedures and a scheduler to achieve its functionality.
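The table-driven control described above can be illustrated with a minimal sketch. It is written in Python with an in-memory SQLite database rather than SQL Server stored procedures, and the table names (`process_order`, `work_queue`), column names, and the process sequence are all hypothetical, chosen only for illustration; the actual WDM tables are not listed in this section.

```python
import sqlite3

# Hypothetical process sequence; the real MARS workflow order is an assumption here.
WORKFLOW = ["scan", "ocr", "zoning", "labeling", "reformatting"]


def make_db():
    """Create an in-memory database holding the workflow order and a work queue."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE process_order (process TEXT PRIMARY KEY, next_process TEXT)"
    )
    conn.execute(
        "CREATE TABLE work_queue (mri INTEGER PRIMARY KEY, process TEXT, stage TEXT)"
    )
    # Each process points at its successor; the last one points at NULL.
    pairs = list(zip(WORKFLOW, WORKFLOW[1:] + [None]))
    conn.executemany("INSERT INTO process_order VALUES (?, ?)", pairs)
    return conn


def complete_current(conn, mri):
    """Mark the current process for an issue as finished and queue the next one.

    Returns the name of the next process, or None when the workflow is done.
    """
    row = conn.execute(
        "SELECT p.next_process FROM work_queue w "
        "JOIN process_order p ON p.process = w.process WHERE w.mri = ?",
        (mri,),
    ).fetchone()
    if row is None:
        raise KeyError(mri)
    next_process = row[0]
    if next_process is None:
        conn.execute("UPDATE work_queue SET stage = 'done' WHERE mri = ?", (mri,))
    else:
        conn.execute(
            "UPDATE work_queue SET process = ?, stage = 'waiting' WHERE mri = ?",
            (next_process, mri),
        )
    return next_process
```

In this sketch the workflow order lives in a table rather than in application code, so changing the sequence of processes would only require changing rows, which is one plausible reason for a database-resident design.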

The MARS database consists of sixty-two tables, some of which are outlined in this section and shown in the figures. The WorkInProgress table (WIP) serves as a central hub containing common information for each database record (the set of data related to a single journal issue) and keeps track of whether the processing of data for a particular journal issue is incomplete, has been completed, or is waiting to be archived. WIP uses the MRI (journal issue identification number) as its primary key, and contains such information as the date the record was created, the date it was uploaded, the location of the page images and the text, the total number of page images scanned, the process currently in progress (identified by its PID, or process identification number), the current stage of that process, the priority of the journal as established by NLM's Library Operations, and other data.

The Page Table contains data associated with a single scanned page. It represents one level of a tree of relationships in which a journal issue has many page images, a page has several zones, zones have many text lines, and text lines have characters. A row in this table is created by the Scan process, but most of the information is filled in from the OCR output. The table also records the width and height of each page in 300 dpi units.
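The tree of relationships can be pictured with a short Python sketch. The class and attribute names below are hypothetical, not the actual table or column names, and the conversion assumes that "300 dpi units" means pixel counts at 300 dots per inch.

```python
from dataclasses import dataclass, field
from typing import List

DPI = 300  # assumed meaning of "300 dpi units": pixels at 300 dots per inch


@dataclass
class TextLine:
    characters: str  # a text line has characters


@dataclass
class Zone:
    lines: List[TextLine] = field(default_factory=list)  # a zone has many text lines


@dataclass
class Page:
    width: int   # page width in 300 dpi units
    height: int  # page height in 300 dpi units
    zones: List[Zone] = field(default_factory=list)  # a page has several zones

    def size_in_inches(self):
        """Convert the stored dimensions to inches."""
        return (self.width / DPI, self.height / DPI)


@dataclass
class JournalIssue:
    mri: int  # journal issue identification number
    pages: List[Page] = field(default_factory=list)  # an issue has many pages

    def character_count(self):
        """Walk the whole tree and count characters across all text lines."""
        return sum(
            len(line.characters)
            for page in self.pages
            for zone in page.zones
            for line in zone.lines
        )
```

Under these assumptions, a US Letter page scanned at 300 dpi would be stored as 2550 by 3300 units.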

The Label Table defines all the zones, both from the scanner and from the OCR system. For most records it reflects OCR-related information at the zone level, such as zone sequence and zone coordinates, and also provides a link to the text output from the OCR. Each process creates only its own labels rather than modifying labels created by other processes, which preserves the historical data for journal issues processed through the system.
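The practice of adding labels rather than modifying them amounts to an append-only record. A minimal Python sketch of that idea follows; the `Label` and `LabelStore` names and their fields are hypothetical and stand in for whatever columns the Label Table actually has.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)  # frozen: a label cannot be changed once created
class Label:
    zone_id: int
    process: str  # the process that created this label
    value: str


class LabelStore:
    """Append-only collection: rows are added, never updated or deleted."""

    def __init__(self):
        self._rows: List[Label] = []

    def add(self, zone_id: int, process: str, value: str) -> Label:
        label = Label(zone_id, process, value)
        self._rows.append(label)
        return label

    def history(self, zone_id: int) -> List[Label]:
        """All labels ever assigned to a zone, oldest first."""
        return [row for row in self._rows if row.zone_id == zone_id]

    def latest(self, zone_id: int) -> Optional[Label]:
        """The most recently added label for a zone, or None if there is none."""
        rows = self.history(zone_id)
        return rows[-1] if rows else None
```

Because earlier rows are never overwritten, the output of each process for a given zone remains available for later comparison.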

The Field Label Table defines all the fields that identify a citation, such as article title, author, affiliation, and abstract, among other labels.


The major tables are: WIP, Scanners, JournalIssue, JournalName, Publisher, Page, Features, Label, LabelRanking, and FieldLabels.

A complete listing of all the rules required for automated processing appears in the Rules Table. There are two types of rules: edit rules and format rules. Edit rules perform special processing to offset OCR errors in the context of a word. For example, a zero (0) occurring in the name "F0rd" would be replaced with the letter "o". The more numerous format rules, performed in a series of loops, implement the reformatting of field syntax; e.g., author names appearing in a journal article in the conventional first name-middle initial-last name sequence would be reformatted to the last name-space-first and middle initials form required by MEDLINE. The table contains an identification number for each rule, the text string that triggers the rule, and the text string that the rule in turn triggers. This is discussed further in Section 7 on Automated Reformatting.
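The two rule types can be illustrated with a small Python sketch. It is not the MARS rule set: the single edit rule, its regular expression, and the `reformat_author` helper are assumptions made for illustration, loosely following the identification number, trigger, and result layout described above.

```python
import re

# Hypothetical edit rules as (rule id, trigger pattern, replacement) rows.
# Rule 1: a zero between a letter and a lowercase letter is read as the letter "o".
EDIT_RULES = [
    (1, r"(?<=[A-Za-z])0(?=[a-z])", "o"),
]


def apply_edit_rules(text):
    """Apply every edit rule in turn to offset OCR errors inside words."""
    for _rule_id, pattern, replacement in EDIT_RULES:
        text = re.sub(pattern, replacement, text)
    return text


def reformat_author(name):
    """Format rule sketch: 'First M. Last' becomes 'Last FM'.

    A simplification: the final token is assumed to be the last name,
    so multi-word surnames and suffixes are not handled.
    """
    parts = name.replace(".", " ").split()
    if len(parts) < 2:
        return name  # nothing to reorder
    *given, last = parts
    initials = "".join(part[0].upper() for part in given)
    return f"{last} {initials}"
```

For instance, `apply_edit_rules("F0rd")` yields `"Ford"` while leaving a number such as `"2008"` untouched, and `reformat_author("John A. Smith")` yields `"Smith JA"`.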

MARS II database system format rules

Other tables keep track of performance data (time taken for a process to complete, number of text lines written per second, number of words spell checked, etc.), scanner settings, operator names and their work statistics, list of processes, the current stage a process is in, and errors that are tracked. Stored procedures were created to support reports of performance statistics. These tables are shown in the figure titled Instrumentation, below.



 

National Institutes of Health (NIH)
9000 Rockville Pike
Bethesda, Maryland 20892

U.S. Dept. of Health and Human Services
