National Library of Medicine, HTTP://www.nlm.nih.gov Communications Engineering Branch Title Lister Hill National Center for Biomedical Communications, HTTP://www.lhncbc.nlm.nih.gov/
 



Automating the production of bibliographic records for MEDLINE

7 Automated reformatting

Following the labeling of the zoned text, the contents of each zone correspond to the article title, author names, affiliation, or abstract that we are seeking. However, the text in the first three zones is rarely in the syntactic form required by MEDLINE's conventions. The automated reformatting stage (Autoreformat) is designed to convert this text to the desired formats to eliminate manual correction at the reconcile stage.

The reformatting of the title and author fields is implemented by predefined rules. Rules for the title field retain the capital letter at the start of the first word and convert all other words to lower case, with the exception of acronyms. For example, "Medical Management of AIDS Patients" becomes "Medical management of AIDS patients," as required in MEDLINE. Rules for author fields take into account characters that delimit authors in a multiple-author list; tokens to be eliminated, such as Ph.D. and M.D.; tokens to be converted, such as II to 2nd; and "particles" to be retained, such as "van." For example, the author name appearing on the printed page as Eric S. van Bueron, Ph.D. becomes Van Bueron ES, as required in MEDLINE.
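The author rules described above can be illustrated with a minimal sketch. It is not the production Autoreformat code: the function name `reformat_author` and the three rule tables are hypothetical, and only the entries 'Ph.D.', 'M.D.', 'II' and 'van' come from the text.

```python
# Hypothetical rule tables; in the system described here, rules are stored per journal.
DROP_TOKENS = {"Ph.D.", "PhD", "M.D.", "MD"}   # tokens to be eliminated
CONVERT_TOKENS = {"II": "2nd"}                 # tokens to be converted
PARTICLES = {"van", "von", "de"}               # only 'van' is from the text; the rest are illustrative


def reformat_author(printed: str) -> str:
    """Convert one printed author name to 'Lastname Initials [suffix]'."""
    words = [w.strip(",") for w in printed.split()]
    words = [w for w in words if w and w not in DROP_TOKENS]
    if not words:
        return ""

    # Pull a convertible suffix (e.g. 'II') off the end, if present.
    suffix = ""
    if len(words) > 1 and words[-1] in CONVERT_TOKENS:
        suffix = CONVERT_TOKENS[words.pop()]

    # The last name is the final word plus any particles directly before it.
    start = len(words) - 1
    while start > 0 and words[start - 1].lower() in PARTICLES:
        start -= 1
    last_name = " ".join(words[start:])
    last_name = last_name[0].upper() + last_name[1:]

    # Everything before the last name is reduced to initials.
    initials = "".join(w[0].upper() for w in words[:start])

    return " ".join(part for part in (last_name, initials, suffix) if part)


print(reformat_author("Eric S. van Bueron, Ph.D."))
```

The sketch handles one author at a time; splitting a multiple-author list on its delimiters is a separate step.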

Based on journal title and label (author or title), the reformatting module selects a subset of rules from the inclusive set of all rules. The selected rule set and the OCR output text are passed to the reformatting algorithm, and as each rule is applied, the OCR string is modified. In experiments conducted before this function was implemented in the production system, more than 97% of the authors and titles from a test set of 1,857 processed articles were reformatted correctly. This performance may be expected to improve with the addition of rules derived from production data.

The reformatting strategy for the affiliation field is quite different from that for titles and authors. The OCR data for an affiliation field may contain many affiliations, since each author may have a different affiliation. This data is often difficult to reformat. One reason is that only the affiliation of the first author is to be retained, in line with MEDLINE conventions. Another reason is that the desired data is spread out over the entire field and is not contiguous. For example, in a 30-word affiliation zone, we may only want to retain words 1-8, words 12-14, and word 30. Our method involves probability matching of the OCR output text to historical data of ~130,000 unique affiliations.

In the case of affiliations, in addition to the processing at the reformat stage, we attempt to improve the recognition of affiliations by lexicon-based methods described in Section 8.

7.1 Reformatting the Author field

Reformatting the author field uses forward-chaining, rule-based deduction. The reformat module can have many rules defined for a particular field. Each rule has a number of requirements, among which are that it must

  • Be associated with a specific journal title (ISSN number);
  • Fall into one of eight categories as listed in Table 7.1. The categories are pre-defined in the reformat module and are required by our conflict resolution strategy, which in our case is specificity ordering. Whenever the conditions of one triggering rule are a superset of those of another rule, the superset rule takes precedence because it deals with a more specific situation. An example of this is shown later.

The example column in Table 7.1 shows the complete reformatted field. Note that a single rule or category does not necessarily complete the reformatting; several rules or categories may need to be combined to achieve correct reformatting of the author field.

With the eight categories defined, the first step in using the reformat module is to define which rules are appropriate for a particular ISSN (journal title), since the printed format varies widely among journals. As an example, in one journal the authors appear as:

Glenn M Ford, MD, John Smith, PhD, and John Glover

This can be difficult to parse with a default set of delimiter rules, such as ', and' and ',', so other rules need to be defined. By defining, in the database, the rules for a specific journal title over a specific period of time, we can customize the rules to work for unusual or specific cases. Journals often change formats over the years to accommodate new publishers or printers. Therefore the rules may need to change even though the journal title remains the same.
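One way to picture rules that are tied to a journal title and a period of time is the sketch below. The record layout, the field names, the placeholder ISSN and the dates are all assumptions made for illustration; the actual database schema is not described here.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass(frozen=True)
class Rule:
    issn: str                         # journal the rule belongs to
    field: str                        # 'author' or 'title'
    category: str                     # e.g. 'delimiter' (hypothetical category name)
    pattern: str                      # text that triggers the rule
    priority: int                     # larger value = applied earlier
    valid_from: date
    valid_to: Optional[date] = None   # None means the rule is still in force


def select_rules(rules: List[Rule], issn: str, field: str, issue_date: date) -> List[Rule]:
    """Return the rules in force for this journal, field and issue date, highest priority first."""
    selected = [
        r for r in rules
        if r.issn == issn
        and r.field == field
        and r.valid_from <= issue_date
        and (r.valid_to is None or issue_date <= r.valid_to)
    ]
    return sorted(selected, key=lambda r: r.priority, reverse=True)


# Placeholder data: '0000-0000' is not a real journal.
RULES = [
    Rule("0000-0000", "author", "delimiter", ",", 1, date(1990, 1, 1)),
    Rule("0000-0000", "author", "delimiter", ", and", 2, date(1990, 1, 1)),
    Rule("0000-0000", "author", "delimiter", ", MD", 3, date(1998, 1, 1)),
]

for rule in select_rules(RULES, "0000-0000", "author", date(2000, 6, 1)):
    print(rule.priority, repr(rule.pattern))
```

With this layout, a format change at a journal is handled by closing the old rules with a `valid_to` date and adding new ones, without touching the code.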

The above example fails in the default rule set that only has ',' and ', and' as the author delimiters because this would incorrectly identify 'MD' and 'PhD' as author names. To accommodate this journal (and others like it) a high priority rule trigger list was created for author delimiters such as ', MD', ', PhD', 'Mr.', 'Dr.', and other formal titles.

To avoid conflict among rules, each word chain is passed through all the categories recursively until no more rules are triggered. As long as we have an antecedent with consequences, we continue to process the word chain. In the forward chaining method, when an "if statement" is observed to match an assertion, the antecedent (i.e., the if statement) is satisfied. When the entire set of if statements is satisfied, the rule is triggered. Each rule that is triggered records, in a working memory node, that it was executed. During conflict resolution the reformat module decides which rules take priority over others via specificity ordering. An example would be:

Reduce category executes on 'John Smith II' and makes this 'J S II'.
Convert category executes on 'John Smith II', marks 'Smith' as the convert pre-word, and converts 'II' to '2nd'.

Our conflict resolution method specifies that the convert category is more specific than the reduce category, thus keeping the word 'Smith' and '2nd'. In addition, the pre-word convert flag in this particular example signals the conflict resolution manager to keep 'Smith', reduce 'John' to the initial 'J', and append '2nd'. This is possible because we have retained both the original text and the converted text. The convert rule has informed us that the word 'Smith' remained unchanged, and by examining all the words we deduce that this is the last name.

Example Before/After:
Before - John Smith II
After - Smith J 2nd
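The 'John Smith II' example can be sketched as follows. This is a simplified illustration under stated assumptions, not the reformat module itself: the category functions, the numeric specificity ranks, and the heuristic that the reduce category leaves all-uppercase tokens alone are hypothetical, and the sketch covers only names shaped like this example.

```python
SPECIFICITY = {"reduce": 1, "convert": 2}   # hypothetical ranks; the higher rank wins
CONVERSIONS = {"II": "2nd"}


def reduce_category(words):
    """Propose an initial for each ordinary word: 'John Smith II' -> 'J S II'."""
    # Assumption: all-uppercase tokens such as 'II' are left untouched by reduce.
    return {i: ("initial", w[0]) for i, w in enumerate(words) if not w.isupper()}


def convert_category(words):
    """Convert known tokens and flag the word before them (the pre-word) as the last name."""
    proposals = {}
    for i, w in enumerate(words):
        if w in CONVERSIONS:
            proposals[i] = ("suffix", CONVERSIONS[w])
            if i > 0:
                proposals[i - 1] = ("last_name", words[i - 1])
    return proposals


def resolve(words):
    """Apply both categories, then keep the most specific proposal for each word."""
    working_memory = {}   # word index -> (rank, action, text)
    for name, category in (("reduce", reduce_category), ("convert", convert_category)):
        rank = SPECIFICITY[name]
        for i, (action, text) in category(words).items():
            if i not in working_memory or rank > working_memory[i][0]:
                working_memory[i] = (rank, action, text)

    def collect(wanted, joiner):
        return joiner.join(t for _, (_, a, t) in sorted(working_memory.items()) if a == wanted)

    parts = (collect("last_name", " "), collect("initial", ""), collect("suffix", " "))
    return " ".join(p for p in parts if p)


print(resolve("John Smith II".split()))
```

Both categories make a proposal for 'Smith'; because convert outranks reduce, the word is kept whole rather than reduced to 'S'.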

At the category level, the conflict resolution strategy is specificity ordering. There is also a conflict resolution strategy within a given category: priority list rule ordering. Rules within a given category are assigned a priority level to avoid conflicts. An example of this is the following list of authors appearing on the printed page:

Glenn Ford, John Smith, and David Wells

We have the following author delimiter rules defined:

',' and ', and'

However, ',' is assigned priority 1 and ', and' is assigned the higher priority 2, so that ', and' is tested first. If we did not give a higher priority to ', and', we could end up with 'and' as part of an author name or create a null value.
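The effect of priority ordering within the delimiter category can be seen in a short sketch. The function `split_authors` and the (delimiter, priority) tuple layout are hypothetical; the sketch assumes that a larger priority number means the delimiter is applied earlier.

```python
# (delimiter, priority) pairs; a larger priority is applied first.
DELIMITERS = [(",", 1), (", and", 2)]


def split_authors(field, delimiters=DELIMITERS):
    """Split an author field on each delimiter in turn, highest priority first."""
    parts = [field]
    for delim, _priority in sorted(delimiters, key=lambda d: d[1], reverse=True):
        parts = [piece for part in parts for piece in part.split(delim)]
    # Drop the empty strings (null values) that adjacent delimiters can leave behind.
    return [p.strip() for p in parts if p.strip()]


field = "Glenn Ford, John Smith, and David Wells"
print(split_authors(field))
# With the priorities swapped, ',' consumes the comma of ', and' first.
print(split_authors(field, [(",", 2), (", and", 1)]))
```

With the priorities as defined, the field splits into three clean names. With the priorities swapped, the third piece comes out as 'and David Wells', which is the failure the priority list is meant to prevent.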

In ground truth testing of the author reformat rules system, we tested 1,857 authors from OCR data. Of that number, 41 were reformatted incorrectly, for a 97.29% correct reformatting rate. All 41 failures were caused by cases for which no rule had been defined. An example of a missing rule is given by an author field that reads:

Glenn M. Ford, Jr., John Smith.

By simply adding the rule [', Jr. ' author delimiter priority 2], and with no changes in code, we achieved 100% correct reformatting in the test set.

7.2 Reformatting the Article Title field

The title field uses the same principles as in the author rules system, but requires fewer rules or categories. Of the eight rule categories required for reformatting authors, only three are needed to reformat titles: Uppercase, Lowercase and First Letter Upper.
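A minimal sketch of the three title categories follows. The function names are hypothetical, and the acronym test (a word whose letters are all uppercase and that has at least two letters) is an assumption made for illustration; it would not work on a title printed entirely in capitals.

```python
def is_acronym(word):
    """Heuristic (an assumption): two or more letters, all of them uppercase."""
    letters = [c for c in word if c.isalpha()]
    return len(letters) > 1 and all(c.isupper() for c in letters)


def reformat_title(title):
    """Apply the Uppercase, First Letter Upper and Lowercase categories word by word."""
    out = []
    for index, word in enumerate(title.split()):
        if is_acronym(word):
            out.append(word)                                  # Uppercase: acronym kept as is
        elif index == 0:
            out.append(word[0].upper() + word[1:].lower())    # First Letter Upper
        else:
            out.append(word.lower())                          # Lowercase
    return " ".join(out)


print(reformat_title("Medical Management of AIDS Patients"))
```

For the title used as an example earlier in this section, the sketch produces "Medical management of AIDS patients".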








URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_8.php
Last updated December 06, 2001
