National Library of Medicine (HTTP://www.nlm.nih.gov), Communications Engineering Branch, Lister Hill National Center for Biomedical Communications (HTTP://www.lhncbc.nlm.nih.gov/)
 



Automating the production of bibliographic records for MEDLINE

6.3 Rule-based algorithms in AL module

While all contiguous text regions on a page image are zoned, the only ones of interest in the current MARS system are the article title, author, affiliation and abstract. Since affiliation information could reside in the top part of the page as well as at the bottom, for labeling purposes, we define an "upper affiliation" and a "lower affiliation" zone. Hence, we have five possible labels. The remaining zones are labeled as "other". For each label type, there are four types of rules as shown in Table 6.3: rule types 1, 2 and 3 that are different for each label classification, and rule type 4 that is the same for all. Our rule-based algorithm consists of four steps.

In the first step, a probability of correct identification (PCI) is used in rule type 1. Every zone has five PCIs, one for each label. A PCI is equivalent to the probability of a zone possessing a particular label. The PCIs are derived empirically. For example, in the case of upper affiliation, when more than 30% of the words in a zone belong to the affiliation word list, the PCI of upper affiliation is 100. Otherwise, PCI = PercentOfAffiliation×100/30. In the case of author, when more than 28% of the words in a zone belong to the list of middle names and academic degrees, the PCI of author is 100. Otherwise, PCI = (PercentOfAcademicDegree + PercentOfMiddleName)×100/28. In this first step, when a zone has the highest PCI for a particular label, it is assigned that label.
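
The two PCI formulas quoted above share one shape: a percentage scaled linearly up to a threshold and held at 100 beyond it. A minimal sketch in Python follows; the function names are illustrative and do not come from the MARS code, and the full rule set in Table 6.4 adds further conditions that this sketch leaves out.

```python
def capped_pci(percent: float, threshold: float) -> float:
    """Scale `percent` linearly up to `threshold`; return 100 above it."""
    if percent > threshold:
        return 100.0
    return percent * 100.0 / threshold


def affiliation_pci(percent_of_affiliation: float) -> float:
    # 30% threshold for the affiliation word list, as stated in the text.
    return capped_pci(percent_of_affiliation, 30.0)


def author_pci(percent_of_academic_degree: float,
               percent_of_middle_name: float) -> float:
    # 28% threshold on the combined share of degrees and middle names.
    combined = percent_of_academic_degree + percent_of_middle_name
    return capped_pci(combined, 28.0)
```

For instance, a zone in which 15% of the words are on the affiliation word list receives half of the maximum score under this formula.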

The PCI thresholds of 30% and 28% for affiliation and author, respectively, are established heuristically. In the case of author, we often find that an author zone contains two authors, that each author name usually consists of three words, and that "and" separates the two names, for a total of seven words. We also find that middle initials and academic degrees are associated with author names. So, a zone is likely to be labeled as author when the ratio of the sum of academic degrees and middle initials to the total number of words in the zone exceeds 2/7, or 28.6%, which motivates the 28% threshold. In the case of affiliation, it has been determined that a zone is likely to be labeled as affiliation when more than 30% of its words belong to the affiliation word list.

In the second step, the labeling results from step 1 are rechecked by rule type 4. For example, when two zones are both labeled as author but one of those zones is located between title and upper affiliation, and the other is located between upper affiliation and abstract, the latter is removed from the author category.
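
The duplicate-author example above can be sketched as follows. This is only an illustration under stated assumptions: the `Zone` class, the label strings, and the choice to reset a rejected zone's label to `None` are hypothetical, since the text says only that the zone is removed from the author category.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Zone:
    zone_id: int
    label: Optional[str]  # None means "not yet labeled" (assumption)
    top: int              # TopCoordinate, increasing down the page


def recheck_author_zones(zones: List[Zone]) -> List[Zone]:
    """Drop the author label from a zone lying between upper affiliation
    and abstract when another author zone sits in the expected position."""
    def first_top(label: str) -> Optional[int]:
        tops = [z.top for z in zones if z.label == label]
        return min(tops) if tops else None

    title_top = first_top("title")
    affil_top = first_top("upper_affiliation")
    abstract_top = first_top("abstract")
    if None in (title_top, affil_top, abstract_top):
        return zones  # not enough context to apply the check

    authors = [z for z in zones if z.label == "author"]
    if any(title_top < z.top < affil_top for z in authors):
        for z in authors:
            if affil_top < z.top < abstract_top:
                z.label = None
    return zones
```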

In the third step, rule type 2 is applied, and rule types 1 and 4 are applied again, to make sure that at least one zone is labeled as title, author, abstract, upper affiliation or lower affiliation. For example, when a zone initially labeled as author does not contain information relevant to author (NumberOfMiddleName = 0 and NumberOfAcademicDegree = 0), its location is then used to do the labeling. That is, its label as author is verified by the facts that (a) it does not contain information related to title or upper affiliation zones, and (b) it is located between title and upper affiliation zones.
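
The fallback behavior of rule type 2 in this step might look like the sketch below. Several details are assumptions made for illustration rather than facts from the MARS system: the dictionary layout, the restriction to zones that are still unlabeled, and the requirement that the winning PCI be greater than zero.

```python
LABELS = ("title", "author", "upper_affiliation",
          "lower_affiliation", "abstract")


def fill_missing_labels(pcis, labels):
    """Give each label that has no zone the best remaining candidate.

    pcis:   {zone_id: {label: pci}}
    labels: {zone_id: label or None}
    Returns a new labels dict; the inputs are not modified.
    """
    result = dict(labels)
    for label in LABELS:
        if label in result.values():
            continue  # this label already has a zone
        candidates = [z for z in pcis if result.get(z) is None]
        if not candidates:
            continue
        best = max(candidates, key=lambda z: pcis[z].get(label, 0))
        if pcis[best].get(label, 0) > 0:  # assumed minimum evidence
            result[best] = label
    return result
```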

In the fourth step, problems caused by zoning errors such as a zone split into multiple zones are handled by all rules, and any remaining unlabeled zones are labeled.

A total of 120 rules were generated for zone labeling types 10000, 10006, 12000, 12006, and 12200. An example of the detailed rules used to detect upper affiliation is shown in Table 6.4.

Table 6.3 Rule Types used in AL Module
Rule Type   Description
1           Use the probability of correct identification (PCI). Each label has its own PCI equation. Example: when a zone has a high PCI for a label (PCI ≥ 100), the zone is assigned that label.
2           When a label does not have any zone with PCI ≥ 100, pick the zone that has the highest PCI for the label, and assign that label to the zone.
3           Some features should be similar within zones of the same label. For example, when a title zone is divided into two separate zones, the two zones have similar FontSize, FontAttribute, MedianLineHeight, etc.
4           TopCoordinate of title < TopCoordinate of author < TopCoordinate of upper affiliation < TopCoordinate of abstract < TopCoordinate of lower affiliation
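
Rule type 4 amounts to a check of vertical reading order. The sketch below assumes that the intended order is title, author, upper affiliation, abstract, lower affiliation, and that labels missing from a page are simply skipped; both the function name and the dictionary input are hypothetical.

```python
ORDER = ("title", "author", "upper_affiliation",
         "abstract", "lower_affiliation")


def satisfies_rule_type_4(top_coordinate):
    """top_coordinate maps a label to the TopCoordinate of its zone.

    Returns True when the labels that are present appear in strictly
    increasing TopCoordinate order.
    """
    tops = [top_coordinate[label] for label in ORDER
            if label in top_coordinate]
    return all(a < b for a, b in zip(tops, tops[1:]))
```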


Table 6.4 Example of rules to detect Upper Affiliation
Rule Type   Rule Description
1           1. TopCoordinate < HeightOfArticle/2
            2. BottomCoordinate < HeightOfArticle×3/4
            3. NumberOfWord > 2
            4. NumberOfAcademicDegree < 3 or PercentOfAcademicDegree < 30
            5. NumberOfMiddleName < 3 or PercentOfMiddleName < 30
            6. PercentOfCapitalCharacter < 50
            7. NumberOfHeadtitle == 0, NumberOfAbstract == 0 and NumberOfIntroduction == 0
            8. If all of the above conditions are satisfied {
                  If ( NumberOfAffiliation > 2 ) {
                     If ( PercentOfAffiliation > 30 ) PCI = 100;
                     Else PCI = PercentOfAffiliation×100/30;
                  }
                  Else {
                     If ( PercentOfAffiliation > 30 ) PCI = 50;
                     Else PCI = PercentOfAffiliation×50/30;
                  }
               }
               Else {
                  PCI = 0
               }
2           If ( PCI < 100 ), pick the zone having the highest PCI for upper affiliation.
3           1. If ( PCI > 25 and the next zone has NumberOfReceived == 1 ) PCI = 100.
            2. The distance from the zone to the upper affiliation zone is smaller than its distance to any other label zone.
            3. FontSize, FontAttribute, MedianOfLineHeight, and MedianOfLineSpace of the zone must be similar to those of the upper affiliation zone.
4           TopCoordinate of title < TopCoordinate of author < TopCoordinate of affiliation < TopCoordinate of abstract
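
Rule type 1 of Table 6.4 translates fairly directly into code. The sketch below follows the conditions and pseudocode in the table; the `ZoneFeatures` container and its snake_case field names are illustrative stand-ins for the feature names in the table, not the actual data structures of the AL module.

```python
from dataclasses import dataclass


@dataclass
class ZoneFeatures:
    top_coordinate: float
    bottom_coordinate: float
    height_of_article: float
    number_of_word: int
    number_of_academic_degree: int
    percent_of_academic_degree: float
    number_of_middle_name: int
    percent_of_middle_name: float
    percent_of_capital_character: float
    number_of_headtitle: int
    number_of_abstract: int
    number_of_introduction: int
    number_of_affiliation: int
    percent_of_affiliation: float


def upper_affiliation_pci(z: ZoneFeatures) -> float:
    """PCI for upper affiliation, following rule type 1 of Table 6.4."""
    conditions = (
        z.top_coordinate < z.height_of_article / 2,
        z.bottom_coordinate < z.height_of_article * 3 / 4,
        z.number_of_word > 2,
        z.number_of_academic_degree < 3 or z.percent_of_academic_degree < 30,
        z.number_of_middle_name < 3 or z.percent_of_middle_name < 30,
        z.percent_of_capital_character < 50,
        z.number_of_headtitle == 0 and z.number_of_abstract == 0
        and z.number_of_introduction == 0,
    )
    if not all(conditions):
        return 0.0
    # More than two affiliation words allows a full score; otherwise 50.
    ceiling = 100.0 if z.number_of_affiliation > 2 else 50.0
    if z.percent_of_affiliation > 30:
        return ceiling
    return z.percent_of_affiliation * ceiling / 30
```

Writing the two branches with a shared `ceiling` makes it visible that the only difference between them is the maximum score a zone can reach.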

6.4 Research tool for labeling

As noted earlier, the zoning and labeling functions are integrated into one module, ZoneCzar. Since this is a daemon, there is no operator workstation. However, a GUI is provided for a supervisor or the development team to check on problems or progress. Apart from this, we created a research tool, Visual ZoneCzar, to help develop and test the algorithms used for labeling. It was developed with Visual C++ 6.0.


Figure 6.4 GUI of Visual ZoneCzar

The purpose of Visual ZoneCzar is to test the algorithms on page images from a new journal title to be included in the list of journal titles that may be processed automatically by MARS-2. If the existing algorithms fail to successfully label the zones from the new journal, rules are modified, and tests are repeated. This tool helps the researcher check and verify that the algorithmic rules are in fact applicable to a particular journal.

As shown in Figure 6.4, the GUI of Visual ZoneCzar has two windows. The left window displays zoning and labeling results on a TIFF image. Zone boundaries are drawn in red. The labeling results are indicated by different background colors and text. For example, the article title zone is red and accompanied by the word TITLE in red, the author zone is green with the word AUTHOR in green, and so on. Zones that are not of interest are in gray.

The right window displays the text in the zones that are shown labeled in the left window, and the labeling rules in the algorithm. In the example, the text in the zones that have been labeled as affiliation and abstract are shown. The 14×17 table in the middle of the window gives information on the rules being applied. A summary follows describing the contents of this table.

Each row of the table corresponds to one zone, in the order in which the zones appear in the OCR output data. The first column indicates the zone number; the second column contains a number representing the labeling result. For example, a "3" in the second column of the second row means that zone 2 is labeled as title. Other numbers (4, 5 or 7) refer to the author, upper affiliation, and abstract labels, respectively. The third through thirteenth columns show the calculated PCIs of each zone for rubric, title, author, upper affiliation, word abstract, abstract, keyword, introduction, lower affiliation, upper received, and lower received. A PCI of 100 means that the zone is labeled as one of the five important labels, which are title, author, upper affiliation, lower affiliation, and abstract. For the other labels (not important to the MARS system), a PCI of -99 is assigned. The fourteenth column has "100" when the zone was assigned a PCI of -99. The fifteenth column has "100" when the zone is none of the identifiable zones. The sixteenth column shows the rule number used to label the zone. In the second row, for example, there are "2", "3", "100", and "1301" in the first, second, fourth, and sixteenth columns. This is shorthand indicating that zone number two is labeled as title by rule 1301. The third row shows that zone number three is labeled as author by rule 1400, and has a PCI of 50 for the title label. The eighth row shows that zone number eight is labeled as introduction by rule 1900.

A four-digit number is used to identify a rule. The highest digit indicates the step of labeling process, the next digit indicates the label, and the lowest two digits indicate the rule number. For example, 1301 in the second row means that in step one (1) of the labeling procedure for the title label (3), rule one (01) was used.
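
The four-digit rule identifier can be unpacked with integer division. In this sketch the function name is hypothetical, and the label table lists only the codes that the text states explicitly (3, 4, 5 and 7).

```python
# Label codes stated in the text; other codes are left unnamed here.
LABEL_NAMES = {3: "title", 4: "author", 5: "upper affiliation", 7: "abstract"}


def decode_rule(rule_number: int):
    """Split a four-digit rule number into (step, label code, rule index)."""
    if not 1000 <= rule_number <= 9999:
        raise ValueError("rule number must have four digits")
    step, rest = divmod(rule_number, 1000)
    label_code, rule_index = divmod(rest, 100)
    return step, label_code, rule_index


step, label_code, rule_index = decode_rule(1301)
print(step, LABEL_NAMES.get(label_code, "unnamed"), rule_index)
```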

Other information about the journal issue (MRI) obtained from the JournalName table in the database is displayed at the bottom of the right window.

The tool bar of Visual ZoneCzar offers the researcher twelve buttons to navigate and control the data. The first button displays the first page image in the journal issue, the second button displays the previous page, the third button displays a page in the middle of the group of pages, the fourth button displays the next page, and the fifth button displays the last page. The sixth and seventh buttons are to minimize and maximize the TIFF images. The eighth through the twelfth buttons control the zoning and labeling process. The AZ button runs the zoning module, DR OCR and DR AZ buttons display the OCR and zoning results, AL runs the labeling process, and ALL runs both zoning and labeling modules.

6.5 Performance in production

Currently the AL module can reliably process 2,027 journal titles from the 4,300+ titles indexed in MEDLINE. Since NLM receives bibliographic data for 580 of these directly from publishers, the actual number of titles that may be processed by MARS-2 is 1,447.

In Table 6.5 we show performance data for the month of February 2001 for 159 journal issues containing 2,524 articles processed by MARS-2. This collection exhibited four layout types. There were 101, 10, 37 and 11 journal issues in zone labeling types 10000, 10006, 12000, and 12200 respectively.

The data show that labeling errors caused by incorrect OCR output account for an error rate of 0.40%, and errors caused by poor zoning (AZ) for 0.63%. The error rate attributed to the AL module itself, when OCR and AZ are correct, is 0.20%. The relatively high error rate in the affiliation field arises because text in this field is small and frequently italicized, both factors contributing to poor detection by the OCR system. Overall, the AL module delivers an accuracy of 98.77%.

Table 6.5 Automatic labeling performance
Error Type                Title  Author  Affiliation  Abstract  Totals  % of Error
Bad OCR                       0       1            9         0      10        0.40
Automated Zoning (AZ)         2       8            6         0      16        0.63
Automated Labeling (AL)       1       3            1         0       5        0.20
Totals                        3      12           16         0      31        1.23
% of Error                 0.12    0.48         0.63         0    1.23
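
The percentages in Table 6.5 can be reproduced from the error counts if each is taken relative to the 2,524 articles processed; that denominator is inferred from the arithmetic rather than stated in the table. A short check in Python:

```python
ARTICLES = 2524  # articles processed in February 2001

# Error counts per field: (title, author, affiliation, abstract)
ERRORS = {
    "Bad OCR": (0, 1, 9, 0),
    "Automated Zoning (AZ)": (2, 8, 6, 0),
    "Automated Labeling (AL)": (1, 3, 1, 0),
}


def percent(count: int) -> float:
    """Error count as a percentage of all articles, to two decimals."""
    return round(100 * count / ARTICLES, 2)


row_percent = {name: percent(sum(counts)) for name, counts in ERRORS.items()}
column_totals = [sum(col) for col in zip(*ERRORS.values())]
total_errors = sum(column_totals)
accuracy = round(100 - 100 * total_errors / ARTICLES, 2)

print(row_percent)
print(column_totals, total_errors, percent(total_errors), accuracy)
```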

6.6 Ongoing research

As mentioned earlier, we used empirical methods to derive the PCI thresholds for each label, such as the 28% and 30% word-list percentages used for author and affiliation. We plan to refine these figures by using statistical data, that is, to create histograms of every word list collected from the journals processed by MARS-2 for each label zone, and to select thresholds based on these histograms.

We are continually increasing the number of journal titles accommodated by MARS-2, but we find that a number of these do not follow the relatively regular layout types that the system can process at present. Figure 6.5 shows examples of these irregular layouts. Figure 6.5(a) has the abstract to the left of the article title, Figure 6.5(b) has author and affiliation to the right of the title, and Figure 6.5(c) has author to the left of the title, all quite different from the "regular" layouts. One approach to dealing with these irregular layouts is to develop a template matching algorithm based on the average font size and the average top-left and bottom-right coordinates of all important zones. These features will be stored in the database in a journal-specific manner. When a journal issue with irregular layout is processed, the AL module will read the zone coordinates and the font size of the text in the zone, and match them against the stored information.
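
One way the proposed template matching could work is sketched below. The text describes this as planned work and does not specify a matching criterion, so everything here is an assumption for illustration: the `Box` class, the distance measure (sum of corner distances plus a weighted font-size difference), the weight, and the cutoff beyond which a zone is labeled "other".

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    left: float
    top: float
    right: float
    bottom: float
    font_size: float


def distance(zone: Box, template: Box, font_weight: float = 10.0) -> float:
    """Corner distances plus a weighted font-size difference (assumed metric)."""
    corners = (math.hypot(zone.left - template.left, zone.top - template.top)
               + math.hypot(zone.right - template.right,
                            zone.bottom - template.bottom))
    return corners + font_weight * abs(zone.font_size - template.font_size)


def match_zone(zone: Box, templates: dict, max_distance: float = 200.0) -> str:
    """Return the label of the closest stored template, or "other"."""
    best = min(templates, key=lambda label: distance(zone, templates[label]))
    if distance(zone, templates[best]) <= max_distance:
        return best
    return "other"


# Hypothetical per-journal averages for a layout like Figure 6.5(a),
# with the abstract to the left of the title.
templates = {
    "abstract": Box(100, 300, 600, 900, 9),
    "title": Box(700, 300, 1500, 500, 18),
}
print(match_zone(Box(110, 310, 590, 880, 9), templates))
```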


Figure 6.5 Examples of journals exhibiting irregular layout. (a) Abstract is located to the left of the title. (b) Author and affiliation are located to the right of the title. (c) Author is located to the left of the title.




URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_7.php
Last updated December 06, 2001
