National Library of Medicine, HTTP://www.nlm.nih.gov Communications Engineering Branch Title Lister Hill National Center for Biomedical Communications, HTTP://www.lhncbc.nlm.nih.gov/
 



Automating the production of bibliographic records for MEDLINE

5.2 Evaluation of automated zoning

Following initial testing and refinement, the zoning algorithm was tested with a set of page images from 59 journal issues that would become the first set of journals to be processed by the MARS-2 system. Journals selected had a page layout in which the title, authors, affiliations and abstract were all in one column, and appeared on the page in that order. Table 5.2 summarizes the scores for the 295 images in this set. Overall, of the 1,180 possible zones of interest, the zone correction program generated 1,155 correct zones, for a correct rate of 97.9%.

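Table 5.2 Results of zone correction for 295 pages from 59 journal issues

| Field | Split | Too big | Too small | Merged | Total | % images with an error in this field |
|---|---|---|---|---|---|---|
| Title | 7 | 0 | 0 | 0 | 7 | 2.4 |
| Author | 1 | 0 | 0 | 4 | 5 | 1.7 |
| Affiliation | 4 | 0 | 0 | 5 | 9 | 3.1 |
| Abstract | 3 | 0 | 0 | 1 | 4 | 1.4 |
| Total | 15 | 0 | 0 | 10 | 25 | |
| % images with this error | 5.1 | 0 | 0 | 3.4 | | |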
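
The summary figures quoted above can be re-derived from the table counts. The short sketch below does that arithmetic; the function and variable names are illustrative, and it assumes four zones of interest per image and at most one error per field per image, which is what the published percentages imply.

```python
# Error counts per field, transcribed from Table 5.2:
# (split, too big, too small, merged)
ERRORS = {
    "Title": (7, 0, 0, 0),
    "Author": (1, 0, 0, 4),
    "Affiliation": (4, 0, 0, 5),
    "Abstract": (3, 0, 0, 1),
}
NUM_IMAGES = 295


def summarize(errors, num_images):
    """Recompute the totals and percentages reported with Table 5.2."""
    total_zones = num_images * len(errors)  # one zone per field per image
    total_errors = sum(sum(counts) for counts in errors.values())
    correct_zones = total_zones - total_errors
    per_field = {
        field: round(100.0 * sum(counts) / num_images, 1)
        for field, counts in errors.items()
    }
    return {
        "total_zones": total_zones,
        "correct_zones": correct_zones,
        "correct_rate": round(100.0 * correct_zones / total_zones, 1),
        "percent_images_with_error": per_field,
    }


summary = summarize(ERRORS, NUM_IMAGES)
```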

5.3 Implementation

Based on the low error rates achieved in testing, the automatic zone correction algorithm was implemented for the MARS-2 system. A C++ zone correction class was written in the Microsoft Visual Studio development environment. The class is incorporated into the ZoneCzar module, which also includes the automated labeling function described in Section 6.

5.4 Performance in production

The original zone correction algorithm has continued to evolve in response to feedback from production operators and to observations from continued internal evaluation. As more journal layout types are added to those processed by MARS-2, code to accommodate new circumstances has been added to the algorithm, but the overall design has not changed. For example, to correctly zone pages in which affiliations are found at or near the bottom of the page, usually in small-sized fonts, computed threshold values are different for lines and zones that begin at the bottom third of the page than they are for the rest of the page. Although performance has remained consistently good, we anticipate challenges as we increase the number of journal titles and layout types accommodated by MARS in the future.
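
The position-dependent thresholds mentioned above might look something like the following sketch. It is not the MARS-2 code: the function name, the idea of scaling a median line height, and both factor values are assumptions made purely for illustration.

```python
def line_gap_threshold(zone_top, page_height, median_line_height,
                       normal_factor=1.5, bottom_factor=1.0):
    """Return a hypothetical merge threshold for a zone or line.

    Zones that begin in the bottom third of the page (where small-font
    affiliations often sit) get a different factor than the rest of the
    page. Both factors are placeholder values, not measured ones.
    """
    # Integer comparison avoids floating-point edge cases at the boundary.
    in_bottom_third = zone_top * 3 >= page_height * 2
    factor = bottom_factor if in_bottom_third else normal_factor
    return factor * median_line_height
```

With a 3000-pixel page, a zone starting at y = 2500 falls in the bottom third and uses the tighter factor, while a zone starting at y = 500 uses the normal one.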

6. Automated labeling

Once the contiguous text regions in a bitmapped page image are zoned, the next step is to label the zones, i.e., identify each zone as one of the bibliographic fields of interest. The figure below shows the sequence of steps: the bitmapped TIFF image of the scanned page, the output of the automated zoning module (AZ) and the output of the automated labeling module (AL).

Figure 6.1 Results of AZ and AL modules. (a) Bitmapped page image, with no zones selected; (b) AZ output, with the scanned image sectioned into zones; (c) AL output, with the zones labeled.

Image analysis techniques for document labeling proposed in the literature [33-37] are based mostly on the layout (geometric) structure and/or the logical structure of a document. Hones et al. [33] describe an algorithm for layout extraction of mixed-mode documents, and the classification of these documents as text or non-text. Taylor et al. [34] describe a prototype system using a feature extraction and model-based approach. Tsujimoto et al. [35] present a rule-based technique based on the transformation from a geometric structure to a logical structure. Tateisi et al. [36] propose a method based on stochastic syntactic analysis to extract the logical structure of a printed document. They use simple rules to label documents into three classes. Niyogi et al. [37] use a rule-based system to label newspaper contents with thirteen labels such as headline, text paragraph, photograph, and so on. These labeling techniques rely mostly on rule-based algorithms, but other mechanisms such as artificial neural networks (ANN) and decision trees have also been investigated.

One drawback of ANN and decision tree methods is that they need training as a pre-processing stage. That is, the algorithms need to be re-trained whenever a new document type (in our case, a journal layout not seen previously) is encountered, and the training time is proportional to the number of journal titles to be processed. Not only is this time consuming, it also makes it difficult to handle exceptional situations quickly. In addition, these techniques cannot readily use geometric information, e.g., the geometry between zones. Rule-based algorithms, on the other hand, do not need re-training, can employ geometric information readily, and moreover, can accommodate exceptional cases (slight divergence from a known layout type) by the addition of new rules. Since the 4,300+ journal titles indexed in MEDLINE exhibit a wide range of layout types, such exceptional cases can occur frequently. An automated labeling system needs to handle a multiplicity of layout types and exceptional cases quickly, and without extensive pre-processing and training.

Our research in this area focused on three approaches: the rule-based algorithmic approach, an ANN method, and a template-matching technique. Our experiments and findings are reported in the literature [38-40]. Based on these experiments, we decided to base our labeling system on rule-based algorithms, since this approach delivered a high accuracy rate and high speed of execution, and was amenable to modification as new layout types were added.

Our approach relies on data from the OCR system which delivers information at the zone, line and character level:

| Level | Data delivered by the OCR system |
|---|---|
| Zone level | Zone boundaries, number of text lines |
| Line level | Line boundaries, number of characters, average character height |
| Character level | 8-bit character code, confidence level (1 = lowest, 9 = highest), bounding box, font size, font attribute (normal, bold, underlined, italics, superscript, subscript, and fixed pitch) |
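
One way to picture this three-level OCR output is as nested records. The sketch below is only an illustration: the class names, field names, and the (left, top, right, bottom) box ordering are assumptions, not the format of the actual OCR system.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed bounding-box ordering: (left, top, right, bottom), in pixels.
Box = Tuple[int, int, int, int]


@dataclass
class OcrCharacter:
    code: int             # 8-bit character code
    confidence: int       # 1 = lowest, 9 = highest
    box: Box
    font_size: int
    font_attribute: str   # e.g. "normal", "bold", "italics"


@dataclass
class OcrLine:
    box: Box
    characters: List[OcrCharacter] = field(default_factory=list)

    @property
    def number_of_characters(self) -> int:
        return len(self.characters)

    @property
    def average_character_height(self) -> float:
        if not self.characters:
            return 0.0
        heights = [c.box[3] - c.box[1] for c in self.characters]
        return sum(heights) / len(heights)


@dataclass
class OcrZone:
    box: Box
    lines: List[OcrLine] = field(default_factory=list)

    @property
    def number_of_lines(self) -> int:
        return len(self.lines)
```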

The OCR output data is used to generate geometric and non-geometric features that, in turn, are used to create rules. Geometric features are based on a zone's location, order of appearance, and dimensions. For example, the article title zone is usually located in the top half of the page, followed by author, affiliation and abstract, in that order.
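
As a rough illustration of geometric features, the sketch below orders zones by their top-left edge and checks whether a zone starts in the top half of the page. The `Zone` tuple, the sort key (top coordinate first, then left), and the function names are assumptions for this example rather than the module's actual definitions.

```python
from typing import List, NamedTuple


class Zone(NamedTuple):
    name: str    # identifier used only for this example
    left: int
    top: int
    right: int
    bottom: int


def assign_zone_order(zones: List[Zone]) -> List[Zone]:
    """Sort zones by top-left edge: top coordinate first, then left."""
    return sorted(zones, key=lambda z: (z.top, z.left))


def starts_in_top_half(zone: Zone, page_height: int) -> bool:
    """Geometric cue for a title candidate: the zone begins in the top half."""
    return zone.top * 2 < page_height


page = [
    Zone("z3", 100, 900, 1900, 1600),
    Zone("z1", 100, 200, 1900, 320),
    Zone("z2", 100, 400, 1900, 460),
]
ordered = assign_zone_order(page)
```

Here `ordered` lists the zones as z1, z2, z3, which is the order in which a rule expecting title, author, and then the remaining fields would examine them.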

Non-geometric features are derived from the text contents of a zone, aggregate statistics, and font characteristics. For example, some zones can be characterized by the words in them, and the frequency with which they occur. In such cases, word matching is an important technique to generate non-geometric features in the AL module. For example, a zone has a higher probability of being labeled as "affiliation" when it has words representing country, city and school names. Also, a zone positioned between the words "abstract" and "keywords" is more likely to be an abstract than any other bibliographic field. Fifteen database tables containing word lists have been assembled as shown in Table 6.1. Table 6.2 shows examples of geometric and non-geometric features.
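
A word-matching feature of this kind can be sketched as a count of zone words that appear in a word list, together with the matching percentage. The code below is a simplified illustration: it uses a plain Python set in place of the database tables, takes only the three sample affiliation words given in Table 6.1, and does not handle multi-word entries.

```python
import re

# Sample entries only; the real Affiliation table is much larger.
AFFILIATION_WORDS = {"university", "department", "institute"}


def word_match_features(text, word_list):
    """Return (hits, total_words, percent_hits) for one zone's text."""
    words = re.findall(r"[A-Za-z]+", text.lower())
    hits = sum(1 for w in words if w in word_list)
    percent = 100.0 * hits / len(words) if words else 0.0
    return hits, len(words), percent


hits, total, percent = word_match_features(
    "Department of Biology, University of Utopia", AFFILIATION_WORDS
)
```

For this made-up zone text, two of the six words match, so a rule could treat the zone as a stronger affiliation candidate than a zone with no matches.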

Word matching relies on search structures such as hash tables, binary search trees, digital search trees, and ternary search trees. We chose the ternary search tree because it combines the time efficiency of digital search trees with the space efficiency of binary search trees, and because it supports advanced searches such as partial-match and near-neighbor search. Proposed by Bentley and Sedgewick in 1997, this technique has been used for several years for searching English dictionaries in a commercial OCR system built at Bell Labs [56].
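
For readers unfamiliar with the structure, here is a minimal ternary search tree with exact lookup and partial-match search. It is a generic textbook-style sketch, not the MARS implementation; the class and method names and the use of "." as the wildcard are assumptions.

```python
class _Node:
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False  # True if a stored word ends at this node


class TernarySearchTree:
    def __init__(self):
        self._root = None

    def insert(self, word):
        if word:
            self._root = self._insert(self._root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = _Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_end = True
        return node

    def contains(self, word):
        node, i = self._root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            elif i + 1 < len(word):
                node, i = node.eq, i + 1
            else:
                return node.is_end
        return False

    def partial_match(self, pattern, wildcard="."):
        """Return stored words matching pattern; wildcard matches any one character."""
        results = []
        if pattern:
            self._partial(self._root, pattern, 0, "", wildcard, results)
        return sorted(results)

    def _partial(self, node, pattern, i, prefix, wildcard, results):
        if node is None:
            return
        ch = pattern[i]
        if ch == wildcard or ch < node.ch:
            self._partial(node.lo, pattern, i, prefix, wildcard, results)
        if ch == wildcard or ch == node.ch:
            if i + 1 == len(pattern):
                if node.is_end:
                    results.append(prefix + node.ch)
            else:
                self._partial(node.eq, pattern, i + 1, prefix + node.ch,
                              wildcard, results)
        if ch == wildcard or ch > node.ch:
            self._partial(node.hi, pattern, i, prefix, wildcard, results)


tree = TernarySearchTree()
for w in ["university", "institute", "department", "mail", "fax", "tel"]:
    tree.insert(w)
```

Each node stores one character and three children (lower, equal, higher), so a lookup compares one character at a time as in a digital search tree, while unused branches cost no storage.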

Table 6.1 Word list tables.

| Table Name | Words in the Table |
|---|---|
| Rubric | Review, Original Article, etc. |
| KeyOfTitle | Study, case, method, etc. |
| Author | Smith, John, Kim, etc. |
| AcademicDegree | Ph.D., MD, RN, etc. |
| Affiliation | University, Department, Institute, etc. |
| Abstract | Abstract, Summary, Background, etc. |
| Structured Abstract | Aim, Result, Conclusion, etc. |
| Keyword | Keyword, Index word, etc. |
| Received | Received, Revised, Accepted, etc. |
| Introduction | Introduction, Introduzione, etc. |
| ExtraDataInAffiliation | Corresponding, Address, To whom, etc. |
| ExtraDataInLowerAffiliation | Mail, fax, tel, etc. |
| Date | January, February, 2000, etc. |
| Publisher | Elsevier, John Wiley, etc. |
| JournalName | Diabetes, endocrinology, etc. |

Table 6.2 Features used in the Automated Labeling module.

| Zone Features | Variable Names |
|---|---|
| Geometric Features: | |
| Zone coordinates | TopCoordinate, BottomCoordinate, LeftCoordinate, RightCoordinate |
| Zone height and width | HeightOfZone, LengthOfZone |
| Median value of height, length and space of lines | MedianLineHeight, MedianLineLength, MedianLineSpace |
| Difference between the bottom and top coordinates of the bottom-most and top-most zone | HeightOfArticle |
| Zone order in sequence of top left edge | ZoneOrder |
| Non-Geometric Features: | |
| Biggest and smallest font sizes in an article | MaximumFontSize, MinimumFontSize |
| Number of text lines | NumberOfLine |
| Number of characters and words | NumberOfCharacter, NumberOfWord |
| Number of capital characters | NumberOfCapitalCharacter |
| Dominant font attribute and font size | FontAttribute, FontSize |
| Confidence of characters | Confidence |
| Number of "M.D.", "Ph.D.", "RN", etc. | NumberOfDegree |
| Number of middle names, "Jr", "Sr", "II", etc. | NumberOfMiddleName |
| Number of city, state, country, school, etc. | NumberOfAffiliation |
| Number of "abstract", "summary", etc. | NumberOfAbstract |
| Number of "keywords", "index words", etc. | NumberOfKeyword |
| Number of "review", "article", etc. | NumberOfHeadtitle |
| Number of "received", "accepted", etc. | NumberOfReceived |
| Percentage of academic degrees per word | PercentOfAcademicDegree |
| Percentage of middle names per word | PercentOfMiddleName |
| Percentage of affiliations per word | PercentOfAffiliation |
| Percentage of capital characters per zone | PercentOfCapitalCharacter |
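
A few of the Table 6.2 features can be computed directly from zone boxes and zone text, as in the sketch below. The function names mirror the table's variable names, but the input formats are assumptions, and so is the choice of denominator for the capital-character percentage (all non-whitespace characters in the zone).

```python
from statistics import median


def height_of_article(zone_boxes):
    """Bottom of the bottom-most zone minus top of the top-most zone.

    zone_boxes: iterable of (left, top, right, bottom) tuples (assumed order).
    """
    boxes = list(zone_boxes)
    return max(b[3] for b in boxes) - min(b[1] for b in boxes)


def median_line_height(line_heights):
    """Median of the line heights in one zone."""
    return median(line_heights)


def percent_of_capital_character(text):
    """Capital letters as a percentage of non-whitespace characters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    capitals = sum(1 for c in chars if c.isupper())
    return 100.0 * capitals / len(chars)
```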

6.1 Definition of layout types

As noted, the MEDLINE database contains bibliographic records from over 4,300 journals. The physical layout of the first page of articles in these journals, and the order in which the five important zones (title, author, upper affiliation, lower affiliation, and abstract) appear on the first page may be used to categorize the zone labeling type for a given journal. Figure 6.2 shows examples of common layout types consisting of a single column, or a combination of single and multiple columns. The numbers in the gray blocks indicate block numbers to help with the definitions of the more common zone labeling types described in Table 6.3.


Figure 6.2 Examples of common journal layout types. (a) Layout type 1; (b) Layout type 11; (c) Layout type 12; (d) Layout type 121; (e) Layout type 122.

The five important zones frequently appear in "first regular" or "second regular" zone order. In the "first regular" zone order, the title is near the top of the page, followed by author, affiliation in the upper part of the page (upper affiliation), and abstract. In the "second regular" zone order, the title is followed by author and abstract, with the affiliation appearing in the lower part of the page.

The zone labeling type for each journal is determined by the journal layout type and the zone order. For example, if the journal pages are of layout type 121 [Figure 6.2(d)], the abstract is in block 2, and the affiliation appears in block 4 (second regular), the zone labeling type is defined as Type 12006. Other labeling types are described in Table 6.3.

Table 6.3 Description of zone labeling types

| Zone Labeling Type | Includes Layout Type(s) | Zone order(s) | Description |
|---|---|---|---|
| Type 10000 | 1, 11, 12, 121, 122 | First regular | Title, author, upper affiliation, and abstract are in block 1. |
| Type 10006 | 11 | Second regular | Title, author, and abstract are in block 1. Lower affiliation is in block 2. |
| | 121 | Second regular | Title, author, and abstract are in block 1. Lower affiliation is in block 4. |
| Type 12000 | 12, 121 | First regular | Title, author, and upper affiliation are in block 1. Abstract is in block 2, and may extend into block 3. |
| | 122 | First regular | Title, author, and upper affiliation are in block 1. Abstract is in block 2. |
| Type 12006 | 121 | Second regular | Title and author are in block 1. Lower affiliation is in block 4. Abstract is in block 2, and may extend into block 3. |
| Type 12200 | 122 | First regular | Title, author, and upper affiliation are in block 1. Abstract is in blocks 2 and 3. |
6.2 Structure of AL module

Figure 6.3 shows the structure of the AL module and its interaction with the MARS database, whose tables contain information on every journal title (identified by ISSN). This information includes layout type, physical size, affiliation location, abstract type, feature type, and feature value. After page images from a particular journal issue are processed by the AZ module, and the journal title (ISSN) is identified in the JournalName table, the AL module retrieves all the relevant information from this table and activates the AL algorithm associated with the zone labeling type. The output of the AL module, the identification of the page zones, is written to the LabelRanking table in the database for further downstream processing.
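
The control flow described here (look up the journal, pick the algorithm for its zone labeling type, store the result) can be sketched as a simple dispatch. Everything in the code below is a stand-in: in-memory dictionaries and lists replace the JournalName and LabelRanking database tables, the ISSN is a placeholder, and the two labeling functions merely assign labels in reading order rather than applying real rules.

```python
def label_first_regular(zone_ids):
    """Placeholder: title, author, upper affiliation, abstract in reading order."""
    return list(zip(zone_ids, ["title", "author", "affiliation", "abstract"]))


def label_second_regular(zone_ids):
    """Placeholder: title, author, abstract, then lower affiliation."""
    return list(zip(zone_ids, ["title", "author", "abstract", "affiliation"]))


# Hypothetical mapping from zone labeling type to labeling algorithm.
AL_ALGORITHMS = {
    10000: label_first_regular,
    12000: label_first_regular,
    12200: label_first_regular,
    10006: label_second_regular,
    12006: label_second_regular,
}


def run_al(issn, zone_ids, journal_table, label_ranking):
    """Look up the journal by ISSN, run its algorithm, and record the labels."""
    info = journal_table.get(issn)
    if info is None:
        raise KeyError(f"ISSN {issn} not found in journal table")
    algorithm = AL_ALGORITHMS[info["zone_labeling_type"]]
    rows = algorithm(zone_ids)
    label_ranking.extend(rows)  # stands in for a write to LabelRanking
    return rows


journal_table = {"0000-0000": {"layout_type": 121, "zone_labeling_type": 12006}}
label_ranking = []
rows = run_al("0000-0000", ["z1", "z2", "z3", "z4"], journal_table, label_ranking)
```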


Figure 6.3 Structure of automated labeling module




URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_6.php
Last updated December 06, 2001
