National Library of Medicine, HTTP://www.nlm.nih.gov Communications Engineering Branch Title Lister Hill National Center for Biomedical Communications, HTTP://www.lhncbc.nlm.nih.gov/
 

CEB Home
CEB Projects
Related Image Processing Work
Publications
Repositories
NHANES
Site Index
Turning The Pages Online: http://archive.nlm.nih.gov/proj/ttp/intro.htm
Use MyMorph document conversion tool to make PDF files http://docmorph.nlm.nih.gov/docmorph/
Medical Article Records GROUNDTRUTH (MARG): http://marg.nlm.nih.gov/index2.asp
MD on Tap: http://mdot.nlm.nih.gov/proj/mdot/mdot.php
AnatQuest: http://anatquest.nlm.nih.gov/

Student Internships

Document Image Analysis and Understanding R&D





A report to the Board of Scientific Counselors
Communications Engineering Branch
Lister Hill National Center for Biomedical Communications
National Library of Medicine

October 2001

Contents

  1. Background
  2. Project objectives
  3. Project significance
  4. Page segmentation
  5. Automated labeling
  6. Automated reformatting
  7. Lexical analysis to improve recognition
    1. Lexical analysis to reduce highlighted words
    2. Lexical analysis to improve recognition of Affiliation
  8. DIAU: advantage
  9. Next tasks
    1. Ground truth data: PathFinder
    2. Data extraction from online journals
    3. Alternative method for text verificatio
  10. References
  11. Questions for the Board
Organization of this report. This report presents our research in document image analysis and understanding. Following the background statement, project objectives, and project significance, Sections 4-6 describe the image analysis in page segmentation, automated labeling and reformatting. Section 7 describes lexical analysis work to improve recognition; Section 8 gives performance data; Section 9 outlines next steps. References to the literature appear in Section 10. Questions for the board appear at the end.

1. Background

Research in document image analysis and understanding (DIAU) has been part of the Communications Engineering Branch's longstanding involvement in document and biomedical imaging from many different standpoints: image capture, storage and retrieval, lossy and lossless compression, image enhancement and other types of image processing. In particular, research in document imaging has been applied to the design and development of prototype systems to serve as testbeds for investigations into electronic archiving and preservation of library materials,1-4 automated interlibrary loan systems,5,6 on-demand document delivery,7,8 and Internet-enabled document delivery and management for the end user.9-12 In addition, we have engaged in several years of related and relevant R&D in document image analysis and understanding.13-19

The aim of document image analysis and understanding is to automatically recognize and extract textual or graphical material from digitized documents as closely as possible to the results of human action. This active research area includes work in table extraction and understanding,20-22 separating textual and graphical regions,23,24 extracting bar charts,25 electrical circuit symbology from circuit diagrams,26 symbols and text labels from geographic maps,27 symbols and connections from engineering drawings,28 zip codes from addressed envelopes,29 mathematical equations and formulas from other text,30 detecting duplicate documents from document image databases (Gulf War Declassification Project),31 and other application domains.

The application we are interested in is the automatic extraction of bibliographic data from biomedical journals indexed in MEDLINE. Our principal targets are the article title, author names, institutional affiliations and the abstract, all items that usually appear on the first page of a journal article. The motivation for this investigation is the promise of labor savings compared to the conventional keyboarding of this data into MEDLINE, and also the more timely availability of this data.32

Most applications based on DIAU research involve multiple processes, starting with the scanning of the document page to produce a bitmapped image, the conversion of the image data to text by optical character recognition (OCR), page segmentation that blocks regions of contiguous text into zones, labeling that classifies (identifies) the zones, and reformatting that organizes the text in the zones in the formats desired by the application. Apart from the manual scanning stage, the four other stages are automatically done by daemon processes to yield the desired information. In addition, in any practical implementation of an image analysis system, verification of the processed data must be done manually, as must the entry of data that are important to the final product (in our case, a bibliographic record) but which cannot be automatically extracted, e.g., because they do not appear on the page that is scanned. DIAU techniques rely on image features directly extracted from the bitmapped images, non-geometric data from the OCR system and lexical analyses using string or word pattern matching. Our work exploits all of these approaches.

In this report we focus on the techniques for page segmentation (automated zoning), automated labeling and reformatting. We also discuss the lexical analysis research that improves the recognition of words incorrectly detected by the OCR system. In addition, a comparison of level of effort between a completely manual approach to capturing bibliographic data and ours in which several stages are automated appears in Section 8. This is followed by an outline of next steps in this project. References to the literature appear at the end.

Based on our research we have designed, developed and built a system code-named MARS for Medical Article Records System. In addition to its role as a testbed providing data for research, MARS is an operational production system delivering one-third of the NLM's total bibliographic data requirements. The system is described in a recent 91-page document "Automating the production of bibliographic records for MEDLINE" accessible on our website33. This document gives a comprehensive account of: the design considerations and implementation details of the automated modules; the design and organization of the central database that controls the workflow, and which stores all data (into and out of the various processes); the evaluation criteria and testing that led to the selection of the OCR system; the design and implementation of the operator-controlled workstations for scanning, editing, reconciling (verification), and necessary tools for production control and supervision; the design of research tools that aid the design of our algorithms; and a detailed breakdown of performance statistics.

2. Project objectives

The objectives of this project are to:

  1. research and apply document image analysis and understanding techniques to the problem of automating the extraction of bibliographic information from biomedical journals;
  2. build a practical production system for this purpose, and use it as an experimental testbed to conduct research in document image analysis and understanding techniques, to identify opportunities for improved performance, and to optimize the performance of manual processes inevitable in a practical system;
  3. redesign and modify subsystems, or create new subsystems, to achieve improved performance;
  4. provide ground truth data to enable the computer science and informatics research communities to conduct further image analysis research toward the development of algorithms, tools and techniques for automated data extraction and other applications.

3. Project significance

Document image analysis and understanding is key to automated data extraction from digitized documents in many application domains, as noted in Section 1. At NLM, the application of immediate interest is the automated extraction of bibliographic data from digitized pages of biomedical journals indexed for MEDLINE. There are two principal reasons for this application: first, the gradual rise of labor costs; and second, the unrelenting increase in the amount of data that needs to be entered into databases from paper-based information. This is as true for MEDLINE as for the hundreds of databases produced in every discipline that rely on laborious keyboard entry of bibliographic information from articles in journals, e.g., article title, author names, institutions, abstract, dates, page numbers, etc. Image analysis and understanding techniques provide the basis for the development of automated systems that promise a cheaper alternative to keyboarding, and a more timely availability of bibliographic data for the public.

4. Page segmentation

In most DIAU techniques for automated text extraction, the first step after the conversion of the bitmapped image by the OCR system is page segmentation: to block out ("zone") the regions of contiguous text, which in our application are those text groups corresponding to the bibliographic fields of highest interest to us: viz., article title, authors, affiliation, and abstract.

Much of the research reported in the literature employ methods analogous to those used to isolate and separate characters (symbol isolation) to segment page images into zones. A brief survey of automatic zoning methods is given in Jain34. Approaches include "top-down" methods35, which segment a page by x-cuts and y-cuts into smaller regions, "bottom-up" methods36,37, which recursively grow homogeneous regions from small components, and combinations of both34,38. Tradeoff factors include: granularity (finding small enough zones), computation time, and sensitivity to input parameters such as noise, skew and page orientation.39-41 Top-down methods tend to be faster and less sensitive to input parameters and page orientation, but require pages to have a "Manhattan layout", i.e., the blocks may be separated by vertical and horizontal lines. Bottom-up and combination methods often result in greater accuracy at the expense of computational complexity and sensitivity to input parameters. All of these methods zone the page using image data alone, prior to OCR conversion. Since the reported performance is variable, and because rich secondary data is available from our OCR system, our approach, in contrast, is to exploit the output data of the OCR system to implement automatic zoning.

The commercial OCR system used in the MARS system includes a package to perform automatic zoning, but with inadequate accuracy. The most common error is that zones are too large and include more than one significant text group. Figure 4.1 illustrates a typical case in which the title, author, abstract and affiliation are all in one zone, along with extraneous publication identification. Figure 4.2 illustrates another case where, in addition to the previous problem, a two-column abstract is grouped inappropriately into a single zone. In this example, the text lines in the two columns are joined, disrupting the proper reading order. For example, the middle text of the first line of the zone is incorrectly read as "..models have opment of..."

Figure 4.1 An example of large zones generated by the commercial OCR system. Figure 4.2 A second example of large zones generated by the commercial OCR system.

Correct zones are critical to downstream processes in systems based on DIAU. In our application, the stage that follows automated zoning is the automated labeling of the zones as title, authors, affiliation and abstract.42 This complex labeling process uses several items of information in each zone to determine its identity. Information used to label a zone include absolute and relative location of the zone, and key words within the zone. Clearly, the zone region must be correct if it is to provide useful information to the labeling algorithms.

Downstream from automatic zoning and labeling, there is often a requirement for the text to be reformatted to comply with syntactic rules (in our case, MEDLINE conventions).59 This process also depends on correctly sized and labeled zones to be effective. Incorrect zones confound reformatting, ultimately requiring time-consuming manual intervention when the captured text is verified, thereby offsetting the advantage expected from an automated system.

Since we cannot depend on the commercial OCR system to correctly zone images, and seek to eliminate manual zoning, we developed an automatic zoning capability. However, rather than starting from scratch, we combine the automated zoning capability of the OCR system with our added functionality for zone correction.

4.1 Methods and Procedures

As noted earlier, in addition to ASCII text, the OCR system provides information about each of the converted characters in the output file. This information includes the level of confidence that the character was correctly recognized, character attributes such as italic or bold, character point size, and the x,y coordinates of the rectangle bounding the character (bounding boxes).43 Thus we have both geometric and non-geometric feature information available for each converted character. Our approach is to draw upon these features to group text into correct zones. For example, we use the bounding box coordinates to determine which characters are grouped closely in the same region on the page. Information on character size and attributes provide additional clues for keeping groups of adjacent characters together or placing them in separate zones.

Our zone correction method uses both top-down and bottom-up design strategies,44 used by other investigators solely on image data, on our OCR output (non-image) data. Our procedure is outlined in Table 4.1.

Table 4.1 Zone correction program steps
  Input Function Output
1. Zones and data from OCR system Separate zones into text lines Text lines
2. Text lines Separate lines into fragments Text lines, fragments
3. Lines and line fragments Combine lines vertically into zones Initial zones
4. Initial zones Combine zones horizontally into zones Final zones

Figure 4.3 shows an illustration of the method. The first step in creating new zones is to disassemble the original zones from the OCR system. Each zone is divided into individual text lines. In step 2, lines are further split horizontally into multiple lines when the space between words exceeds a distance threshold (empirically determined). This occasionally results in unnecessarily splitting lines into multiple parts, but is needed in order to split lines that originally span across two closely spaced columns. Some of these lines will be rejoined in later steps. The bounding box enclosing each line is computed, as are several features such as percent italic characters and average character height. Some character features, such as bold or italic, are available directly from the OCR output data. Others, such as character height or case (upper or lower), are computed from the OCR output data.

Figure 4.3 Zone correction program steps

In step 3, we combine the lines vertically into initial zones. The criteria for combining are that (a) the vertical distance between lines must be less than a threshold (again, empirically determined); (b) either the left edge, right edge or midpoint must be horizontally aligned; and (c) the features computed in the previous step must be similar. When a line is added to a zone, the zone's rectangular boundary is expanded to include the new line. Then all remaining lines are checked to see if they fall within the new zone. If so, they are added to the zone. Many of the horizontally split lines are recombined in this way.

On rare occasion, some zones created in step 3 are too narrow. In this event, the fourth and last step is to combine such zones horizontally using criteria similar to those in the previous step. Here, the initial zones are combined if (a) the horizontal distance between the zones is less than a threshold; (b) either the top or bottom edges of the zones are vertically aligned; and (c) the computed features of the two zones are similar. When zones are thus merged, a new zone boundary rectangle is created to include both zones. Any smaller zones that fall within the rectangle are subsumed within this zone.

Figures 4.4 and 4.5 show the results of these steps applied to the two images used as examples in Figures 4.1 and 4.2. In both of these images, the title, author, affiliation and abstract are enclosed in separate zones, as required. In addition, in Figure 4.5, the two columns of the abstract are in separate zones. These two zones will be identified as abstract by the automated labeling process, which follows the zone correction process, and the enclosed text will be organized in the proper reading order.

Figure 4.4 Correct zones, generated by the zone correction algorithm. Figure 4.5 Another example of zones generated by the zone correction algorithm.
Table 4.2 Results of zone correction for 295 pages from 59 journal issues
Field Error Type
  split too big too small merged totals % images with an error in this field
Title 7       7 2.4
Author 1     4 5 1.7
Affiliation 4     5 9 3.1
Abstract 3     1 4 1.4
totals 15     10 25  
% images with this error 5.1 0 0 3.4    

5. Automated labeling

The step following page segmentation is to label the zones, i.e., identify each zone as one of the bibliographic fields of interest. The figure below shows the sequence of steps: the bitmapped TIFF image of the scanned page, the output of the automated zoning module (AZ) and the output of the automated labeling module (AL).

Figure 5.1 (a) Bitmapped page image (b) AZ output, and (c) AL output

(a)

(b)

(c)

Image analysis techniques for document labeling proposed in the literature45-47 are based mostly on the layout (geometric) structure and/or the logical structure of a document. Hones et al.45 describe an algorithm for layout extraction of mixed-mode documents, and the classification of these documents as text or non-text. Taylor et al.46 describe a prototype system using a feature extraction and model-based approach. Tsujimoto et al.47 present a rule-based technique based on the transformation from a geometric structure to a logical structure. Tateisi et al.48 propose a method based on stochastic syntactic analysis to extract the logical structure of a printed document. They use simple rules to label documents into three classes. Niyogi et al.49 use a rule-based system to label newspaper contents into thirteen labels such as headline, text paragraph, photograph, and so on. These labeling techniques rely mostly on rule-based algorithms, but other mechanisms such as artificial neural networks (ANN) and decision trees are also investigated.

One drawback to ANN is that training is required as a pre-processing stage. That is, the algorithms need to be re-trained whenever a new document (in our case, a journal layout not seen previously) is encountered, and the training time is proportional to the number of journal titles to be processed. Not only is this time consuming, it also makes it difficult for exceptional situations to be handled quickly. In addition, these techniques pose difficulties in readily using geometric information, e.g., the geometry between zones. Rule-based algorithms, on the other hand, do not need re-training, can employ geometric information readily, and moreover, can accommodate exceptional cases (slight divergence from a known layout type) by the addition of new rules. Since the 4,300+ journal titles indexed in MEDLINE exhibit a wide range of layout types, such exceptional cases can occur frequently. An automated labeling system needs to handle a multiplicity of layout types and exceptional cases quickly, and without extensive pre-processing and training.

In this section we report on our research focusing on three approaches: the rule-based algorithmic approach (Sections 5.1-4), an ANN method (Section 5.5), and a template-matching technique (Section 5.6). Our experiments and findings are also reported in the literature.50-52 Based on these experiments, we decided to implement our labeling system on rule-based algorithms since this approach delivered a high accuracy rate, high speed of execution, and furthermore was amenable to modification as new layout types were added. The structure of the AL module and its interaction with tables in the MARS database appear in our extended report.33

All three of our techniques rely on data from the OCR system which delivers information at the zone, line and character level:

Zone level Zone boundaries, number of text lines
Line level Line boundaries, number of characters, average character height
Character level 8-bit character code, confidence level (1= lowest, 9 = highest), bounding box, font size, font attribute (normal, bold, underlined, italics, superscript, subscript, and fixed pitch)

The OCR output data is used to generate geometric and non-geometric features that, in turn, are used to create rules. Geometric features are based on a zone's location, order of appearance, and dimensions. For example, the article title zone is usually located in the top half of the page, followed by author, affiliation and abstract, in that order.

Non-geometric features are derived from the text contents of a zone, aggregate statistics, and font characteristics. For example, some zones can be characterized by the words in them, and the frequency with which they occur. In such cases, word matching is an important technique to generate non-geometric features in the AL module. For example, a zone has a higher probability of being labeled as "affiliation" when it has words representing country, city and school names. Also, a zone positioned between the words "abstract" and "keywords" is more likely to be an abstract than any other bibliographic field. Fifteen database tables containing word lists have been assembled as shown in Table 5.1. Table 5.2 shows examples of geometric and non-geometric features.

Word matching relies on search algorithms such as hash tables, binary search tree, digital search tree, ternary search tree, etc. We chose the ternary search tree53 on account of its ability to yield both the time efficiency of the digital search tree and the space efficiency of binary search trees, and its ability to perform advanced searches such as partial-matching and near-neighbor search.

Table 5.1 Word list tables
Table Name Words in the Table
Rubric Review, Orginal Article, etc.
KeyOfTitle Study, case, method, etc.
Author Smith, John, Kim, etc.
AcademicDegree Ph.D., MD, RN, etc.
Affiliation University, Department, Institute, etc.
Abstract Abstract, Summary, Background, etc.
Structured Abstract Aim, Result, Conclusion, etc.
Keyword Keyword, Index word, etc.
Received Received, Revised, Accepted, etc.
Introduction Introduction, Introduzione, etc.
ExtraDataInAffiliation Corresponding, Address, To whom, etc.
ExtraDataInLowerAffiliation Mail, fax, tel, etc.
Date January, February, 2000, etc.
Publisher Elsevier, John Wiley, etc.
JournalName Diabetes, endocrinology, etc.


Table 5.2 Features used in automated labeling
Zone Features Variable Names
Geometric Features:
Zone coordinates TopCoordinate, BottomCoordinate, LeftCoordinate, RightCoordinate
Zone height and width HeightOfZone, LengthOfZone
Median value of height, length and space of lines MedianLineHeight, MedianLineLength, MedianLineSpace
Difference between the bottom and top coordinates of the bottom-most and top-most zone HeightOfArticle
Zone order in sequence of top left edge ZoneOrder
Non-Geometric Features:
Biggest and smallest font sizes in an article MaximumFontSize, MinimumFontSize
Number of text lines NumberOfLine
Number of characters and words NumberOfCharacter, NumberOfWord
Number of capital characters NumberOfCapitalCharacter
Dominant font attribute and font size FontAttribute, FontSize
Confidence of characters Confidence
Number of "M.D.", "Ph.D.", "RN", etc. NumberOfDegree
Number of middle names, "Jr", "Sr", "II", etc. NumberOfMiddleName
Number of city, state, country, school, etc. NumberOfAffiliation
Number of "abstract", "summary", etc. NumberOfAbstract
Number of "keywords", "index words", etc. NumberOfKeyword
Number of "review", "article", etc. NumberOfHeadtitle
Number of "received", "accepted", etc. NumberOfReceived
Number of "Introduction", "Introduzione", etc. NumberOfIntroduction
Percentage of academic degrees per word PercentOfAcademicDegree
Percentage of middle names per word PercentOfMiddleName
Percentage of affiliations per word PercentOfAffiliation
Percentage of capital characters per zone PercentOfCapitalCharacter

5.1 Definition of layout types

As noted, the MEDLINE database contains bibliographic records from over 4,300 journals. The physical layout of the first page of articles in these journals, and the order in which the five important zones (title, author, upper affiliation, lower affiliation, and abstract) appear on the first page may be used to categorize the zone labeling type for a given journal. Figure 5.2 shows examples of common layout types consisting of a single column, or a combination of single and multiple columns. The numbers in the gray blocks indicate block numbers to help with the definitions of the more common zone labeling types described in Table 5.3.


Figure 5.2 Examples of common journal layout types. (a) Layout type 1; (b) Layout type 11; (c) Layout type 12; (d) Layout type 121; (e) Layout type 122.

The five important zones frequently appear in "first regular" or "second regular" zone order. In the "first regular" zone order, the title is near the top of the page, followed by author, affiliation in the upper part of the page (upper affiliation), and abstract. In the "second regular" zone order, the title is followed by author and abstract, with the affiliation appearing in the lower part of the page.

The zone labeling type for each journal is determined by the journal layout type and the zone order. For example, if the journal pages are of layout type 121 [Figure 5.2(d)] and the affiliation appears in block 4 (second regular), the zone labeling type is defined as Type 12006. Other labeling types are described in Table 5.3.

Table 5.3 Description of zone labeling types
Zone Labeling Type Includes Layout Type(s) Zone order(s) Description
Type 10000 1,11,12,121, 122 First regular Title, author, upper affiliation, and abstract are in block 1.
Type 10006 11 Second regular Title, author, and abstract are in block 1. Lower affiliation is in block 2.
  121 Second regular Title, author, and abstract are in block 1. Lower affiliation is in block 4.
Type 12000 12, 121 First regular Title, author, upper affiliation are in block 1. Abstract is in block 2, and may extend into block 3.
  122 First regular. Title, author, upper affiliation are in block 1. Abstract is in block 2
Type 12006 121 Second regular Title and author is in block 1. Lower affiliation is in block 4. Abstract is in block 2, and may extend into block 3.
Type 12200 122 First regular Title, author, upper affiliation is in block 1. Abstract is in block 2 and 3.

5.2 Rule-based algorithms

While all contiguous text regions on a page image are zoned, the zones of interest are the article title, author, affiliation and abstract. Since affiliation information could reside in the top part of the page as well as at the bottom, for labeling purposes, we define an "upper affiliation" and a "lower affiliation" zone. Hence, we have five possible labels. The remaining zones are labeled as "other". For each label type, there are four types of rules as shown in Table 5.4: rule types 1, 2 and 3 that are different for each label classification, and rule type 4 that is the same for all. Our rule-based algorithm consists of four steps.

In the first step, a probability of correct identification (PCI) is used in rule type 1. Every zone has five PCIs, one for each label. A PCI is equivalent to the probability of a zone possessing a particular label. The PCIs are derived empirically. For example, in the case of upper affiliation, when more than 30% of words in a zone belong to the affiliation word list, the PCI of upper affiliation is 100. Otherwise, PCI is equal to PercentOfAffiliation * 100/30. In case of author, when more than 28% of words in a zone belong to the list of middle names and academic degrees, the PCI of author is 100. Otherwise, PCI = (PercentOfAcademicDegree + PercentOfMiddleName) * 100/28. In this first step, when a zone has the highest PCI for a particular label, it is assigned that label. The PCI thresholds of 30% and 28% for affiliation and author respectively are established heuristically. In the case of author, we often find there are two authors in an author zone, each author name usually consists of three words, and "and" is located between the author names. We find that there are middle initials and academic degrees associated with author names. So, a zone is likely to be labeled as author when the ratio of the sum of academic degrees and middle initials to the total number of words in the zone exceeds 2/7 or 28.6%. In the case of affiliations, it has been determined that a zone is likely to be labeled as affiliation when 30% of the words belong to the affiliation word list.

In the second step, the labeling results from step 1 are rechecked by rule type 4. For example, when two zones are both labeled as author but one of those zones is located between title and upper affiliation, and the other is located between upper affiliation and abstract, the latter is removed from the author category.

In the third step, in addition to rule type 2, rule types 1 and 4 are applied again to make sure that at least one zone is labeled as title, author, abstract, upper affiliation or lower affiliation. For example, when a zone initially labeled as author does not contain information relevant to author (NumberOfMiddleName=0 and NumberOfAcademicDegree = 0), its location is then used to do the labeling. That is, its label as author is verified by the facts that (a) it does not contain information related to title or upper affiliation zones, and (b) it is located between title and upper affiliation zones.

In the fourth step, problems caused by zoning errors such as a zone split into multiple zones are handled by all rules, and any remaining unlabeled zones are labeled.

120 rules were generated for zone labeling types 10000, 10006, 12000, 12006, and 12200, and an example of detailed rules to detect upper affiliation is shown Table 5.5.

Table 5.4 Rule types used in automated labeling
Rule Type Description
1 Use Probability of Correct Identification (PCI). Each label has its own PCI equation. Example: When a zone has a high PCI for a label zone (PCI >= 100), the zone is assigned as the label zone.
2 When a label does not have any zone, which has PCI >= 100, pick a zone, which has the highest PCI for the label, and assign the zone as the label.
3 Some features should be similar within the same label zones. I.e., when a title zone is divided into two separate zones, the two zones have similar FontSize, FontAttribute, MedianLineHeight, etc.
4 TopCoordinate of title < TopCoordinate of author < TopCoordinate of Upper affiliation < TopCoordinate of abstract author < TopCoordinate of Lower affiliation
Table 5.5 Example of rules to detect Upper Affiliation
Rule Type Rule Description
1 1. TopCoordinate < HeightOfArticle /2
2. BottomCoordinate < HeightOfArticle * 3/4
3. NumberOfWord >= 2
4. NumberOfAcademicDegree < 3 or
PercentOfAcademicDegree < 30
5. NumberOfMiddlename < 3 or
PercentOfMiddleName < 30
6. PercentOfCapitalCharacter < 50
7. NumberOfHeadtitle == NumberOfAbstract == 0
NumberOfIntroduction==0
8. If all of above conditions are satisfied {
   If ( NumberOfAffiliation >= 2 ) {
    If ( PercentOfAffiliation >= 30 )   PCI =100;
     Else   PCI = PercentOfAffiliation * 100/30;
   }
   Else {
    If ( PercentOfAffiliation >= 30 ) PCI =50;
    Else   PCI = PercentOfAffiliation * 50/30;
   }
  }
  Else {
   PCI = 0
}
2 If ( PCI < 100 ), pick a zone having the highest PCI for upper affiliation.
3 1. If ( PCI > 25 and the next zone has NumberOfReceived ==1 )
PCI = 100.
2. Distance from a zone to upper affiliation zone is smaller than any other label zones.
3.FontSize, FontAttribute, MedianOfLineHeight, and MedianOfLineSpace of a zone must be similar to upper affiliation zone.
4 TopCoordinate of title < TopCoordinate of author < TopCoordinate of affiliation < TopCoordinate of abstract

5.3 Evaluation of rule-based automated labeling

Currently the AL module can reliably process 2,027 journal titles from the 4,300+ titles indexed in MEDLINE. Since NLM receives bibliographic data for 580 of these directly from publishers, the actual number of titles that may be processed by MARS is 1,447.

In Table 5.6 we show performance data for the month of February 2001 for 159 journal issues containing 2,524 articles processed by MARS. There were 101, 10, 37 and 11 journal issues in zone labeling types 10000, 10006, 12000, and 12200 respectively.

The data shows that 0.4% of the labeling errors is due to incorrect OCR output and 0.63% is due to poor zoning (AZ). The error rate attributed to the AL module itself is 0.20% when OCR and AZ are correct. The reason for the high error rate in the affiliation field is that text in this field is small sized and are frequently italicized, both factors contributing to poor detection by the OCR system. In overall performance, the AL module delivers an accuracy of 98.77%.

Table 5.6 Rule-based automated labeling performance
Error Type Label Field
  Title Author Affiliation Abstract Totals % of Error
Bad OCR 0 1 9 0 10 0.40
Automated Zoning (AZ) 2 8 6 0 16 0.63
Automated Labeling (AL) 1 3 1 0 5 0.20
Totals 3 12 16 0 31 1.23
% of Error 0.12 0.48 0.63 0 1.23  

5.4 Ongoing research in rule-based approach

As mentioned earlier, we used empirical methods to derive thresholds for the probability of correct identification (PCI) for each label, such as 28% and 30% of special word lists for PCI thresholds for author and affiliation. We plan to refine these figures by using statistical data, i.e., create histograms of every word list collected from the journals processed by MARS for each label zone, and select thresholds based on these histograms.

Our objective is to accommodate all journal titles indexed in MEDLINE, but we find that a number of these do not follow the relatively regular layout types that the system can process at present. Figure 5.3 shows examples of these irregular layouts. One approach to dealing with these irregular layouts is to develop a template matching algorithm based on the average font size and the average top-left and bottom-right coordinates of all important zones. These features will be stored in the database in a journal-specific manner. When a journal issue with irregular layout is processed, the AL module will read the zone coordinates and the font size of the text in the zone, and match them against the stored information.

(a) (b) (c)
Figure 5.3 Examples of irregular layout. (a) Abstract is on the left of the title. (b) Author and affiliation on the right of the title. (c) Author is on the left of the title.

5.5 Automated labeling: artificial neural network approach

In this section, we describe research toward automated labeling using integrated image and OCR processing, and a back-propagation artificial neural network. Basically, features are taken or calculated from OCR output and, after normalization, fed into the input layer of a neural net for label classification. Experimental results on a sample size of several thousand images of medical journal pages show that the system is capable of labeling text zones at a classification accuracy of 98%.

5.5.1 Zone features

Features for this labeling technique are based on page layout analysis for each journal, and generic typesetting knowledge for English text.54 Sixteen geometric and non-geometric features are considered here, as shown below:

Geometric Features:  
  Zone coordinates (left and top)  
  Zone dimensions (height and width)  
  Zone centroids (X and Y)  
  Zone area  
Non-Geometric Features:  
  Total characters Average font size
  Total boldface characters Total italic characters
  Total superscript characters Total subscript characters
  Total periods Total commas
   Total semicolons  

5.5.2 Method

To use a neural network model as a pattern classifier, its structure has to be designed, and the network trained. Here, we discuss the selection of training and testing data sets, a method to train and test the neural network, and the neural network structure design.

A back-propagation ANN for each journal type is designed, trained, and tested with journal-specific data. For each journal type, ("type" based on page layout and style characteristics that differ from one journal to another), a group of several journal issues is selected to create the training and test sets. For purposes of generalization, the cross-validation technique is used to divide these data sets. The training data set is used to design the ANN while the testing data set is used to estimate the classification accuracy. In addition to the labels of interest to us (title, author, affiliation, and abstract) all other zones are labeled as "others."

Training and testing data sets

Since each journal has its own page layout and style setting, we create a neural network for each journal type. For each type, a neural network is designed, trained, and tested with its own data. A group of at least four journal issues is selected to create the training and data sets for each journal type. Twenty-five different journal types consisting of 107 issues were selected for the experiment for a total of 2,948 binary images.

Cross-Validation method

For purposes of generalization, the cross-validation (CV) technique55 is used by randomly dividing the training data set into five data groups of which four data groups constitute a "CV-train" set and the one remaining group is considered the "CV-test" set. As a result, there are five pairs of a CV-train set and a CV-test set that are used to train and test the back-propagation neural network. Each neural network is trained and tested with each pair of a CV-train set and a CV-test set. The modified weights corresponding to the winning pair of a CV-train set and a CV-test set, the one yielding the highest classification accuracy, are chosen to be the final weights for the neural network.

Back-Propagation neural network

Back-propagation (BP)54-56 is a multi-layer ANN using sigmoidal activation functions. The network consists of an input layer, hidden layers, and an output layer, and nodes in each layer are fully connected to those in the adjacent layers. Each connection is associated with a synaptic weight. The BP network is trained by supervised learning, using a gradient descent method, which is based on the least squared error between the desired and the actual response of the network.

We implemented a two-layer BP network with an input layer of sixteen text zone features, a five output layer (title, author, affiliation, abstract, and others), and a single hidden layer in which there are 8 nodes. The two-layer BP network architecture may therefore be characterized as 16-8-5. Each input vector of the training data set is presented to the network multiple times and the weights are adjusted on each presentation to improve the network's performance until maximum performance is achieved. Two learning factors that significantly affect convergence speed, as well as avoid local minima, are the learning rate and the momentum. The learning rate determines the portion of weight that needs to be adjusted. Even though a small learning rate guarantees a true gradient descent,57 it slows down the network convergence process. The momentum determines the fraction of the previous weight adjustment that is added to the current weight adjustment. It accelerates the network convergence process. During the training process, the learning rate was adjusted to bring the network out of either its local minima (where the network has converged but its output error is still large) or its no-gain mode (the network mode in which its output error does not change, or changes very little over many cycles). The learning rate ranges from 0.001 to 0.1, and the momentum is 0.6.

5.5.3 Results

The BP neural network was trained with all five pairs of a CV-train set and a CV-test set. The average training time for each pair was about 8 hours. The network configuration associated with the winning pair was evaluated on the testing data set. The result was that the average classification accuracy on the testing data set was about 98.0%. Errors were due to inaccurate segmentation: for example, a zone of interest (such as title zone) split into multiple zones, as well as several different zones (such as author and affiliation zones) merged into a single zone.

5.5.4 Summary and conclusions

Automated labeling using a back-propagation neural network showed encouraging performance for 25 different journal types, and showed the possibility of extension to other journals. The label classification time is fast and the results are stable regardless of journal type. However, there are drawbacks as well. For example, it is difficult to use the geometric relations between labels as features, and it is time consuming to train the module and tune its learning parameters. It is also hard to analyze wrong labeling results. The most serious drawback is that the entire neural network must be retrained for new types of journals. We propose to continue this investigation with other ANN paradigms.

5.6 Automated labeling: template matching with page normalization

Observing that all first pages of articles in any journal follow the same general layout, we consider a template matching approach to label the zones. Simple template matching is unlikely to be successful because of geometric variability, but matching combined with page normalization proves to be a viable approach. Preliminary evaluation results using a sample size of several hundred images of biomedical journal pages show that our approach is capable of a label classification accuracy exceeding 96%.

In addition to geometric and content-based features derived from geometric zone information and zone contents respectively, we propose a new feature called "single and multiple column zone vertical area string pattern" to normalize the page images. After normalizing the pages, zone features are calculated, and then used to create several predefined types of vertical area string patterns for the entire page. Finally, zone features and vertical area string patterns are input into a template matching system for the final decision on label classification.

5.6.1 String patterns

Basic definitions necessary to describe our algorithms are given here.

Single and multiple column zone vertical areas

A single column zone vertical area of a binary image is defined as a vertical area in which only one text zone exists. A multiple column zone vertical area of a binary image is a vertical area in which more than one zone exists, and these zones are "vertically overlapped". Two zones are considered to be vertically overlapped if the top and/or the bottom coordinates of one zone are within the top and the bottom coordinates of another zone. Figure 5.4 shows an example of the single and multiple column zone vertical areas.

Vertical area string patterns

Let "M", "S", and "*" be the vertical areas of multiple column zone, single column zone, and empty line spaces. Let "C", "L", "R", "l", and "r" be the zone location features: Center, Left, Right, Left of Center, and Right of Center. Let "N", "Y" and "+" be No, Yes, and "Don't Care" respectively. There are two types of patterns: geometry-based and content-based patterns that are defined as follows.

Geometry-based "single and multiple column zone" vertical area string pattern

The "single and multiple column zone" vertical area string pattern is the combination of characters "M", "S", and "*" that represent the top-to-bottom vertical areas of a binary image. An example of this type of string pattern is shown in Figure 5.4(1) as "M*S*S*S*S*S*S*M*S".

Geometry-based "zone location" vertical area string pattern

The "zone location" vertical area string pattern is the combination of characters "C", "L", "R", "l", "r", and "+" that represent the relative location of a zone against the vertical middle line of a page. The following logic is used to determine the zone location in a page.

If | Zone Vertical Middle Line - Page Vertical Middle Line | is less than or equal to CENTER_THRESHOLD
   Zone is center "C".
Else if Zone Vertical Middle Line is less than Page Vertical Middle Line and Zone Right Coordinate is greater than Page Vertical Middle Line
   Zone is left "L".
Else if Zone Vertical Middle Line is greater than Page Vertical Middle Line and Zone Left Coordinate is less than Page Vertical Middle Line
   Zone is right "R".
Else if Zone Vertical Middle Line is less than Page Vertical Middle Line and Zone Right Coordinate is less than or equal to Page Vertical Middle Line
   Zone is left of center "l".
Else if Zone Vertical Middle Line is greater than Page Vertical Middle Line and Zone Left Coordinate is greater than or equal to Page Vertical Middle Line
   Zone is right of center "r".
End if

The CENTER_THRESHOLD is selected to be about two 12-point characters and for 300 dot per inch document, its value is about 100 pixels. The "zone location" vertical area string pattern of an image shown in Figure 5.4(2) is "++C+C+C+C+C+L+++C".

Content-based "single and multiple text lines zone" vertical area string pattern

The "single and multiple text lines zone" vertical area string pattern is the combination of characters "Y", "N", and "+". "Y" characters are for zones having more than one text line and "N" characters are for one text line zones. The "single and multiple text lines zone" vertical area string pattern of an image shown in Figure 5.4(3) is "Y+Y+Y+Y+N+Y+N+Y+N".

Content-based "Nth order font size zone" vertical area string pattern

The "Nth order font size zone" vertical area string pattern is the combination of characters "Y", "N", and "+". The smaller the order, the larger the font size. "Y" characters are for zones of which font sizes are categorized as Nth order and "N" characters are for zones not having Nth order font size. Examples of the "1st and 2nd order font size zone" vertical area string patterns are shown in Figures 5.4(4) and (5) as "N+Y+N+N+N+N+N+N+N" and "N+N+Y+N+N+N+N+N+N" respectively.

Content-based "Nth order percentage of capital characters zone" vertical area string pattern

The definition of "Nth order percentage of capital characters zone" vertical area string patterns is similar to the definition presented in the previous section. The difference is that the percentage of capital characters compare to total characters of a zone is used instead of the font size. The smaller the order, the larger the percentage. Figures 5.4(6)and (7) show the "1st and 2nd order percentage of capital characters zone" vertical area string patterns as "N+Y+N+N+N+N+N+N+N" and "N+N+Y+N+N+N+N+N+N" respectively.

5.6.2 Zone features

Features calculated for this technique are based on an analysis of the page layout for each journal. Here are 16 features for each zone and 2 features for the entire page:

Geometry-based zone features:  
   Zone coordinates (left and top)  
   Zone dimensions (height and width)  
   Zone location (center, left, right, left and right of center)  
Content-based zone features:  
  Zone content Total text lines
  Total characters Total capital characters
  Total punctuation marks Average font size
  Average line spacing  
Geometry-based page features:  
   Page content frame coordinates (left and right)  

The page content left/right frame coordinate is defined as the left-most/right-most coordinate of text zones in an image page.

Both geometric and content-based zone features are used to create several predefined types of vertical area string patterns for the entire page. These zone features and vertical area string patterns are then input into a template matching system for label classification.

The purpose of creating vertical area string patterns, especially the "single and multiple column zones" pattern, is to normalize the document image page. Generally, the number of text lines in a labeled zone such as title, author, affiliation, or abstract is different from one article to another in a journal issue and therefore the labeled zone coordinates of one article may not be the same as those of another article. As a result, using the same document style guide, the geometric page layout of one article may not be the same as that of another article in the same journal issue. In order to overcome this problem of irregularity, we propose a new feature called "single and multiple column zone vertical area string pattern" that will be used to normalize the page images.

As defined in section 5.6.1, the "single and multiple column zone" vertical area string pattern consisting of characters "M", "S", and "*" can be created by identifying vertical areas having single or multiple column zones from the top of a binary image to the bottom. Using this feature, we could have the same vertical area string patterns for document pages that use the same document style guide. Figures 5.4 and 5.5 show an example of two binary images having zone contents with different number of text lines but sharing the same "single and multiple column zones" vertical area string patterns.

5.6.3 Template matching algorithm

The template matching algorithm matches vertical area string patterns of a binary image to those of predefined layout document structures of a given journal type to derive two types of similarity classification features: degrees of geometry-based similarity and degrees of content-based similarity. If both similarity measures exceed a predefined weight threshold, the label classification of a predefined article page will be used to label zones of an arbitrary binary image. The following summarizes the template matching.

Set the weight matching to 0
If "single and multiple column zone" patterns are matched, add 100 points to weight matching.
  If "zone location" patterns are matched, add 50 points to weight matching.
  If 1st order font size zones is used and if its patterns are matched, add 10 points to weight matching.
  If 2nd order font size zones is used and if its patterns are matched, add 10 points to weight matching.
  If 1st order percentage of capital characters zones is used and its patterns are matched, add 10 points to weight matching.
  If 2nd order percentage of capital characters zones is used and its patterns are matched, add 10 points to weight matching.
  If the weight matching is at least 170 points
  Label zones using the predefined labels vertical area string patters to handle page layout classification.
  End if
End if

5.6.4 Predefined vertical area string patterns

For each journal type, a small set of article image pages are used to generate predefined vertical area string patterns and each pattern is labeled as title, author, affiliation, abstract, or "other". Since all labeled zones consist of text only, it is reasonable to automate the generation of the predefined vertical area string patterns along with their label classifications by matching the content of zones labeled by the user against that of an image. Let "1", "2", "4", "8", "0" be title, author, affiliation, abstract, and others. An example of predefined vertical area string patterns of an image article shown in Figure 5.4 using up to two orders is as follows (the numbers in parentheses corresponding to the strings in the figure):

"Single and multiple columns zone" (1)
M*S*S*S*S*S*S*M*S
++C+C+C+C+C+L+++C "Zone location" (2)
Y+Y+Y+Y+N+Y+N+Y+N "Single and multiple text lines zone" (3)
N+Y+N+N+N+N+N+N+N "1st order font size zones" (4)
N+N+Y+N+N+N+N+N+N "2nd order font size zones" (5)
N+Y+N+N+N+N+N+N+N "1st order percentage of capital characters zones" (6)
N+N+Y+N+N+N+N+N+N" 2nd order percentage of capital characters zones" (7)
0+1+2+4+0+8+0+0+0 "Label"

5.6.5 Results

Experiments were conducted with binary images selected from several different medical journals. A test sample consisting of 524 article page images from four different journal types was used, and the algorithm correctly classified 503 image pages, giving an accuracy rate of 96%. As in Section 5.5, errors were due to inaccurate segmentation leading to split or merged zones.

5.6.6 Summary and conclusions

The technique provides meaningful labels for article titles, authors, affiliations, and abstract with high accuracy, and showed the possibility of extension to other journals. Our approach using the proposed feature called "single and multiple column zones" pattern can successfully handle pages with the same document style guide but different geometric page layouts.

6. Automated reformatting

In many DIAU-based systems the syntax of the zone contents need to be reformatted to comply with conventions. In our case, we find that the text in the zones labeled as title, author and affiliation rarely conform to the syntactic forms required by MEDLINE's conventions. The research described here is directed to automatically rearranging this text to the desired formats to eliminate later manual correction.

The procedure relies on predefined rules. Rules for the title field retain the capital case for the first letter of the first word, and de-capitalize all the other words with the exception of acronyms. An example: "Medical Management of AIDS Patients" becomes "Medical management of AIDS patients," as required in MEDLINE. Rules for author field take into account characters that delimit authors in a multiple-author list; tokens to be eliminated, such as Ph.D., M.D.; tokens to be converted, such as II to 2nd; and "particles" to be retained, such as "van." For example, the author name appearing on the printed page as Eric S. van Bueron, Ph.D. becomes Van Bueron ES as required in MEDLINE.

For author and title reformatting, our algorithms use a subset of rules from the inclusive set of all rules. The selected rule set and the OCR output text are passed to the reformatting algorithm, and as each rule is applied, the OCR string is modified.

The reformatting strategy for the affiliation field is quite different from the above. The OCR data for an affiliation field could contain many affiliations, since each author may have a different affiliation. This data is often difficult to reformat. One reason is that only the affiliation of the first author is to be retained, in line with MEDLINE conventions. Another reason is that the desired data is spread out over the entire field and not contiguous. For example, in a 30 word affiliation zone, we may only want to retain words 1-8, 12-14, and word 30. Our method involves probability matching of the OCR output text to historical data of ~130,000 unique affiliations. In addition to this processing at the reformat stage, we attempt to improve the recognition of affiliations by lexicon-based methods described in Section 7.

6.1 Reformatting the Author field

Reformatting the author field uses forward chaining rules-based deduction. The reformat module can have many rules defined for a particular field. Each rule has a number of requirements among which are that it must

  • Be associated with a specific journal title (ISSN number);
  • Fall into one of eight categories as listed in Table 6.1. The categories are pre-defined in the reformat module and are required to help in our conflict resolution strategy, which in our case is specificity ordering. Whenever the conditions of one triggering rule is a superset of another rule, the superset rule takes precedence in that it deals with more specific situations. An example of this is shown later.

The third column in Table 6.1 shows the complete reformatted field. Note that a single rule or category does not necessarily complete the reformatting, but may need to be combined to achieve correct reformatting of the author field.

With the eight categories defined, the first step is to define which rules are appropriate for a particular ISSN (journal title), since the printed format varies widely among journals. As an example, in one journal the authors appear as:
Glenn M Ford, MD, John Smith, PhD, and John Glover

This can be difficult to parse with a default set of rules, such as ', and' and ',' so that other rules need to be defined. By defining, in the database, the rules for a specific journal title over a specific period of time we can customize the rules to work for unusual or specific cases.

The above example fails in the default rule set that only has ',' and ', and' as the author delimiters because this would incorrectly identify 'MD' and 'PhD' as author names. To accommodate this journal (and others like it) a high priority rule trigger list was created for author delimiters such as ', MD', ', PhD', 'Mr.', 'Dr.', and other formal titles.

To avoid conflict among rules, each word chain is passed through all the categories recursively until no more rules are triggered. As long as we have an antecedent with consequences we continue to process the word chain. Using the forwarding chaining method, when an "if statement" is observed to match an assertion, the antecedent (i.e., the if statement) is satisfied. When the entire set of if statements are satisfied, the rule is triggered. Each rule that is triggered establishes, in a working memory node, that it was executed. During conflict resolution the reformat module decides which rules take priority over others via specificity ordering. An example would be:
Reduce category executes on 'John Smith II' and makes this 'J S II'
Convert category executes 'John Smith II' and marks Smith as convert pre-word and 'II' to '2nd'.

Our conflict resolution method specifies that the convert category is more specific than the reduce category, thus keeping the word 'Smith' and '2nd'. In addition, the pre-word convert flag in this particular example signals the conflict resolution manager to keep 'Smith', initialize 'J', and append '2nd'. This is possible because we have retained our original text and the converted text. The text did not change and an integrated rule has informed us that the word 'Smith' has remained unchanged, and by examining all words, we deduce that this is the last name.

Example Before/After:
Before - John Smith II
After - Smith J 2nd

At the category level, the conflict resolution strategy is specificity ordering. There is also a conflict resolution strategy within a given category: priority list rule ordering. Rules within a given category are assigned a priority level to avoid conflicts. An example of this is the following list of authors appearing on the printed page:
Glenn Ford, John Smith, and David Wells

We have the following author delimiter rules defined: ',' and ', and'

However, the ',' is assigned priority 1, and the ', and' is assigned a higher priority 2. If we did not give a higher priority to ', and' we could end up with 'and' as part of the author name or create a null value.

In ground truth testing of the author reformat rules system we tested 1,857 authors from OCR data. Of that number, 41 were reformatted incorrectly, for a 97.29% correction rate. Of those 41, all 41 were missing rules defined for a given case. An example of a missing rule is given in the case of an author field that reads:
Glenn M. Ford, Jr., John Smith.

By simply adding the rule [', Jr. ' author delimiter priority 2], and with no changes in code, we achieved 100% correct reformatting in the test set.

6.2 Reformatting the Article Title field

The title field uses the same principles as in the author rules system, but requires fewer rules or categories. Of the eight rule categories required for reformatting authors, only three are needed to reformat titles: Uppercase, Lowercase and First Letter Upper.

6.3 Reformatting the Affiliation field

Institutional affiliations of the authors are reformatted by finding the best match between the OCR text and a list of about 130,000 correctly formatted affiliations obtained from the current production version of MARS. Simple string matching is not promising because of the myriad arrangements in which affiliations can be expressed. Most journals show the affiliations of all authors, but by convention only the affiliation of the first author is entered into MEDLINE. However, the text string corresponding to the first affiliation may be scattered throughout the OCR text for the affiliation field. As an example, when multiple authors are affiliated with different departments within the same institution, the printed affiliation may be "Department A, Department B, Department C, Institution XYZ," while the correct MEDLINE entry is "Department A, Institution XYZ." The problem is further confounded by OCR errors, especially errors in detecting superscripts and subscripts. To find a match, the entire OCR text of the affiliation field is compared with every entry in the list of existing affiliations. A matching score for each of the existing affiliations is calculated on the basis of partial token matches, distance between token matches and customized soundex matching. Tests show that our current version of affiliation reformatting successfully identifies the correct affiliation over 80% of the time when the affiliation is represented in the list. This success rate is expected to improve with parallel efforts to reduce OCR errors and the expansion of the list of affiliations from ongoing production data.

The first step is to read all these unique affiliations into memory and create a ternary search tree53 for each affiliation, after which we create a soundex word list58 for each affiliation.

When a zone is identified at the labeling stage as an affiliation field, the OCR data is first processed through a partial-matching algorithm. Low confidence characters are replaced with wildcards.

Example: Uniuersity. The 'u' is actually a 'v' but the OCR engine assigned it as a 'u' with a low confidence level. The partial match algorithm replaces the 'u' with a '.' signifying that this character is a wildcard, and that any word in our search tree that has the pattern Uni<any letter>ersity is considered to be a match.

The first step is to determine if a word in the affiliation zone matches one in the affiliation list. Ignoring implemented performance optimizations we perform a partial word match for all the words in the OCR list and build up a chain of those words that do match. We also track distances between chains.

Consider the example of trying to find the affiliation "Department of Computer Science, University of Maryland" in the affiliation list. The OCR input string might look like: "Department of Computer Science, Department of Engineering, University of Maryland, Department of Computer Science, Johns Hopkins University."

Since only the first affiliation is to be retained, there is considerable data that is irrelevant. The problem is to retrieve just the data needed. By word chaining we can find chains of words that exist in both the OCR text and in an affiliation zone and then use these to derive weighted probabilities. In this example there is a chain of 4 words that match, followed by 3 that do not match, followed by 3 more that match, and finally 7 that do not. Our probability algorithms compute chain word matches and distances between chained words.

The next step in our process reverses the partial word match. The ~130,000 affiliations are matched to the OCR affiliation.

Using the same example, "Department of Computer Science, University of Maryland" has 7 words and all 7 occur in our OCR word list. It is likely there is another affiliation entry that looks like "Department of Computer Science, University of Delaware". This would give a high match of 6/7 words. By comparing and weighting word matches from OCR to Corrected Affiliation and Corrected Affiliation to OCR, and using information such as the number of words matched, total number of words, chain of words matched, and chain of words unmatched, we arrive at a probability between 0 and 1. Note that partial matching is used to help cover OCR errors that would ruin a literal string pattern matching as the affiliation field is often in a smaller font and is likely to incur higher than normal OCR error rates.

In addition to a partial match search algorithm, a soundex algorithm is used with the addition of OCR substitution. For the example in which 'Uniuersity" has the 'u' as low confidence, a substitution table developed lists of common OCR errors where a u == v == y. All three letters are substituted in the low confidence 'u' position, and if a word matches with a soundex hash it counts as a match.

In our ground truth testing with affiliation zones,23 we found that if the OCR affiliation exists in our affiliation list of 130,000 entries, the probability that the affiliation match is the correct one is 88%. The affiliation reformat module picks the top 5 candidates which are presented for final text verification.

Table 6.1 Categories of Author Reformat Rules
Category Description Example
Particle Name Many names contain "particles" forming an integral part of the family name and possibly bearing significance to the family. A particle is retained as part of the reformatted author name. Etienne du Vivier becomes du Vivier E, where 'du' is a particle and is retained as is and preceding the last name Vivier. The first name is initialized.
Compound Compound family names are preserved in the form given and are often difficult to detect. We use a mix of rules to deduce it as a compound name. Most compound names use a hyphen. Those that don't can often use particle name rules to help preserve the compound name. L.G. Huis in 't Veld becomes Huis in 't Veld LGH.G. Huigbregtse-Meyerink becomes HuigBregtse-Meyerink HG
Convert Convert is a broad category that deals with general requirements to convert one pattern of text with another.J James A. Smith IV becomes Smith JA 4th
Religious Religious titles include Mother, Sister, Father, Brother. Names with surnames are handled differently from those that have no surnames. Surname example: Sister Mary Hilda Miley becomes Miley MH
No-Surname example: Sister May Hilda becomes Mary Hilda Sister
For translated articles, e.g., from the French, Soeur becomes Sister.
Reduce Reduction rules cover the elimination of text with a single author name. It also handles the Reduction of a person's given name and marking of the Surname if present. Mr. John Smith becomes Smith J
John Smith MD becomes Smith J
Lowercase Some fields present all data uppercase. This rule simply converts to lower case all text that is uppercase. JOHN SMITH becomes Smith J
First Letter Upper Title and Author at times will require that the first letter of a specific word be uppercased, depending on other rules. JOHN SMITH becomes Smith J
Author Delimiter Many articles are by multiple authors who contributed to the paper, such as this one. This rule takes an OCR stream of text and creates a word list, a chain of words, and delimits where a particular author begins and ends in the complete chain of words. Example1:Glenn M Ford, John Smithbecomes: Ford GMSmith J(, is the delimiter here)Example 2:Glenn M. Ford, John Smith, and Susan O'Malley becomes: Ford GMSmith JO'Malley S(', and' is the trigger, which must precede in priority ',' as a triggered rule)

7. Lexical analysis to improve recognition

DIAU techniques generally rely on OCR conversion of the document image. Overcoming the limitations of OCR is an important stage in most of these techniques, and lexical analysis approaches are often used for this purpose. Our work in lexical analysis techniques is motivated by two problems observed in production. The first problem was the excessive number of highlighted characters (which were actually correct, but assigned a low confidence level by the OCR system, and hence highlighted on the screen.) The second problem was the large number of character errors in the detected affiliations field, a consequence of small font size and italic attribute in the printed text in that field. Both problems placed an additional burden on the reconcile operators to correct and verify the text. Two modules, developed to solve these problems and reduce the operator labor, exploit the specialized vocabulary found in biomedical journals. While the modules use different techniques, both employ specially selected lexicons to modify the OCR text that is presented to the reconcile operators. (A detailed description of our design of the workstations used by the reconcile operators to verify the bibliographic record appears in our extended report.33)

7.1 Lexical analysis to reduce highlighted words

7.1.1 Problem Statement

The OCR system in MARS was selected for its high rate of correctly recognized characters (high detection accuracy) and the very low number of incorrectly recognized characters that were assigned a high confidence value (low false positives). Confidence levels lie in a range between 1 and 9. Trading off the low percentage of false positives, we found that over 90% of words containing low confidence characters are actually correct, and that these characters should have been assigned a value of 9 by the OCR system. To draw the reconcile operators' attention to characters that may need correction, all low confidence characters are highlighted in red on the reconcile workstation screen. When these are mostly correct, the operators are unnecessarily burdened by having to examine and tab through them. Figure 7.1.1 shows a portion of the reconcile screen, with characters highlighted incorrectly, i.e., with the original confidence values from the OCR system.


Figure 7.1.1.

In this example, part of the bitmapped image of the abstract field is displayed at the top of the screen and the corresponding OCR output text is displayed at the bottom. Although all of the OCR text is correct in this example, many characters are highlighted in red. Our objective is to reduce this number of highlighted characters.

7.1.2 Approach

We seek to reduce the number of (incorrectly) highlighted characters by automatically increasing the confidence level of characters detected correctly by the OCR system. Our approach is to locate each word in the title and abstract fields that contains any low confidence characters, check for the word in a lexicon and, if the word is found, change the confidence of all its characters to 9, the highest value.

A study was undertaken to determine criteria (heuristic rules) for selecting words to be checked and a lexicon suitable for biomedical journal articles. The key element of the study was the creation of a ground truth dataset with which to compare lexicons and lookup criteria. The ground truth data consisted of 5,692 OCR output words containing low confidence characters extracted from journals already processed by the MARS system. Each of these words was compared to the corresponding word in the final, verified bibliographic record created by MARS to determine if the OCR word was correct or not. Candidate lexicons and lookup criteria were evaluated with the goal of removing low confidence values from ground truth words that were correct, while retaining the low confidence values for those words that were not correct. Removing low confidence values from correct words is the "benefit" of the module. Removing low confidence values from incorrect words is the potential "cost" of the module.

7.1.3 Experiments and results

Four candidate lexicons were created from various word lists maintained by the National Library of Medicine with the expectation that these would contain a preponderance of the biomedical words found in journal articles indexed in MEDLINE. The four lexicons and their combinations were tested along with several lookup criteria involving word length and character confidence levels. As expected, there was a tradeoff between benefit and cost. A large lexicon and no lookup restrictions removed low confidence values from over 90% of the OCR correct words (a 90% benefit), but also removed low confidence values from over 60% of the OCR incorrect words (a 60% cost). To ensure the integrity of the final text, it was considered on balance more important to minimize cost than to maximize benefit.

Three combinations of lexicons and lookup criteria resulted in acceptable costs of less than 0.5% and benefits greater than 40%. The final choice correctly removed low confidence values from 46% of the correct OCR words and incorrectly removed low confidence values from 0.4% of the incorrect OCR words. The selected lexicon consists of unique words derived from NLM's SPECIALIST Lexicon and UMLS Metathesaurus. There are two levels of lookup criteria: 1) Words less than four characters in length, or containing no alphabetic characters are not checked. 2) Words less than six characters in length are not checked if any of the confidence values are less than 7. All other words containing low confidence characters are compared to the lexicon. If the word is found, the confidence values for all the characters are changed to 9, the highest value.

7.1.4 Implementation

The lexicon checking module (Confidence Edit) was implemented at two phases of the project, originally for the first generation production system (MARS-1) and later for the current system. In MARS-1 we found that lexicon checking reduced the highlighted words on average from approximately 14% of the words presented for verification at the reconcile workstation to approximately 6.5%. This 50% reduction in highlighted words resulted in a 4% increase in production rate, and was reported in the literature60.

For the current (MARS-2) system we added 9,386 words to our original lexicon. These were obtained by extracting all the words found in the verified and corrected abstracts from over 27,000 journals (= 230,000 articles) processed from May 1997 to April 2001, and using the frequency of occurrence of each word during that period. New words occurring at a frequency of 50 or more were added to the lexicon. Remaining words that occurred at least twice were checked against nine electronic dictionaries that are available to NLM. If the word was found in Dorland's Medical Dictionary, in the Oxford Medical Dictionary, or in at least six other dictionaries, it was added to the lexicon. When the new lexicon was tested with the original ground truth data, a 4% improvement in benefit with no increase in cost was measured. Using the expanded lexicon, Confidence Edit has reduced the percentage of highlighted words in the abstract field to approximately 3.5%. Figure 7.1.2 illustrates the effect of processing by Confidence Edit on the same document shown in Figure 7.1.1. Most of the characters are no longer highlighted.


Figure 7.1.2 Improvement due to Confidence Edit module

7.2 Lexical analysis to improve recognition of Affiliations

7.2.1 Problem Statement

As noted previously, although the commercial OCR system used by MARS performs well in general, accuracy is often poor for small fonts or italic characters. In particular, authors' affiliations frequently appear as small and/or italic characters, resulting in many incorrect characters in the affiliation field. Consequently, the final check and correction of the affiliation field requires a disproportionate amount of human labor compared to other fields extracted by our automated system. We observed that for about one in five affiliations, there were so many highlighted words that the operators preferred to type the entire affiliation rather than examine and correct each word. Not only was this time consuming, but also represented a potential source of error in the completed bibliographic record. In this section we describe the experiments that led to the design and implementation of the PatternMatch module that corrects words in the affiliation field.

7.2.2 Approach

Words that frequently appear in the affiliation field in biomedical journals are drawn from a relatively small vocabulary that denote institutions and their divisions, such as University and Department, the various branches of medicine and biology, such as Pathology and Biophysics, and names of cities, states and countries. A study was undertaken to determine if partial string matching or other matching techniques could exploit the limited vocabulary to reliably find the correct word from a lexicon of affiliation words given an OCR output word containing low confidence characters.

As in the previously described problem, a key component of this study was a ground truth dataset to compare matching techniques and lexicons. Words containing low confidence characters and the confidence values for all characters were extracted from the OCR output text in the affiliations field of over five thousand journal articles processed by the MARS system. Human operators selected the corresponding correct word from the affiliation fields of the completed bibliographic records. After words that contained no alphabetic characters were removed, the ground truth data consisted of over 20 thousand triplets, where a triplet consists of an OCR output word, the confidence values for the characters in the word and the correct word. Over 60% of the OCR words in the ground truth set are correct.

Correct words and a count of their occurrences were extracted from the final, corrected affiliation field of approximately 230,000 journal articles that had already been processed by the MARS system. This set of journal articles is different from the set used to create the ground truth data. There were 96,982 unique words of 2 or more characters that occurred one or more times in this historical data. This list of words is the basis for candidate lexicons of affiliation words.

7.2.3 Experiments

Six matching techniques61 were investigated using the ground truth data and the complete lexicon of affiliation words. Our goal was to find a technique to achieve a high match rate, a low false positive rate and fast processing. Most of the techniques return more than one potential match from a lexicon. If the correct word is among the list of returned words, it is considered a match. If no words are returned, it is considered no match. If words are returned, but none of them are the correct word, it is considered a false positive. The six techniques tested are:

Whole Word Matching. This technique compares the entire OCR output word with each word in the lexicon. Either one word is returned, or none is. Because over 60% of the OCR words in the ground truth set are correct, the match rate was reasonably high. However, when using the complete lexicon, the rate of false positives was also high because a single OCR error or omission can result in a word that is found in the lexicon. For example, several non-English variations of the word University are included in the lexicon, including Universit, Universita, Universitaire, Universitat, and Universite. If the OCR word is "Universit", whole word matching will find "Universit" even though the actual word was "University" or one of its other variations.

Partial Matching with wild card letters. In this technique, a match is sought with the "wild card" character '.' (a period) substituted for one or more of the low confidence characters in the OCR output word. Thus Partial Matching can find words in the reference dictionary where the OCR word has one or more character errors, but whose length is correct. For example, if we have an OCR word, Deparlmemt, with confidence values, 9699878956, a match would be found for Depar.me.t or D.par.me.t, but not for D.parlme.t or D.par.memt. For the same reasons as for Whole Word Matching, the false positive rate was high. In addition, the method does not find a correct match if the number of characters in the OCR word is incorrect.

Near-neighbor Matching. This technique finds all of the words in the lexicon that are within a given Hamming distance of the OCR word. Hamming distance is a measure of the difference between two finite strings of characters, expressed as the number of characters that need to be changed to obtain one from the other. For example, "Deparlmemt" and "Department" have a Hamming distance of two. A high match rate could be achieved by specifying a large Hamming distance, but this also resulted in a high false positive rate. Near-neighbor Matching also fared poorly when the OCR word length was incorrect.

Soundex Matching. Soundex matching finds words in the lexicon that are phonetically similar to the OCR output word. This technique proved to have a number of difficulties in overcoming character substitutions caused by the OCR system. For example, the word Department would often be interpreted as Deparlment by the OCR engine. These two words are phonetically quite different since the letters 't' and 'l' do not have similar sounds, and the correct word, Department, would not be returned.

Bi-gram Search. This technique was adapted from software developed inhouse to suggest spelling alternatives for online Library clients. A bi-gram is a pair of adjacent characters in the word being analyzed. For example, Department contains nine bi-grams: De, ep, pa, ar, rt, tm, me, en, nt. For bi-gram searches, each word in the lexicon is searched for all possible bi-grams in the OCR word. Lexicon words containing multiple bi-grams in the OCR word are possible matches. Long OCR words can result in a large number of possible matches. The bi-gram prunes out unlikely candidate words through an algorithm that considers lexicon word length, OCR word length and the number of matching bi-grams. Bi-gram searches were less sensitive to OCR word length than some other techniques and resulted in a relatively high match rate. The false positive rate was also high, because one or two incorrect characters in the OCR word could result in many incorrect possible matches.

Probability Matching. Probability Matching62 was developed inhouse specifically to address the problem of poor OCR recognition of words in the affiliation field. The first step of this technique compares the OCR output word to every word in the lexicon using an edit distance based on OCR character substitution frequencies, and assigns a confidence value to each lexicon word based on the probability that the OCR word will be produced when the true word is the lexicon word. Each word receives a score that is the product of the calculated confidence and the frequency of occurrence in the lexicon. Words are then ranked according to this score, and a specified number of highest-ranking words are returned.

Probability Matching achieved the highest match rate, lowest false positive rate, and slowest processing time of the techniques tested. However it is slow: because it compares the OCR word with every word in the lexicon, it can take up to 1 second per word depending on word length, computer speed and lexicon size, with larger lexicons producing better matching, but slower processing.

Initial experiments suggested further study toward an acceptable multi-stage process in which "easy" words are matched reliably by a faster process, and the remaining words are matched using Probability Matching. Combinations of matching techniques and lexicon sizes were tested in an effort to reduce the false positive rate and the processing time while maintaining the high match rates that had been observed.

7.2.4 Results

The final choice was a cascaded matching process that capitalizes on the fact that over half of the OCR words are correct. It was found that trimming the lexicon to include only the most frequently occurring words significantly reduced the false positive rate of Whole Word Matching. Probability Matching with a larger lexicon for words not found by Whole Word Matching then yielded acceptable overall results with less impact on average processing time.

The first step in our cascade matching is Whole Word Matching with a lexicon of 1,948 affiliation words with an occurrence of 100 or more in the historical data. Step 1 correctly matches 45% of the ground truth data, with a false positive rate of 1.4%. The second step is Probability Matching with the entire lexicon of 43,030 affiliation words with an occurrence of 2 or more in the historical data. Step 2 correctly matches 77% of the remaining 55% of the ground truth data, with a false positive rate of 16.7%. The overall performance of cascade matching was a match rate of 86% (with the correct word ranked highest for 81% of the words), a false positive rate of 11% and an average processing time of approximately 250 ms on a 500 MHz Pentium III. The still high false positive rate has implications for the way that potential substitute words are presented to the reconcile operator.

7.2.5 Implementation

Word matching for low confidence words in the affiliation field was implemented through two software development efforts61. A separate console program called PatternMatch was developed to automatically parse low confidence OCR words from the identified affiliation field, submit the words to the cascade matching algorithm for possible correct words, and, if any words are returned, insert those words into the affiliation text following the original OCR word and specially tagged for final text verification. PatternMatch was developed in C++ in the Microsoft Visual Studio development environment. In addition to the MARS database for input and output records, PatternMatch requires three files: the large and small lexicons and the character substitution frequency matrix used by Probability Matching.

Additional software was developed for text verification to support the interpretation of the special tags placed in the affiliation text by PatternMatch and the display of the word choices to the reconcile operator. This is treated in detail in our extended document33. PatternMatch has enabled the operators to select first match choice 80.8% percent of the time, select one of the other match choices 8.8% of the time, no word from the list 6.7% of the time, and the original OCR word 3.7% of the time.

8. DIAU: advantage

Our expectation that the DIAU-based automated processes described in this report would substantially improve productivity is borne out by performance data. These results are calculated from instrumentation data recorded while the production system is operating. In Section 11 of our extended report,33 we present performance evaluation results, e.g., average time taken by automated and manual processes on a per citation basis; time taken to perform different functions in the scanning and edit operations; workload distributions; and other data in tabular and graphic forms.

Here we show the value of applying document image analysis and understanding to the problem of bibliographic data extraction from digitized documents. The bar chart below indicates the labor needed to produce bibliographic records by three methods, at increasing levels of automation: (a) the completely manual keyboarding operation (conventionally done by almost all database producers), (b) the first generation MARS-1 system that used scanning and OCR for converting the abstract, with all other fields entered manually, and (c) the current (MARS-2) system that employs DIAU-based modules for data extraction. Instrumentation data is used to compute the figures for the MARS systems, and data as provided in contractor reports for the keyboarding operation.

In Figure 8.1, the data is presented as the labor-minutes taken to produce 600 records, one day's workload. It may be seen that the MARS-2 system, as a consequence of its automated processes, needs less than 25% of the labor of the keyboarding operation.


Figure 8.1.

9. Next tasks

In addition to the next steps outlined in individual sections, we seek to initiate new projects: creating a ground truth database for research in document image analysis and understanding; techniques to extract bibliographic data automatically from online journals; and an alternative method for text verification.

9.1 Ground truth data: PathFinder

9.1.1 Introduction and objective

Research in document image analysis is greatly dependent on ground truth data for the design, training and testing of algorithms for data identification and extraction. However, ground truth datasets and their associated analysis and visualization tools are usually created to analyze problems in specific applications and datatypes: skewed document images (Okun et al.),63 handwritten documents (Cha and Srihari),64 video sequences (Doermann and Mihalcik),65 statistical data (Swayne et al.),66 and speech signals (Barras et al.).67 Moreover, apart from the domain-specific nature of these datasets and tools, they are usually limited as to operating platforms and data formats, as described in an excellent taxonomy on this subject by Kanungo et al.68

To our knowledge no ground truth dataset exists that represents the corpus of biomedical journals, and none with the goal of extracting the text representing the bibliographic fields descriptive of the articles within these journals.

In meeting this challenge, the main objective of the PathFinder (Public Archive To Help Find New Designs for Expert Ratiocination) project is to exploit the vast amounts of document images and OCR-converted and operator-verified data, collected in the MARS project, to aid in the development of innovative and efficient algorithms for automatic zoning, labeling and reformatting by the computer science and medical informatics communities. This is to be done by developing a ground truth database accessible via the Web. This ground truth data will include page, zone, line, word, and character level information. In addition to providing a public site for researchers worldwide to develop and test their algorithms, we propose a tool to enable them to graphically visualize the ground truth data and employ an automated analysis assistant. Code-named Rover (gROundtruth Vs. Engineered Results), this automated analysis assistant will compare the results of a researcher's program to the ground truth data. The ground truth and results data will be in XML, and Rover will be written in Java. The overall website development will use MacroMedia Dreamweaver UltradDev 4 to provide a rich interface and extensive database connectivity.

9.1.2 Design considerations

Page layout. Identifying geometric features to design algorithms for automated data extraction is a non-trivial task since (as discussed earlier in this report) there is a variety of layout geometries in the 4,300 journal titles indexed in MEDLINE, though most follow the reading order paradigm of article title-author names-author affiliation-abstract. Ground truth data must therefore include samples of all significant layout types described in Section 5.

Data format. XML for images, OCR-converted data and operator-verified data since XML excels in adaptation (to accommodate changes in data), maintenance, linking (from one piece of data to another), simplicity, and portability over networks, operating systems and development environments.72-74

Modifying existing data. Despite its undeniable value, the existing data has certain deficiencies. For example, although OCR output characters in error are corrected, their attributes (e.g., italics or bold) are not. But these attributes can serve as features to create algorithmic rules to correctly identify zones or labels. These will be corrected before inclusion in the ground truth dataset.

9.1.3 Rover: a visualization and analysis tool

Once the data is available in XML format along with the original TIFF image, researchers need the functionality to:

  • Load a TIFF image and the corresponding XML file into an application, and display the XML data, where appropriate, overlaid on the image.
  • Modify and add new ground truth data.
  • Compare the results of new algorithms against the ground truth data.

The first two features are provided by TrueViz, described in a survey of visualization tools68. TrueViz is a public domain tool developed at the University of Maryland for visualizing, creating and editing ground truth and metadata. It is implemented in Java and works on Windows and Unix platforms (specifically, it has been successfully tested in the Windows 2000 and Sun Solaris 2.6 environments). It reads and stores ground truth and metadata in XML format, and reads the corresponding TIFF images. It has the ability to allow the user to inspect ground truth data at many levels, viz., at the page, zone, line, word, and character levels, and provides pertinent information at each level. For example, at the character level, such information includes the character code, font type and style, and bounding box (x1,y1,x2,y2) coordinates.

TrueViz is a suitable platform to initiate the design of Rover. To serve as an effective analysis assistant, Rover will extend TrueViz to provide researchers the capability to compare their XML data to ground truth XML data graphically (Figure 9.1.1), and thereby help them iteratively refine their algorithms. In addition, Rover will provide statistics and visual presentations on specified areas. For example, the user would be able to use Rover to compare all characters in the dataset that are bold, and to enumerate errors. Rover would visually locate the mistakes and report statistics on the query. In this example it would report the number of bold characters, the number detected correctly, and the number detected incorrectly, both as absolute numbers and as percentages. Rover would also export this information to a database or spreadsheet for further analysis.


Figure 9.1.1 Rover screen snapshot of TIFF page and zone segments.

9.1.4 Ground truth distribution

A website will provide access to the ground truth data, though it will go beyond serving as a simple data repository. Our objective is to encourage researchers to share and exchange ideas, as well as provide feedback to our development team for new desirable features. Figure 9.1.2 shows the organization of the website and how the elements interconnect. This section presents an overview of the website layout and functionality.


Figure 9.1.2 - Main sections of Website and tools

Product introduction. A Macromedia Quick Flash "movie" is displayed to the users, though this may be bypassed. We have already developed this movie and a demonstration of it is available at www.geocities.com/egalthan/Movie1.html. The metaphor used is a Sherlock Holmes style magnifying glass moving over a rare biomedical document.

User login. On the main page of the website, users are asked for a name and password. This data will be stored in a SQL Server 2000 database along with the website data. We intend to use Macromedia Ultradev as the design tool, because it readily allows for password protection and database connectivity. While we do not anticipate restricting anyone from accessing this site, we intend to require all users to register. When registered users log in, they will be taken to the main user page. Unregistered users will be presented with a registration page before being allowed access.

New user registration. The system will require first time users to enter a few important items of information that will enable us to provide security to the system as well as the ability to contact users with important new data or features added to the website in the future. The registration page will be developed using Ultradev which allows the developer a graphical design environment and provides rapid page development.

Data access. Once logged in or registered, the user would have access to all the ground truth data and tools. Since the data collection will be quite large, the system will allow users to download the entire data set or any subset they choose. For example, some users may only be interested in certain types of journal layouts, such as double column abstracts.

Data analysis. The users will be provided a link to launch Rover. While the initial version of this tool will possess the current functionality of TrueViz, the complete analysis support discussed earlier will be designed in a later phase.

Bulletin board. This is for the user community to report bugs, and use as a forum for discussions related to algorithmic development. Here the users may also upload and download files, such as technical papers written, algorithm source code, and new ground truth data.

9.2 Data extraction from online journals

9.2.1 Background

With the rapid expansion and utilization of the Internet and Web technologies, there is an increasing number of online biomedical journals, some of which are being indexed in MEDLINE. Online journals offer new opportunities in data mining, content extraction, creation of database records and related applications. However, there are new challenges also, and DIAU research is needed to capture, classify, analyze, extract, modify, and reformat Web-based document information for computer storage, access, and processing. We propose research leading to the design and development of an automated system, temporarily code-named WebMARS for Web-Based Medical Article Records System, to create citation records for MEDLINE. This system69 will download and classify document articles on the Web, parse and label contents of each article, extract and reformat the bibliographic fields to comply with MEDLINE conventions.

9.2.2 Project objectives

The objectives are to research DIAU techniques leading to the design, development, evaluation and implemention of a practical system to automatically extract bibliographic data from online journals for the MEDLINE database.

9.2.3 Project significance

With the increasing role of online journals in biomedical publishing, it is imperative that methods be developed to automatically extract bibliographic data with a minimum of human labor. Two advantages of such a Web-based document analysis and content extraction approach compared to either manual keyboard entry or the scanning/OCR approach are saving labor costs and improving performance (speed and accuracy).

9.2.4 Approach

The functions in the proposed WebMARS system would be implemented by automated processes hosted by seven servers, and operators at two types of workstations. A high-level schematic is shown in Figure 9.2.1.

Briefly, the WebMARS system will work as follows. An operator downloads an online journal from a publisher's website, and the PDF or HTML files of the articles are sent to the file server. The PDF to HTML files conversion server performs text conversion. The article zones creation and labeling server extracts text zones from the HTML files, and labels these zones as belonging to "title", "author", "affiliation", "abstract", "pagination", "databank accession number", "grant number", etc. The article zone contents modification server extracts, modifies, and reformats the text according to MEDLINE database conventions. The MEDLINE article zones creation and Web-based reconciling workstation selects and combines the appropriate portions of zone contents to create records that are viewed by operators for validation and reconciling. The reconciling workstation operator then checks, proofreads, and corrects any remaining problems in the records including character and symbol validation, citation reformatting, field label selection and confirmation. The SGML-based MEDLINE citation records creation server generates text zones directly from journal publisher SGML information, if available, that can be used to further improve labeling process. Finally, the MEDLINE citation records uploading server uploads the completed citation records to the NLM's DCMS database for later indexing.

9.2.5. Subsystems

Each subsystem of the WebMARS system is described below.

Web files collection and classification workstation. This workstation downloads and classifies Web-based files as abstract, full text, PDF or image files70. It consists of two processes: downloading and classification. The first is based on WinInet software tool and a combination of the Breadth First Search algorithm and the Constraint Satisfaction method to traverse the Web page's links, recognize and generate the successors of the downloading pages. The second process relies on the contents of the hyperlinks (name and address) to classify the files. Details appear in our extended report.33

PDF to HTML files conversion server. Occasionally, publishers offer certain Web journal articles as PDF files. Since WebMARS requires the HTML file format, this conversion server subsystem is used to convert the PDF files into HTML files.

Article zones creation and labeling server. To parse and create text zones for each journal article based on its HTML file, and then label these text zones. The subsystem consists of three modules: the automated zoning module, the automated labeling module, and the automated updating module, discussed below.

The automated zoning module relies on HTML tags to parse and then to create text zones for each article of a journal issue.

The automated labeling module labels these text zones using a combination of statistics and fuzzy rule-based technology. Statistics are used to generate membership functions for the fuzzy rules-based method. Features and fuzzy rules derived for the automated labeling module are based on an analysis of the HTML layouts of journals. Both geometric and non-geometric features are considered here. Geometric features of a zone are based on location and order of appearance. Non-geometric features are derived from contents of zone, statistics, and font characteristics. Since text zones are characterized by the words they contain, word matching is an important operation in this module. For example, a zone has a higher probability of being labeled as "affiliation" when it contains words related to country, city, and school names. Fourteen word lists have been created, and the ternary search tree algorithm53 is used as a search engine for the word matching.

The automated updating algorithm performs a difference analysis between text zones generated by this subsystem and corresponding text zones corrected by operators during the reconcile stage. Statistical labeled text zone information obtained from this analysis can be used to improve the labeling accuracy.

Article zone contents modification server. This subsystem extracts, modifies, and reformats text in labeled zones according to MEDLINE database conventions. It consists of two modules: the label cleanup module and the cleanup coach module. The first prepares bibliographic record information for the reconcile operator, while the second operates on post-reconcile data to collect statistics in order to develop new rules to improve the performance of the first module. Details appear in our extended report.

MEDLINE article zones creation and Web-based reconciling workstation. This workstation selects and combines the appropriate portions of zone contents to create MEDLINE citation records for text verification.

MEDLINE citation records uploading server. This subsystem creates an XML file based on NLMCommon DTD71 for completed citation records of a journal issue and uploads the file to the NLM DCMS database for later indexing.

SGML-based MEDLINE citation records creation server. To exploit publisher supplied information for zoning and to improve the labeling process.

9.2.6 Summary

We propose research here to design a system that will automatically extract bibliographic records directly from online journals. A prototype has been evaluated using a sample of 28 medical journal Web sites. Preliminary results show the feasibility of our design to create MEDLINE citations directly and effectively using Web document pages.


Figure 9.2.1. WebMARS system diagram (Bold outlines show servers)

9.3 Alternative method for text verification

The conventional approach to verifying the text output of any OCR system, or a DIAU-based system such as MARS, is to present the text in the same sequence as it appears on the printed page, and to highlight the low confidence characters (in color) in the text words. The operator then "tabs" quickly from one suspected character to the next and make the necessary corrections. This conventional approach has some drawbacks. For example, the operator must detect the suspected character surrounded by a mass of correct text. Also, the text must be corrected as encountered, thereby breaking the rhythm of identifying incorrect characters.

An alternative method is proposed that may prove to improve operator productivity. Called Carpet Character Certification and Correction, or the "Carpet" method, it involves grouping like characters (drawn from a number of pages or journal issues at the same time) and displaying them together in groups in a single window, as shown in Figure 9.3.1. Each character appears in its "edit box." Only low confidence A - Z and 0 - 9 characters, of the same type, would be displayed in groups. The example shows a set of characters in the edit boxes, numeric 1's. A "5" in one of the bitmapped images above the boxes has been misread as a "1" as shown. Since context is important to detect poorly captured character shapes, the system must provide the display of the image fragment (a word or phrase) that provides the context in which the (presumably) incorrect character appears, as is shown. Such context will particularly help distinguish letters or numbers that appear similar, e.g., 1, I or 0,O.

The GUI for the Carpet system must possess the following functions:

  1. An Automatic Context display of the TIFF image must be available when any of the edit boxes is focused on.
  2. With a mouse click on an edit box, or using the F1 thru F9 keys, the operator should be able to correct the character.
  3. To continue, the operator should be able to select Next>> and load the next batch of 9 characters, or select <<Previous for a second look at the previous 9. By selecting Next>>, all characters are set to high confidence.
  4. Reset all characters displayed to low confidence, in case of an accidental clicking of the Next>> button.
  5. Click OK to stop processing.

We propose to develop the Carpet system using Visual C++ with the Kodak Image libraries. Then we will conduct a performance study using this system, and measure the residual error rate and the time taken for correcting and verifying all the characters from a complete journal issue at a time. Comparisons will be made to conventional text verification methods.


Figure 9.3.1. GUI for Carpet Character Certification and Correction

10. References

1. Thoma GR, Walker FL, Hauser SE. Biomedical document preservation by electronic imaging. In: Converting Information for WORM Optical Storage, Roth JP, ed., Meckler, 1990; 92-131.

2. Hauser SE, Thoma GR, Gass SI. Factors affecting the conversion rate of bound volumes to electronic form. Proc. Electronic Imaging '89, Boston MA, 1989; 1041-6.

3. Thoma GR, Hauser SE, Walker FL. Managing an archive of electronic document images. Proc. ASIS, vol. 26, 1989; 59-65.

4. Thoma GR, Cookson JP, Walker FL, Hauser SE. Interfacing optical disks to a document image storage and retrieval system. J. Imaging Technology, vol. 12, no. 5, October 1986; 288-92.

5. Thoma GR, Long LR, Berman LE. Design issues for a digital x-ray archive accessed over Internet. Proc. SPIE: Storage and Retrieval for Image and Video Databases II, vol. 2185; San Jose CA: 1994; 129-38.

6. Hauser SE, Hsu W, Thoma GR. Request routing with a back error propagation network. Proc. SPIE: Applications of Artificial Neural Networks IV, vol. 1965; Orlando FL: 1993; 689-95.

7. Thoma GR, Hauser SE, Le DX, Muller D, Walker FL. WILL: Advances in the management of interlibrary loan. Vol. 1. Internal technical report. Bethesda, MD: National Library of Medicine, Lister Hill National Center for Biomedical Communications, Communications Engineering Branch, September 1995; 26 pp.

8. Thoma GR, Hauser SE, Le DX, Muller D, Walker FL. WILL: Design of a standalone WILL unit. Vol. 2. Internal technical report. Bethesda, MD: National Library of Medicine, Lister Hill National Center for Biomedical Communications, Communications Engineering Branch, September 1995; 27 pp.

9. Walker F, Thoma GR. DocView: Providing access to printed literature through the Internet. Proc. 10th Integrated Online Library Systems Meeting, New York: Learned Information, Inc. May 1995; 165-73.

10. Thoma GR, Walker FL. Access to document images over the Internet. In: Rada R, Ghaoui C, eds. Medical Multimedia; Oxford UK: Intellect, 1995; 179-93.

11. Walker FL, Thoma GR. Multimode delivery of multimedia information. Proc. 14th National Conference on Integrated Online Library Systems, Medford NJ: Information Today Inc.; May 1999; 141-52.

12. Walker FL, Thoma GR. Access to document images over the Internet. Proc. 9th Integrated Online Library Systems Meeting, New York. May 1994; 185-197.

13. Le DX, Thoma GR, Wechsler H. Document image analysis using integrated image and neural processing. Proc. 3rd ICDAR'95, Montreal, Canada; Aug 1995; Vol. I, 327-30.

14. Le DX, Thoma GR, Wechsler H. Classification of binary document images into textual or non-textual data blocks using neural network models. Machine Vision and Applications, Issue 8, 1995, 289-304.

15. Le DX, Thoma GR, Wechsler H. Automated page orientation and skew angle detection for binary document images. Pattern Recognition, Vol. 27, No. 10, October 1994; 1325-44.

16. Le DX, Thoma GR. Automated portrait/landscape mode detection on a binary image. Proc. SPIE Symposium on Aerospace and Remote Sensing - Visual Information Processing II, Vol. 1961, Orlando FL, April 1993, 202-12.

17. Hauser SE, Cookson TJ, Thoma GR. Using back error propagation networks for automatic document image classification. Proc. SPIE Applications of Artificial Neural Networks IV, Vol. 1965, Orlando FL, April 1993, 142-50.

18. Le DX, Thoma GR. Document skew angle detection algorithm. Proc. SPIE Symposium on Aerospace and Remote Sensing - Visual Information Processing II, Vol. 1961, Orlando FL, April 1993, 251-62.

19. Le DX, Thoma GR, Wechsler H. Document classification using connectionist models. Proc. IEEE International Conference on Neural Networks, Orlando FL, June 1994, vol. 5; 3009-14.

20. Handley JC. Table analysis for multi-line cell identification. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 34-43.

21. Hu J, Kashi R, Lopresti D, Wilfong G. Table structure recognition and its evaluation. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 44-55.

22. Hurst M. Layout and language: an efficient algorithm for detecting text blocks based on spatial and linguistic evidence. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 56-67.

23. Tan CL, Yuan B. Document text segmentation using multi-band disc model. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 56-67.

24. Yuan Q, Tan CL. Page segmentation and text extraction from gray scale image in microfilm format. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 323-32.

25. Zhou YP, Tan CL. Hough-based model for recognizing bar charts in document images. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 332-40.

26. Lee S. Recognizing handwritten electrical circuit symbols with attributed graph matching. Structured Document Analysis, 1992, 340-358.

27. Reiher E, et al. A system for efficient and robust map symbol recognition. Proc. 13th IAPR, 1996, 783-7.

28. Yu Y, Samal A, Seth S. Automatic segmentation of engineering drawings with symbols and connections. Proc. 3rd IAPR Int. Conf. Document Analysis and Recognition (ICDAR '95), 1995, 791-4.

29. Srihari SN. Computer Text Recognition and Error Correction. IEEE Computer Society, 1984.

30. Fateman RJ. How to find mathematics on a scanned page. Proc. SPIE: Document Recognition and Retrieval VII, Vol. 3967, San Jose CA, January 2000, 98-109.

31. Lopresti DP. A comparison of text-based methods for detecting duplication in document image databases. Proc. SPIE: Document Recognition and Retrieval VII, Vol. 3967, San Jose CA, January 2000, 210-21.

32. Thoma GR. Automating data entry for an online biomedical database: a document image analysis application. Proc. 5th International Conference on Document Analysis and Recognition (ICDAR'99). Bangalore, India; Sept. 1999; 370-3.

33. Automating the production of bibliographic records for MEDLINE. An R&D report of the Communications Engineering Branch, LHNCBC, NLM. September 2001. 91pp. In http://archive.nlm.nih.gov/~thoma/mars2001.pdf

34. Jain AK, Yu B. Document representation and its application to page decomposition. . IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 20, 1998, 294-308.

35. Nagy G, Seth S, Viswanathan M. A prototype document image-analysis system for technical journals. Computer, Vol. 25, 1992, 10-22.

36. O'Gorman L. The document spectrum for page layout analysis. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 15, 1993, 1162-73.

37. Niyogi D, Srihari SN. Integrated approach to document decomposition and structual analysis. International Journal of Imaging Systems and Technology, Vol. 7, 1996, 330-42.

38. Farrow GSD, Xydeas CS, Oakley JP, Khorabi A, Prelcic NG. A comparison of system architecture for intelligent document understanding. Signal Processing-Image Communication, Vol. 9, 1996, 1-19.

39. Kanai J, Rice SV, Nartker TA, Nagy G. Automated evaluation of OCR zoning. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 17, 1995, 86-90.

40. Krishnamoorthy M, Nagy M, Seth S, Viswanathan M. Syntactic segmentation and labeling of digitized pages from technical journals. IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 15, 1993, 737-47.

41. Baird HS. Model-directed document image analysis. Proc. 1999 Symposium on Document Image Understanding Technology, College Park, MD: University of Maryland Institute for Advances in Computer Studies, 42-9.

42. Le DX, Kim J, Pearson GF, Thoma GR. Automated labeling of zones from scanned documents. Proc. 1999 Symposium on Document Image Understanding Technology, College Park, MD: University of Maryland Institute for Advances in Computer Studies; 219-26.

43. PrimeOCR Access Guide, Version 3.0. Prime Recognition, 1998.

44. Hauser SE, Le DX, Thoma GR. Automated zone correction in bitmapped document images. Proc. SPIE: Document Recognition and Retrieval VII, Vol. 3967, San Jose CA, January 2000, 248-58.

45. Hones F, Lichter J. Layout extraction of mixed mode documents. Machine Vision and Applications 7, 1994, 237-46.

46. Taylor S, Fritzson R, Pastor J. Extraction of data from preprinted forms. Machine Vision and Applications 5, 1992, 211-22.

47. Tsujimoto S, Asada H. Major components of a complete text reading system. Proc. IEEE, Vol. 80, No. 7, 1992, 1133-49.

48. Tateisi Y, Itoh N. Using stochastic syntactic analysis for extracting a logical structure from a document image. Proc. IEEE Int. Conf. Neural Networks, Vol. 2, 1994, 391-94.

49. Niyogi D, Srihari SN. An integrated approach to document decomposition and structural analysis. Int. Journal of Imaging Systems and Technology, Vol. 7, 1996, 330-42.

50. Le DX, Thoma GR. Page layout classification technique for biomedical documents. Proc. World Multiconference on Systems, Cybernetics and Informatics (SCI 2000); Orlando, FL; Vol. 10, July 2000; 348-52.

51. Le DX, Thoma GR. Automated document labeling using integrated image and neural processing. Proc. World Multiconference on Systems, Cybernetics and Informatics, Orlando, FL; Vol. 6, 1999; 105-8.

52. Kim J, Le DX, Thoma GR. Automated Labeling in Document Images. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 111-22.

53. Bentley J, Sedgewick B. Ternary search trees. Dr Dobb's Journal, April 1998, 20-25.

54. Nagy G. At the frontiers of OCR. Proc. IEEE, Vol. 80, No. 7, 1992, 1093-1100.

55. Hush DR, Horne BG. Progress in supervised neural networks - what's new since Lippmann? IEEE Signal Processing Magazine, 1993; 8-39.

56. Lippmann. An introduction to computing with neural nets. IEEE Acoustics, Speech and Signal Processing Magazine 4(2), 1987; 4-22.

57. Zurada JM. Introduction to Artificial Neural Systems, West Publishing Company, St. Paul, Minnesota, 1992.

58. Hall PAV, Dowling GR. Approximate string matching. ACM Computing Surveys, 1980; 12:381-402.

59. Ford GM, Hauser SE, Thoma GR. Automatic reformatting of OCR text from biomedical journal articles. Proc.1999 Symposium on Document Image Understanding Technology, College Park, MD: University of Maryland Institute for Advances in Computer Studies; 321-25.

60. Hauser SE, Browne AC, Thoma GR, McCray AT. Lexicon assistance reduces manual verification of OCR output. Proc. 11th IEEE Symposium on Computer-based Medical Systems. Los Alamitos, CA: IEEE Computer Society. June 1998; 90-5.

61. Ford G, Hauser SE, Le DX, Thoma GR. Pattern matching techniques for correcting low confidence OCR words in a known context. Proceedings of SPIE, Vol. 4307, Document Recognition and Retrieval VIII, January 24-25, 2001, pp. 241-9.

62. Lasko TA, Hauser SE. Approximate string matching algorithms for limited-vocabulary OCR output correction. Proceedings of SPIE, Vol. 4307, Document Recognition and Retrieval VIII, January 24-25, 2001, pp. 241-9.

63. Okun O, et al. An experimental tool for generating ground truths for skewed page images. Proc. SPIE Document Recognition and Retrieval VIII. Vol. 4307, San Jose CA, January 2001, 22-33.

64. Cha S-H, Srihari SN. Handwritten document image database construction and retrieval system. Proc. SPIE Document Recognition and Retrieval VIII. Vol. 4307, San Jose CA, January 2001, 13-21.

65. Doermann D, Mihalcik D. Tools and techniques for video performance evaluation. Proc. 15th International Conference on Pattern Recognition. Barcelona, Spain, September 2000, 167-70.

66. Swayne DF et al. Xgobi: interactive dynamic data visualization in the X window system. Journal of Computational and Graphical Statistics 7, 1998.

67. Barras C, et al. Transcriber: a free tool for segmenting, labeling and transcribing speech. Proc. 1st Int. Conf. Language Resources and Evaluation. Granada, Spain, May 1998; 1373-76.

68. Kanungo T, et al. TRUEVIZ: A groundtruth/metadata editing and visualizing toolkit for OCR. Proc. SPIE Document Recognition and Retrieval VIII. Vol. 4307, San Jose CA, January 2001, 1-12.

69. Le DX, Tran LQ, Chow J, Kim J, Hauser SE, Moon CW, Thoma GR. Automated medical records citation records creation for Web-based online journals. Proc. 14th IEEE Symposium on Computer-Based Medical Systems, Los Alamitos, CA: IEEE Computer Society. July 2001, 315-20.

70. Tran LQ, Moon CW, Le DX, Thoma GR. Web page downloading and classification. Proc. 14th IEEE Symposium on Computer-Based Medical Systems. Los Alamitos, CA: IEEE Computer Society. July 2001, 321-6.

71. "NLM to use XML and DTD for MEDLINE Data", 10/13/00, www.nlm.nih.gov/news/MEDLINEdata.html; the NLM Common DTD is given in www.nlm.hin.gov/bsd/licnesee.html.

72. Desai G and Fenner J. Unleash the power of XML. Imaging and Document Solutions, Vol. 9, No. 12, Dec 2000, 29-32.

73. Pearson G, Moon CW. Bridging two biomedical journal databases with XML: A case study. Proc. 14th IEEE Symposium on Computer-Based Medical Systems. Los Alamitos, CA: IEEE Computer Society. July 2001, 309-14.

74. World Wide Web Consortium, XML Working Group, "XML 1.0 Specification 10-February-1998", ed. T. Bray, J. Paoli, C.M. Sperberg-McQueen. Available in various forms: www.w3.org/TR/1998/REC-xml-19980210.

11. Questions for the Board

1. For continued research toward automated extraction of fields not handled by our current techniques, especially in journals that exhibit irregular layouts, on which of these paradigms would the Board suggest we focus: artificial neural networks (back propagation, radial basis functions, or others), advanced template matching, decision trees, Bayesian networks or others?

2. To the Board's knowledge are there commercial OCR systems, classifiers or OCR toolkits that have been reported to reliably capture (correction detection > 95%, false alarm rate < 0.1%) biomedical symbols and Greek letters?

3. Is the ground truth data we propose to make available to the computer science and informatics research communities likely to be useful?

4. Does the Board believe there are other applications of document image analysis and understanding techniques useful to NLM or the biomedical community?



    Return to top of page

CEB Home | CEB Projects | Related Work | Publications | Repositories | NHANES | Site Index

U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD 20894
National Institutes of Health | U.S. Dept. of Health and Human Services
Copyright information | Privacy policy | NLM Accessibility
USA.gov | Need a plug-in? | RSS

URL: http://archive.nlm.nih.gov/pubs/reports/mars2001-DIAU/mars2001-DIAU.php
Last updated May 29, 2003

Send questions or comments about this site to