CEB Projects |
Automating the production of bibliographic records for MEDLINE
5.2 Evaluation of automated zoning
Following initial testing and refinement, the zoning algorithm was tested with a set of page images from 59 journal issues that would become the first set of journals to be processed by the MARS-2 system. Journals selected had a page layout in which the title, authors, affiliations and abstract were all in one column, and appeared on the page in that order. Table 5.2 summarizes the scores for the 295 images in this set. Overall, of the 1,180 possible zones of interest, the zone correction program generated 1,155 correct zones, for a correct rate of 97.9%.
| Field | Split | Too big | Too small | Merged | Totals | % images with an error in this field |
|---|---|---|---|---|---|---|
| Title | 7 | | | | 7 | 2.4 |
| Author | 1 | | | 4 | 5 | 1.7 |
| Affiliation | 4 | | | 5 | 9 | 3.1 |
| Abstract | 3 | | | 1 | 4 | 1.4 |
| Totals | 15 | 0 | 0 | 10 | 25 | |
| % images with this error | 5.1 | 0 | 0 | 3.4 | | |
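The figures in Table 5.2 can be cross-checked with a few lines of arithmetic. The sketch below is only a consistency check, not part of the MARS-2 software. It assumes that blank cells mean zero, that the seven Title errors are split errors (which the column totals imply), and that each error falls on a different image, as the percentage columns suggest.

```python
# Error counts transcribed from Table 5.2; blank cells are treated as 0.
IMAGES = 295
FIELDS = ("Title", "Author", "Affiliation", "Abstract")
errors = {
    "Title": {"split": 7, "merged": 0},
    "Author": {"split": 1, "merged": 4},
    "Affiliation": {"split": 4, "merged": 5},
    "Abstract": {"split": 3, "merged": 1},
}

# One zone of interest per field per image.
possible_zones = IMAGES * len(FIELDS)
total_errors = sum(sum(row.values()) for row in errors.values())
correct_rate = 100 * (possible_zones - total_errors) / possible_zones


def pct_of_images(count: int) -> float:
    """Percentage of the 295 images, rounded to one decimal place."""
    return round(100 * count / IMAGES, 1)


# Row percentages (per field) and column percentages (per error type).
field_pct = {f: pct_of_images(sum(errors[f].values())) for f in FIELDS}
type_pct = {
    t: pct_of_images(sum(errors[f][t] for f in FIELDS))
    for t in ("split", "merged")
}

print(possible_zones, total_errors, round(correct_rate, 1))
print(field_pct)
print(type_pct)
```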
5.3 Implementation
Based on the low error rates achieved in testing, the automatic zone correction algorithm was implemented for the MARS-2 system. A C++ zone correction class was written in the Microsoft Visual Studio development environment. The class is incorporated into the ZoneCzar module, which also includes the automated labeling function described in Section 6.
5.4 Performance in production
The original zone correction algorithm has continued to evolve in response to feedback from production operators and to observations from continued internal evaluation. As more journal layout types are added to those processed by MARS-2, code to accommodate new circumstances has been added to the algorithm, but the overall design has not changed. For example, to correctly zone pages in which affiliations are found at or near the bottom of the page, usually in small fonts, the algorithm computes different threshold values for lines and zones that begin in the bottom third of the page than for those elsewhere on the page. Although performance has remained consistently good, we anticipate challenges as we increase the number of journal titles and layout types accommodated by MARS in the future.
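The position-dependent thresholds mentioned above might be selected along the following lines. This is a minimal sketch only: the function name, the two threshold parameters, and the convention that y coordinates grow downward from the top of the page are all assumptions, and the text does not say how the actual threshold values are computed.

```python
def threshold_for(zone_top: float, page_height: float,
                  upper_threshold: float, bottom_threshold: float) -> float:
    """Pick a threshold according to where a line or zone begins.

    Assumes image coordinates: y = 0 at the top of the page, growing downward.
    Both threshold values are hypothetical inputs supplied by the caller.
    """
    if page_height <= 0:
        raise ValueError("page_height must be positive")
    begins_in_bottom_third = zone_top >= page_height * 2 / 3
    return bottom_threshold if begins_in_bottom_third else upper_threshold
```

For a page 3,300 pixels high, a zone starting at y = 2,500 would receive the bottom-of-page threshold, while a zone starting at y = 400 would receive the threshold used for the rest of the page.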
6. Automated labeling
Once the contiguous text regions in a bitmapped page image are zoned, the next step is to label the zones, i.e., identify each zone as one of the bibliographic fields of interest. The figure below shows the sequence of steps: the bitmapped TIFF image of the scanned page, the output of the automated zoning module (AZ) and the output of the automated labeling module (AL).
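[Figure: (a) bitmapped TIFF image of the scanned page; (b) output of the automated zoning module (AZ); (c) output of the automated labeling module (AL)]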
Image analysis techniques for document labeling proposed in the literature [33-37] are based mostly on the layout (geometric) structure and/or the logical structure of a document. Hones et al. [33] describe an algorithm for layout extraction of mixed-mode documents, and the classification of these documents as text or non-text. Taylor et al. [34] describe a prototype system using a feature extraction and model-based approach. Tsujimoto et al. [35] present a rule-based technique based on the transformation from a geometric structure to a logical structure. Tateisi et al. [36] propose a method based on stochastic syntactic analysis to extract the logical structure of a printed document. They use simple rules to label documents into three classes. Niyogi et al. [37] use a rule-based system to assign newspaper contents one of thirteen labels, such as headline, text paragraph, and photograph. These labeling techniques rely mostly on rule-based algorithms, but other mechanisms such as artificial neural networks (ANN) and decision trees have also been investigated.
One drawback of ANN and decision tree methods is that they need training as a pre-processing stage. That is, the algorithms need to be re-trained whenever a new document (in our case, a journal layout not seen previously) is encountered, and the training time is proportional to the number of journal titles to be processed. Not only is this time consuming, it also makes it difficult to handle exceptional situations quickly. In addition, these techniques cannot readily use geometric information, e.g., the spatial relationships between zones. Rule-based algorithms, on the other hand, do not need re-training, can employ geometric information readily, and can accommodate exceptional cases (slight divergence from a known layout type) by the addition of new rules. Since the 4,300+ journal titles indexed in MEDLINE exhibit a wide range of layout types, such exceptional cases can occur frequently. An automated labeling system needs to handle a multiplicity of layout types and exceptional cases quickly, and without extensive pre-processing and training.
Our research in this area focused on three approaches: the rule-based algorithmic approach, an ANN method, and a template-matching technique. Our experiments and findings are reported in the literature [38-40]. Based on these experiments, we decided to base our labeling system on rule-based algorithms, since this approach delivered a high accuracy rate and high speed of execution, and was amenable to modification as new layout types were added.
Our approach relies on data from the OCR system, which delivers information at the zone, line and character level:

| Level | Information delivered |
|---|---|
| Zone level | Zone boundaries, number of text lines |
| Line level | Line boundaries, number of characters, average character height |
| Character level | 8-bit character code, confidence level (1 = lowest, 9 = highest), bounding box, font size, font attribute (normal, bold, underlined, italics, superscript, subscript, and fixed pitch) |
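The three levels of OCR output listed above could be modeled with a few small record types. The sketch below is an illustration only: the class names, field names, and the `(left, top, right, bottom)` box ordering are assumptions and do not describe the actual OCR system's data format.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

# Assumed box ordering: (left, top, right, bottom), in pixels.
Box = Tuple[int, int, int, int]

FONT_ATTRIBUTES = ("normal", "bold", "underlined", "italics",
                   "superscript", "subscript", "fixed pitch")


@dataclass
class OcrChar:
    code: int                      # 8-bit character code
    confidence: int                # 1 = lowest, 9 = highest
    box: Box
    font_size: int
    font_attribute: str = "normal"

    def __post_init__(self) -> None:
        if not 0 <= self.code <= 255:
            raise ValueError("character code must fit in 8 bits")
        if not 1 <= self.confidence <= 9:
            raise ValueError("confidence must be between 1 and 9")
        if self.font_attribute not in FONT_ATTRIBUTES:
            raise ValueError("unknown font attribute")


@dataclass
class OcrLine:
    box: Box
    chars: List[OcrChar] = field(default_factory=list)

    @property
    def num_chars(self) -> int:
        return len(self.chars)

    @property
    def avg_char_height(self) -> float:
        if not self.chars:
            return 0.0
        return sum(c.box[3] - c.box[1] for c in self.chars) / len(self.chars)


@dataclass
class OcrZone:
    box: Box
    lines: List[OcrLine] = field(default_factory=list)

    @property
    def num_lines(self) -> int:
        return len(self.lines)
```

Deriving the line and zone counts from the nested lists, rather than storing them separately, keeps the three levels consistent with one another.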
The OCR output data is used to generate geometric and non-geometric features that, in turn, are used to create rules. Geometric features are based on a zone's location, order of appearance, and dimensions. For example, the article title zone is usually located in the top half of the page, followed by author, affiliation and abstract, in that order.
Non-geometric features are derived from the text contents of a zone, aggregate statistics, and font characteristics. Some zones can be characterized by the words in them and the frequency with which those words occur. In such cases, word matching is an important technique for generating non-geometric features in the AL module. For example, a zone has a higher probability of being labeled as "affiliation" when it contains words representing country, city and school names. Also, a zone positioned between the words "abstract" and "keywords" is more likely to be an abstract than any other bibliographic field. Fifteen database tables containing word lists have been assembled as shown in Table 6.1. Table 6.2 shows examples of geometric and non-geometric features.
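As a rough illustration of word matching, the sketch below counts how many words in a zone appear in a word list and expresses that as a percentage of the zone's words, in the spirit of features such as NumberOfAffiliation and PercentOfAffiliation. The function names, the tokenization rule, and the three-word sample list (taken from the Affiliation examples in Table 6.1) are assumptions; a plain Python set stands in for the search structure used in the real module.

```python
import re
from typing import Dict, Iterable, List

# Sample entries only; the real Affiliation table is much larger.
AFFILIATION_WORDS = {"university", "department", "institute"}


def tokenize(text: str) -> List[str]:
    """Lower-case the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def word_match_features(zone_text: str, word_list: Iterable[str]) -> Dict[str, float]:
    """Count word-list hits in a zone and the share of words that are hits."""
    vocabulary = set(word_list)
    words = tokenize(zone_text)
    hits = sum(1 for w in words if w in vocabulary)
    percent = 100.0 * hits / len(words) if words else 0.0
    return {"count": hits, "percent": percent}


features = word_match_features(
    "Department of Medicine, University of Maryland", AFFILIATION_WORDS
)
print(features)
```

A zone with a high count or percentage for the affiliation list is a stronger affiliation candidate than one with none, which is how such features can feed the labeling rules.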
Word matching relies on search data structures such as hash tables, binary search trees, digital search trees, and ternary search trees. We chose the ternary search tree because it combines the time efficiency of digital search trees with the space efficiency of binary search trees, and because it supports advanced searches such as partial-match and near-neighbor search. Proposed by Bentley and Sedgewick in 1997, this technique has been used for several years for searching English dictionaries in a commercial OCR system built at Bell Labs [56].
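For readers unfamiliar with the structure, here is a compact ternary search tree supporting insertion, exact lookup, and partial-match search with a single-character wildcard. It is a generic textbook-style sketch, not the MARS-2 code: the class and method names and the choice of `.` as the wildcard are assumptions, and near-neighbor search is omitted for brevity.

```python
class _Node:
    __slots__ = ("ch", "lo", "eq", "hi", "is_end")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None
        self.is_end = False


class TernarySearchTree:
    def __init__(self, words=()):
        self._root = None
        for word in words:
            self.insert(word)

    def insert(self, word):
        if word:
            self._root = self._insert(self._root, word, 0)

    def _insert(self, node, word, i):
        ch = word[i]
        if node is None:
            node = _Node(ch)
        if ch < node.ch:
            node.lo = self._insert(node.lo, word, i)
        elif ch > node.ch:
            node.hi = self._insert(node.hi, word, i)
        elif i + 1 < len(word):
            node.eq = self._insert(node.eq, word, i + 1)
        else:
            node.is_end = True
        return node

    def __contains__(self, word):
        if not word:
            return False
        node, i = self._root, 0
        while node is not None:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(word):
                    return node.is_end
                node = node.eq
        return False

    def partial_match(self, pattern, wildcard="."):
        """Return sorted words matching pattern; wildcard matches any one character."""
        found = []
        self._partial(self._root, pattern, 0, "", found, wildcard)
        return sorted(found)

    def _partial(self, node, pattern, i, prefix, found, wildcard):
        if node is None or i >= len(pattern):
            return
        ch = pattern[i]
        if ch == wildcard or ch < node.ch:
            self._partial(node.lo, pattern, i, prefix, found, wildcard)
        if ch == wildcard or ch == node.ch:
            if i + 1 == len(pattern):
                if node.is_end:
                    found.append(prefix + node.ch)
            else:
                self._partial(node.eq, pattern, i + 1, prefix + node.ch, found, wildcard)
        if ch == wildcard or ch > node.ch:
            self._partial(node.hi, pattern, i, prefix, found, wildcard)


tree = TernarySearchTree(["university", "department", "institute", "mail", "fax", "tel"])
print("fax" in tree, "fa" in tree, tree.partial_match(".a."))
```

Each node stores one character and three links (lower, equal, higher), so a lookup compares one character at a time as a digital search tree does, while using only three pointers per node.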
| Table Name | Words in the Table |
|---|---|
| Rubric | Review, Original Article, etc. |
| KeyOfTitle | Study, case, method, etc. |
| Author | Smith, John, Kim, etc. |
| AcademicDegree | Ph.D., MD, RN, etc. |
| Affiliation | University, Department, Institute, etc. |
| Abstract | Abstract, Summary, Background, etc. |
| Structured Abstract | Aim, Result, Conclusion, etc. |
| Keyword | Keyword, Index word, etc. |
| Received | Received, Revised, Accepted, etc. |
| Introduction | Introduction, Introduzione, etc. |
| ExtraDataInAffiliation | Corresponding, Address, To whom, etc. |
| ExtraDataInLowerAffiliation | Mail, fax, tel, etc. |
| Date | January, February, 2000, etc. |
| Publisher | Elsevier, John Wiley, etc. |
| JournalName | Diabetes, endocrinology, etc. |
| Zone Features | Variable Names |
|---|---|
| Geometric Features: | |
| Zone coordinates | TopCoordinate, BottomCoordinate, LeftCoordinate, RightCoordinate |
| Zone height and width | HeightOfZone, LengthOfZone |
| Median value of height, length and space of lines | MedianLineHeight, MedianLineLength, MedianLineSpace |
| Difference between the bottom and top coordinates of the bottom-most and top-most zone | HeightOfArticle |
| Zone order in sequence of top left edge | ZoneOrder |
| Non-Geometric Features: | |
| Biggest and smallest font sizes in an article | MaximumFontSize, MinimumFontSize |
| Number of text lines | NumberOfLine |
| Number of characters and words | NumberOfCharacter, NumberOfWord |
| Number of capital characters | NumberOfCapitalCharacter |
| Dominant font attribute and font size | FontAttribute, FontSize |
| Confidence of characters | Confidence |
| Number of "M.D.", "Ph.D.", "RN", etc. | NumberOfDegree |
| Number of middle names, "Jr", "Sr", "II", etc. | NumberOfMiddleName |
| Number of city, state, country, school, etc. | NumberOfAffiliation |
| Number of "abstract", "summary", etc. | NumberOfAbstract |
| Number of "keywords", "index words", etc. | NumberOfKeyword |
| Number of "review", "article", etc. | NumberOfHeadtitle |
| Number of "received", "accepted", etc. | NumberOfReceived |
| Percentage of academic degrees per word | PercentOfAcademicDegree |
| Percentage of middle names per word | PercentOfMiddleName |
| Percentage of affiliations per word | PercentOfAffiliation |
| Percentage of capital characters per zone | PercentOfCapitalCharacter |
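Several of the geometric features in Table 6.2 follow directly from zone and line bounding boxes. The sketch below shows one plausible way to derive them. The `(left, top, right, bottom)` box ordering, the function names, and the reading of HeightOfArticle as "largest bottom coordinate minus smallest top coordinate" are assumptions.

```python
from statistics import median


def zone_geometry(zone_box, line_boxes):
    """Geometric features of one zone from its box and its line boxes."""
    left, top, right, bottom = zone_box
    ordered = sorted(line_boxes, key=lambda box: box[1])  # top to bottom
    heights = [b - t for (_, t, _, b) in ordered]
    lengths = [r - l for (l, _, r, _) in ordered]
    # Vertical gap between the bottom of one line and the top of the next.
    spaces = [nxt[1] - cur[3] for cur, nxt in zip(ordered, ordered[1:])]
    return {
        "TopCoordinate": top,
        "BottomCoordinate": bottom,
        "LeftCoordinate": left,
        "RightCoordinate": right,
        "HeightOfZone": bottom - top,
        "LengthOfZone": right - left,
        "MedianLineHeight": median(heights) if heights else 0,
        "MedianLineLength": median(lengths) if lengths else 0,
        "MedianLineSpace": median(spaces) if spaces else 0,
    }


def zone_order(zone_boxes):
    """1-based rank of each zone by its top-left corner (top first, then left)."""
    ranked = sorted(range(len(zone_boxes)),
                    key=lambda i: (zone_boxes[i][1], zone_boxes[i][0]))
    order = [0] * len(zone_boxes)
    for rank, index in enumerate(ranked, start=1):
        order[index] = rank
    return order


def height_of_article(zone_boxes):
    """Bottom of the bottom-most zone minus top of the top-most zone."""
    return max(box[3] for box in zone_boxes) - min(box[1] for box in zone_boxes)


zones = [(100, 300, 500, 400), (100, 50, 500, 110), (100, 150, 500, 200)]
print(zone_order(zones), height_of_article(zones))
```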
6.1 Definition of layout types
As noted, the MEDLINE database contains bibliographic records from over 4,300 journals. The physical layout of the first page of articles in these journals, and the order in which the five important zones (title, author, upper affiliation, lower affiliation, and abstract) appear on the first page may be used to categorize the zone labeling type for a given journal. Figure 6.2 shows examples of common layout types consisting of a single column, or a combination of single and multiple columns. The numbers in the gray blocks indicate block numbers to help with the definitions of the more common zone labeling types described in Table 6.3.
Figure 6.2 Examples of common journal layout types. (a) Layout type 1; (b) Layout type 11; (c) Layout type 12; (d) Layout type 121; (e) Layout type 122.
The five important zones frequently appear in "first regular" or "second regular" zone order. In the "first regular" zone order, the title is near the top of the page, followed by author, affiliation in the upper part of the page (upper affiliation), and abstract. In the "second regular" zone order, the title is followed by author and abstract, with the affiliation appearing in the lower part of the page.
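The two zone orders can be expressed as label sequences applied to zones sorted from top to bottom. The sketch below is deliberately naive: it assumes the page contains exactly the four zones of interest and nothing else, whereas the actual AL module combines many geometric and non-geometric features. The function and constant names are hypothetical.

```python
FIRST_REGULAR = ("title", "author", "upper affiliation", "abstract")
SECOND_REGULAR = ("title", "author", "abstract", "lower affiliation")


def label_by_order(zone_tops, zone_order="first regular"):
    """Assign labels to zones by their vertical position on the page.

    zone_tops holds the top coordinate of each zone (y grows downward).
    Returns one label per zone, in the same order as the input.
    """
    if zone_order == "first regular":
        sequence = FIRST_REGULAR
    elif zone_order == "second regular":
        sequence = SECOND_REGULAR
    else:
        raise ValueError("unknown zone order: " + zone_order)
    ranked = sorted(range(len(zone_tops)), key=lambda i: zone_tops[i])
    labels = [None] * len(zone_tops)
    for label, index in zip(sequence, ranked):
        labels[index] = label
    return labels


print(label_by_order([300, 50, 150, 200], "second regular"))
```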
The zone labeling type for each journal is determined by the journal layout type and the zone order. For example, if the journal pages are of layout type 121 [Figure 6.2(d)] and the affiliation appears in block 4 (second regular), the zone labeling type is defined as Type 12006. Other labeling types are described in Table 6.3.
| Zone Labeling Type | Includes Layout Type(s) | Zone order(s) | Description |
|---|---|---|---|
| Type 10000 | 1, 11, 12, 121, 122 | First regular | Title, author, upper affiliation, and abstract are in block 1. |
| Type 10006 | 11 | Second regular | Title, author, and abstract are in block 1. Lower affiliation is in block 2. |
| Type 10006 | 121 | Second regular | Title, author, and abstract are in block 1. Lower affiliation is in block 4. |
| Type 12000 | 12, 121 | First regular | Title, author, and upper affiliation are in block 1. Abstract is in block 2, and may extend into block 3. |
| Type 12000 | 122 | First regular | Title, author, and upper affiliation are in block 1. Abstract is in block 2. |
| Type 12006 | 121 | Second regular | Title and author are in block 1. Lower affiliation is in block 4. Abstract is in block 2, and may extend into block 3. |
| Type 12200 | 122 | First regular | Title, author, and upper affiliation are in block 1. Abstract is in blocks 2 and 3. |
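Table 6.3 can be read as a lookup from layout type and zone order to candidate zone labeling types. The sketch below encodes the table as data and filters it. The record layout and the function name are assumptions, and the block in which the abstract starts is used as an optional extra filter because layout type and zone order alone do not always identify a single labeling type.

```python
# (labeling type, layout types covered, zone order, block where the abstract starts)
LABELING_TYPES = [
    ("10000", {"1", "11", "12", "121", "122"}, "first regular", 1),
    ("10006", {"11", "121"}, "second regular", 1),
    ("12000", {"12", "121", "122"}, "first regular", 2),
    ("12006", {"121"}, "second regular", 2),
    ("12200", {"122"}, "first regular", 2),
]


def candidate_labeling_types(layout_type, zone_order, abstract_block=None):
    """Return the labeling types from Table 6.3 compatible with the inputs."""
    return [
        name
        for name, layouts, order, block in LABELING_TYPES
        if layout_type in layouts
        and order == zone_order
        and (abstract_block is None or block == abstract_block)
    ]


# Layout type 121, second regular order, abstract starting in block 2.
print(candidate_labeling_types("121", "second regular", abstract_block=2))
```

Types 12000 and 12200 both start the abstract in block 2 for layout type 122, so telling them apart needs the further detail in the Description column (whether the abstract occupies block 3 as well).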
6.2 Structure of AL module
Figure 6.3 shows the structure of the AL module and its interaction with the MARS database, whose tables contain information on every journal title (identified by ISSN). This information includes layout type, physical size, affiliation location, abstract type, feature type, and feature value. After page images from a particular journal issue are processed by the AZ module, and the journal title (ISSN) is found in the JournalName table, the AL module retrieves all the relevant information from this table and activates the AL algorithm associated with the zone labeling type. The output of the AL module, the identification of the page zones, is written to the LabelRanking table in the database for further downstream processing.
Figure 6.3 Structure of automated labeling module
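The control flow described in this section (look up the journal by ISSN, select the algorithm for its zone labeling type, write the labels out) can be sketched with in-memory stand-ins for the database tables. Everything named here is hypothetical: the dictionary and list replace the JournalName and LabelRanking tables, the ISSN is a placeholder, and the single labeling function is a trivial stub rather than a real AL algorithm.

```python
from typing import Callable, Dict, List

Labeler = Callable[[List[str], dict], Dict[str, str]]


def label_first_regular(zone_ids: List[str], journal: dict) -> Dict[str, str]:
    """Stub algorithm: label zones in the order given (assumed top to bottom)."""
    sequence = ("title", "author", "affiliation", "abstract")
    return dict(zip(zone_ids, sequence))


def run_al_module(issn: str, zone_ids: List[str],
                  journal_table: Dict[str, dict],
                  algorithms: Dict[str, Labeler],
                  label_ranking: List[dict]) -> Dict[str, str]:
    """Look up the journal, run the matching algorithm, record the labels."""
    journal = journal_table[issn]  # KeyError if the journal is unknown
    algorithm = algorithms[journal["labeling_type"]]
    labels = algorithm(zone_ids, journal)
    for zone_id, label in labels.items():
        label_ranking.append({"issn": issn, "zone": zone_id, "label": label})
    return labels


# Placeholder data standing in for the database tables.
journal_table = {"0000-0000": {"labeling_type": "10000", "layout_type": "1"}}
algorithms = {"10000": label_first_regular}
label_ranking: List[dict] = []

labels = run_al_module("0000-0000", ["z1", "z2", "z3", "z4"],
                       journal_table, algorithms, label_ranking)
print(labels)
```

Keying the algorithms by zone labeling type means a new layout can be supported by registering one more function, without changing the dispatch code.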