Automating the production of bibliographic records for MEDLINE

5.2 Evaluation of automated zoning

Following initial testing and refinement, the zoning algorithm was tested with a set of page images from 59 journal issues that would become the first set of journals to be processed by the MARS-2 system. Journals selected had a page layout in which the title, authors, affiliations and abstract were all in one column, and appeared on the page in that order. Table 5.2 summarizes the scores for the 295 images in this set. Overall, of the 1,180 possible zones of interest, the zone correction program generated 1,155 correct zones, for a correct rate of 97.9%.
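As a quick check of the arithmetic above, the following sketch recomputes the totals from the figures quoted in the text. The assumption that each image contributes four zones of interest (title, authors, affiliations, abstract) is inferred from 1,180 / 295 and is not stated explicitly in the text.

```python
# Figures quoted in the text.
IMAGES = 295
CORRECT_ZONES = 1155

# Assumption: four zones of interest per image
# (title, authors, affiliations, abstract).
ZONES_PER_IMAGE = 4

possible_zones = IMAGES * ZONES_PER_IMAGE
correct_rate = 100.0 * CORRECT_ZONES / possible_zones
error_rate = 100.0 - correct_rate

print(f"possible zones: {possible_zones}")
print(f"correct rate:   {correct_rate:.1f}%")
print(f"error rate:     {error_rate:.1f}%")
```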
5.3 Implementation

Based on the low error rates achieved in testing, the automatic zone correction algorithm was implemented for the MARS-2 system. A C++ zone correction class was written in the Microsoft Visual Studio development environment. The class is incorporated into the ZoneCzar module, which also includes the automated labeling function described in Section 6.

5.4 Performance in production

The original zone correction algorithm has continued to evolve in response to feedback from production operators and to observations from continued internal evaluation. As more journal layout types are added to those processed by MARS-2, code to accommodate new circumstances has been added to the algorithm, but the overall design has not changed. For example, to correctly zone pages in which affiliations are found at or near the bottom of the page, usually in small fonts, the threshold values computed for lines and zones that begin in the bottom third of the page differ from those used for the rest of the page. Although performance has remained consistently good, we anticipate challenges as we increase the number of journal titles and layout types accommodated by MARS in the future.

6. Automated labeling

Once the contiguous text regions in a bitmapped page image are zoned, the next step is to label the zones, i.e., identify each zone as one of the bibliographic fields of interest. The figure below shows the sequence of steps: the bitmapped TIFF image of the scanned page, the output of the automated zoning module (AZ) and the output of the automated labeling module (AL).
Image analysis techniques for document labeling proposed in the literature [33-37] are based mostly on the layout (geometric) structure and/or the logical structure of a document. Hones et al. [33] describe an algorithm for layout extraction of mixed-mode documents, and the classification of these documents as text or non-text. Taylor et al. [34] describe a prototype system using a feature extraction and model-based approach. Tsujimoto et al. [35] present a rule-based technique based on the transformation from a geometric structure to a logical structure. Tateisi et al. [36] propose a method based on stochastic syntactic analysis to extract the logical structure of a printed document. They use simple rules to label documents into three classes. Niyogi et al. [37] use a rule-based system to classify newspaper contents under thirteen labels such as headline, text paragraph, and photograph.

These labeling techniques rely mostly on rule-based algorithms, but other mechanisms such as artificial neural networks (ANN) and decision trees have also been investigated. One drawback of ANN and decision tree methods is that they need training as a pre-processing stage. That is, the algorithms need to be re-trained whenever a new document type (in our case, a journal layout not seen previously) is encountered, and the training time is proportional to the number of journal titles to be processed. Not only is this time consuming, it also makes it difficult to handle exceptional situations quickly. In addition, these techniques cannot readily use geometric information, e.g., the geometry between zones. Rule-based algorithms, on the other hand, do not need re-training, can employ geometric information readily, and can accommodate exceptional cases (slight divergence from a known layout type) by the addition of new rules. Since the 4,300+ journal titles indexed in MEDLINE exhibit a wide range of layout types, such exceptional cases can occur frequently.
An automated labeling system needs to handle a multiplicity of layout types and exceptional cases quickly, and without extensive pre-processing and training. Our research in this area focused on three approaches: the rule-based algorithmic approach, an ANN method, and a template-matching technique. Our experiments and findings are reported in the literature [38-40]. Based on these experiments, we decided to build our labeling system on rule-based algorithms, since this approach delivered a high accuracy rate and high speed of execution, and was amenable to modification as new layout types were added. Our approach relies on data from the OCR system, which delivers information at the zone, line and character level.
The OCR output data is used to generate geometric and non-geometric features that, in turn, are used to create rules. Geometric features are based on a zone's location, order of appearance, and dimensions. For example, the article title zone is usually located in the top half of the page, followed by author, affiliation and abstract, in that order. Non-geometric features are derived from the text contents of a zone, aggregate statistics, and font characteristics. For example, some zones can be characterized by the words in them, and the frequency with which they occur. In such cases, word matching is an important technique to generate non-geometric features in the AL module. For example, a zone has a higher probability of being labeled as "affiliation" when it contains words representing country, city and school names. Also, a zone positioned between the words "abstract" and "keywords" is more likely to be an abstract than any other bibliographic field. Fifteen database tables containing word lists have been assembled as shown in Table 6.1. Table 6.2 shows examples of geometric and non-geometric features.

Word matching relies on search data structures such as hash tables, binary search trees, digital search trees, and ternary search trees. We chose the ternary search tree because it combines the time efficiency of the digital search tree with the space efficiency of the binary search tree, and because it supports advanced searches such as partial-match and near-neighbor search. Proposed by Bentley and Sedgewick in 1997, this technique has been used for several years for searching English dictionaries in a commercial OCR system built at Bell Labs [56].
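The word-matching idea can be illustrated with a minimal ternary search tree. This is a sketch only, not the MARS implementation: the class and function names, the sample word list, the sample zone text, and the `word_match_ratio` feature are all hypothetical, and near-neighbor search is omitted. It shows exact lookup, partial-match search with a single-character wildcard, and one possible way to turn word matches into a non-geometric feature.

```python
class _Node:
    """One character of a stored word, with three child links."""
    __slots__ = ("ch", "lo", "eq", "hi", "is_word")

    def __init__(self, ch):
        self.ch = ch
        self.lo = self.eq = self.hi = None   # smaller char / next char / larger char
        self.is_word = False                 # True if a stored word ends at this node


class TernarySearchTree:
    def __init__(self, words=()):
        self._root = None
        for word in words:
            self.insert(word)

    def insert(self, word):
        word = word.lower()
        if not word:
            return
        if self._root is None:
            self._root = _Node(word[0])
        node, i = self._root, 0
        while True:
            ch = word[i]
            if ch < node.ch:                 # branch left, same character position
                if node.lo is None:
                    node.lo = _Node(ch)
                node = node.lo
            elif ch > node.ch:               # branch right, same character position
                if node.hi is None:
                    node.hi = _Node(ch)
                node = node.hi
            else:                            # character matched, advance in the word
                i += 1
                if i == len(word):
                    node.is_word = True
                    return
                if node.eq is None:
                    node.eq = _Node(word[i])
                node = node.eq

    def __contains__(self, word):
        word = word.lower()
        node, i = self._root, 0
        while node is not None and word:
            ch = word[i]
            if ch < node.ch:
                node = node.lo
            elif ch > node.ch:
                node = node.hi
            else:
                i += 1
                if i == len(word):
                    return node.is_word
                node = node.eq
        return False

    def partial_match(self, pattern, wildcard="."):
        """Return stored words matching pattern; wildcard matches any one character."""
        pattern, results = pattern.lower(), []

        def walk(node, i, prefix):
            if node is None:
                return
            ch = pattern[i]
            if ch == wildcard or ch < node.ch:
                walk(node.lo, i, prefix)
            if ch == wildcard or ch == node.ch:
                if i + 1 == len(pattern):
                    if node.is_word:
                        results.append(prefix + node.ch)
                else:
                    walk(node.eq, i + 1, prefix + node.ch)
            if ch == wildcard or ch > node.ch:
                walk(node.hi, i, prefix)

        if pattern:
            walk(self._root, 0, "")
        return results


def word_match_ratio(zone_text, word_list):
    """Fraction of the zone's words found in word_list (a non-geometric feature)."""
    words = [w.strip(".,;:()").lower() for w in zone_text.split()]
    words = [w for w in words if w]
    if not words:
        return 0.0
    return sum(w in word_list for w in words) / len(words)


# Hypothetical word list and zone text, for illustration only.
affiliation_words = TernarySearchTree(
    ["university", "department", "institute", "hospital", "usa", "japan"])
zone_text = "Department of Radiology, University Hospital, Bethesda, USA"
ratio = word_match_ratio(zone_text, affiliation_words)
print(f"affiliation word-match ratio: {ratio:.2f}")
print(affiliation_words.partial_match("..a"))
```

Each node stores a single character and three links, which is where the space saving over a digital search tree with one link per alphabet symbol comes from. A rule could then compare a ratio such as the one above against a threshold, although how the actual rules weigh such features is not described here.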
6.1 Definition of layout types

As noted, the MEDLINE database contains bibliographic records from over 4,300 journals. The physical layout of the first page of articles in these journals, and the order in which the five important zones (title, author, upper affiliation, lower affiliation, and abstract) appear on the first page may be used to categorize the zone labeling type for a given journal. Figure 6.2 shows examples of common layout types consisting of a single column, or a combination of single and multiple columns. The numbers in the gray blocks indicate block numbers to help with the definitions of the more common zone labeling types described in Table 6.3.

Figure 6.2 Examples of common journal layout types. (a) Layout type 1; (b) Layout type 11; (c) Layout type 12; (d) Layout type 121; (e) Layout type 122.

The five important zones frequently appear in "first regular" or "second regular" zone order. In the "first regular" zone order, the title is near the top of the page, followed by author, affiliation in the upper part of the page (upper affiliation), and abstract. In the "second regular" zone order, the title is followed by author and abstract, with the affiliation appearing in the lower part of the page. The zone labeling type for each journal is determined by the journal layout type and the zone order. For example, if the journal pages are of layout type 121 [Figure 6.2(d)] and the affiliation appears in block 4 (second regular), the zone labeling type is defined as Type 12006. Other labeling types are described in Table 6.3.
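The relationship between layout type, zone order, and zone labeling type can be sketched as a small lookup. The sketch assumes a sample first page whose zones have already been identified by hand, and it simplifies the definition by keying on zone order rather than on block numbers. The `Zone` class, the function names, and the pixel coordinates are hypothetical, and only the single pairing stated in the text (layout type 121, second regular, Type 12006) is filled in, since Table 6.3 is not reproduced here.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    label: str   # "title", "author", "affiliation", or "abstract"
    top: int     # y-coordinate of the zone's upper edge, in pixels from the page top


def zone_order(zones):
    """Classify the top-to-bottom order of the zones on a sample first page."""
    ordered = [z.label for z in sorted(zones, key=lambda z: z.top)]
    if ordered == ["title", "author", "affiliation", "abstract"]:
        return "first regular"
    if ordered == ["title", "author", "abstract", "affiliation"]:
        return "second regular"
    return "other"


# Only the pairing stated in the text is known; other entries are left out.
LABELING_TYPES = {("121", "second regular"): "12006"}


def labeling_type(layout_type, zones):
    """Return the zone labeling type, or None if the pairing is not in the lookup."""
    return LABELING_TYPES.get((layout_type, zone_order(zones)))


# Hypothetical sample page: affiliation near the bottom of the page.
sample = [
    Zone("title", 200),
    Zone("author", 420),
    Zone("abstract", 600),
    Zone("affiliation", 2900),
]
print(zone_order(sample))
print(labeling_type("121", sample))
```

Sorting by the top edge alone is a simplification that ignores columns; a multi-column layout would need the block numbers shown in Figure 6.2.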
6.2 Structure of AL module

Figure 6.3 shows the structure of the AL module and its interaction with the MARS database, whose tables contain information on every journal title (ISSN). This information includes layout type, physical size, affiliation location, abstract type, feature type, and feature value. After page images from a particular journal issue are processed by the AZ module, and the journal title (ISSN) is matched against the JournalName table, the AL module retrieves all the relevant information from this table, and activates an AL algorithm related to the zone labeling type. The output of the AL module, the identification of the page zones, is written to the LabelRanking table in the database for further downstream processing.

Figure 6.3 Structure of automated labeling module
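The control flow described above (look up the journal by ISSN, select an algorithm by zone labeling type, write the results) can be sketched as a dispatch table. Everything except the table names JournalName and LabelRanking is an assumption made for illustration: the in-memory dictionaries stand in for database tables, the field names and the ISSN are placeholders, and the labeling function is a trivial stand-in for the actual rule set.

```python
# Stand-in for the JournalName table; the ISSN and field values are placeholders.
JOURNAL_NAME = {
    "0000-0000": {
        "layout_type": "121",
        "labeling_type": "12006",
        "affiliation_location": "lower",
    },
}

# Stand-in for the LabelRanking table: one row per labeled zone.
LABEL_RANKING = []


def label_second_regular(zones):
    """Placeholder rule: assign labels purely by top-to-bottom order."""
    names = ["title", "author", "abstract", "affiliation"]
    ordered = sorted(zones, key=lambda z: z["top"])
    return [(z["zone_id"], name) for z, name in zip(ordered, names)]


# One labeling algorithm per zone labeling type (only one is sketched here).
AL_ALGORITHMS = {"12006": label_second_regular}


def run_al(issn, page_id, zones):
    """Look up the journal, dispatch on its labeling type, and store the labels."""
    journal = JOURNAL_NAME.get(issn)
    if journal is None:
        raise KeyError(f"ISSN {issn} not found in JournalName")
    algorithm = AL_ALGORITHMS[journal["labeling_type"]]
    rows = [
        {"page_id": page_id, "zone_id": zone_id, "label": label}
        for zone_id, label in algorithm(zones)
    ]
    LABEL_RANKING.extend(rows)
    return rows


# Hypothetical zones from the AZ module: an id and the top edge in pixels.
az_zones = [
    {"zone_id": 3, "top": 2900},
    {"zone_id": 0, "top": 200},
    {"zone_id": 2, "top": 600},
    {"zone_id": 1, "top": 420},
]
for row in run_al("0000-0000", page_id=1, zones=az_zones):
    print(row)
```

Keeping the per-type algorithms in a table keyed by labeling type mirrors the idea that new layout types are handled by adding rules rather than by re-training.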
URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_6.php