| Skip navigation |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| |
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
|
|
Document Image Analysis and Understanding R&D
October 2001 Contents
1. Background Research in document image analysis and understanding (DIAU) has been part of the Communications Engineering Branch's longstanding involvement in document and biomedical imaging from many different standpoints: image capture, storage and retrieval, lossy and lossless compression, image enhancement and other types of image processing. In particular, research in document imaging has been applied to the design and development of prototype systems to serve as testbeds for investigations into electronic archiving and preservation of library materials,1-4 automated interlibrary loan systems,5,6 on-demand document delivery,7,8 and Internet-enabled document delivery and management for the end user.9-12 In addition, we have engaged in several years of related and relevant R&D in document image analysis and understanding.13-19 The aim of document image analysis and understanding is to automatically recognize and extract textual or graphical material from digitized documents as closely as possible to the results of human action. This active research area includes work in table extraction and understanding,20-22 separating textual and graphical regions,23,24 extracting bar charts,25 electrical circuit symbology from circuit diagrams,26 symbols and text labels from geographic maps,27 symbols and connections from engineering drawings,28 zip codes from addressed envelopes,29 mathematical equations and formulas from other text,30 detecting duplicate documents from document image databases (Gulf War Declassification Project),31 and other application domains. The application we are interested in is the automatic extraction of bibliographic data from biomedical journals indexed in MEDLINE. Our principal targets are the article title, author names, institutional affiliations and the abstract, all items that usually appear on the first page of a journal article. The motivation for this investigation is the promise of labor savings compared to the conventional keyboarding of this data into MEDLINE, and also the more timely availability of this data.32 Most applications based on DIAU research involve multiple processes, starting with the scanning of the document page to produce a bitmapped image, the conversion of the image data to text by optical character recognition (OCR), page segmentation that blocks regions of contiguous text into zones, labeling that classifies (identifies) the zones, and reformatting that organizes the text in the zones in the formats desired by the application. Apart from the manual scanning stage, the four other stages are automatically done by daemon processes to yield the desired information. In addition, in any practical implementation of an image analysis system, verification of the processed data must be done manually, as must the entry of data that are important to the final product (in our case, a bibliographic record) but which cannot be automatically extracted, e.g., because they do not appear on the page that is scanned. DIAU techniques rely on image features directly extracted from the bitmapped images, non-geometric data from the OCR system and lexical analyses using string or word pattern matching. Our work exploits all of these approaches. In this report we focus on the techniques for page segmentation (automated zoning), automated labeling and reformatting. We also discuss the lexical analysis research that improves the recognition of words incorrectly detected by the OCR system. In addition, a comparison of level of effort between a completely manual approach to capturing bibliographic data and ours in which several stages are automated appears in Section 8. This is followed by an outline of next steps in this project. References to the literature appear at the end. Based on our research we have designed, developed and built a system code-named MARS for Medical Article Records System. In addition to its role as a testbed providing data for research, MARS is an operational production system delivering one-third of the NLM's total bibliographic data requirements. The system is described in a recent 91-page document "Automating the production of bibliographic records for MEDLINE" accessible on our website33. This document gives a comprehensive account of: the design considerations and implementation details of the automated modules; the design and organization of the central database that controls the workflow, and which stores all data (into and out of the various processes); the evaluation criteria and testing that led to the selection of the OCR system; the design and implementation of the operator-controlled workstations for scanning, editing, reconciling (verification), and necessary tools for production control and supervision; the design of research tools that aid the design of our algorithms; and a detailed breakdown of performance statistics. 2. Project objectives The objectives of this project are to:
3. Project significance Document image analysis and understanding is key to automated data extraction from digitized documents in many application domains, as noted in Section 1. At NLM, the application of immediate interest is the automated extraction of bibliographic data from digitized pages of biomedical journals indexed for MEDLINE. There are two principal reasons for this application: first, the gradual rise of labor costs; and second, the unrelenting increase in the amount of data that needs to be entered into databases from paper-based information. This is as true for MEDLINE as for the hundreds of databases produced in every discipline that rely on laborious keyboard entry of bibliographic information from articles in journals, e.g., article title, author names, institutions, abstract, dates, page numbers, etc. Image analysis and understanding techniques provide the basis for the development of automated systems that promise a cheaper alternative to keyboarding, and a more timely availability of bibliographic data for the public. 4. Page segmentation In most DIAU techniques for automated text extraction, the first step after the conversion of the bitmapped image by the OCR system is page segmentation: to block out ("zone") the regions of contiguous text, which in our application are those text groups corresponding to the bibliographic fields of highest interest to us: viz., article title, authors, affiliation, and abstract. Much of the research reported in the literature employ methods analogous to those used to isolate and separate characters (symbol isolation) to segment page images into zones. A brief survey of automatic zoning methods is given in Jain34. Approaches include "top-down" methods35, which segment a page by x-cuts and y-cuts into smaller regions, "bottom-up" methods36,37, which recursively grow homogeneous regions from small components, and combinations of both34,38. Tradeoff factors include: granularity (finding small enough zones), computation time, and sensitivity to input parameters such as noise, skew and page orientation.39-41 Top-down methods tend to be faster and less sensitive to input parameters and page orientation, but require pages to have a "Manhattan layout", i.e., the blocks may be separated by vertical and horizontal lines. Bottom-up and combination methods often result in greater accuracy at the expense of computational complexity and sensitivity to input parameters. All of these methods zone the page using image data alone, prior to OCR conversion. Since the reported performance is variable, and because rich secondary data is available from our OCR system, our approach, in contrast, is to exploit the output data of the OCR system to implement automatic zoning. The commercial OCR system used in the MARS system includes a package to perform automatic zoning, but with inadequate accuracy. The most common error is that zones are too large and include more than one significant text group. Figure 4.1 illustrates a typical case in which the title, author, abstract and affiliation are all in one zone, along with extraneous publication identification. Figure 4.2 illustrates another case where, in addition to the previous problem, a two-column abstract is grouped inappropriately into a single zone. In this example, the text lines in the two columns are joined, disrupting the proper reading order. For example, the middle text of the first line of the zone is incorrectly read as "..models have opment of..."
Correct zones are critical to downstream processes in systems based on DIAU. In our application, the stage that follows automated zoning is the automated labeling of the zones as title, authors, affiliation and abstract.42 This complex labeling process uses several items of information in each zone to determine its identity. Information used to label a zone include absolute and relative location of the zone, and key words within the zone. Clearly, the zone region must be correct if it is to provide useful information to the labeling algorithms. Downstream from automatic zoning and labeling, there is often a requirement for the text to be reformatted to comply with syntactic rules (in our case, MEDLINE conventions).59 This process also depends on correctly sized and labeled zones to be effective. Incorrect zones confound reformatting, ultimately requiring time-consuming manual intervention when the captured text is verified, thereby offsetting the advantage expected from an automated system. Since we cannot depend on the commercial OCR system to correctly zone images, and seek to eliminate manual zoning, we developed an automatic zoning capability. However, rather than starting from scratch, we combine the automated zoning capability of the OCR system with our added functionality for zone correction. 4.1 Methods and Procedures As noted earlier, in addition to ASCII text, the OCR system provides information about each of the converted characters in the output file. This information includes the level of confidence that the character was correctly recognized, character attributes such as italic or bold, character point size, and the x,y coordinates of the rectangle bounding the character (bounding boxes).43 Thus we have both geometric and non-geometric feature information available for each converted character. Our approach is to draw upon these features to group text into correct zones. For example, we use the bounding box coordinates to determine which characters are grouped closely in the same region on the page. Information on character size and attributes provide additional clues for keeping groups of adjacent characters together or placing them in separate zones. Our zone correction method uses both top-down and bottom-up design strategies,44 used by other investigators solely on image data, on our OCR output (non-image) data. Our procedure is outlined in Table 4.1.
Figure 4.3 shows an illustration of the method. The first step in creating new zones is to disassemble the original zones from the OCR system. Each zone is divided into individual text lines. In step 2, lines are further split horizontally into multiple lines when the space between words exceeds a distance threshold (empirically determined). This occasionally results in unnecessarily splitting lines into multiple parts, but is needed in order to split lines that originally span across two closely spaced columns. Some of these lines will be rejoined in later steps. The bounding box enclosing each line is computed, as are several features such as percent italic characters and average character height. Some character features, such as bold or italic, are available directly from the OCR output data. Others, such as character height or case (upper or lower), are computed from the OCR output data.
Figure 4.3 Zone correction program steps In step 3, we combine the lines vertically into initial zones. The criteria for combining are that (a) the vertical distance between lines must be less than a threshold (again, empirically determined); (b) either the left edge, right edge or midpoint must be horizontally aligned; and (c) the features computed in the previous step must be similar. When a line is added to a zone, the zone's rectangular boundary is expanded to include the new line. Then all remaining lines are checked to see if they fall within the new zone. If so, they are added to the zone. Many of the horizontally split lines are recombined in this way. On rare occasion, some zones created in step 3 are too narrow. In this event, the fourth and last step is to combine such zones horizontally using criteria similar to those in the previous step. Here, the initial zones are combined if (a) the horizontal distance between the zones is less than a threshold; (b) either the top or bottom edges of the zones are vertically aligned; and (c) the computed features of the two zones are similar. When zones are thus merged, a new zone boundary rectangle is created to include both zones. Any smaller zones that fall within the rectangle are subsumed within this zone. Figures 4.4 and 4.5 show the results of these steps applied to the two images used as examples in Figures 4.1 and 4.2. In both of these images, the title, author, affiliation and abstract are enclosed in separate zones, as required. In addition, in Figure 4.5, the two columns of the abstract are in separate zones. These two zones will be identified as abstract by the automated labeling process, which follows the zone correction process, and the enclosed text will be organized in the proper reading order.
5. Automated labeling The step following page segmentation is to label the zones, i.e., identify each zone as one of the bibliographic fields of interest. The figure below shows the sequence of steps: the bitmapped TIFF image of the scanned page, the output of the automated zoning module (AZ) and the output of the automated labeling module (AL).
Image analysis techniques for document labeling proposed in the literature45-47 are based mostly on the layout (geometric) structure and/or the logical structure of a document. Hones et al.45 describe an algorithm for layout extraction of mixed-mode documents, and the classification of these documents as text or non-text. Taylor et al.46 describe a prototype system using a feature extraction and model-based approach. Tsujimoto et al.47 present a rule-based technique based on the transformation from a geometric structure to a logical structure. Tateisi et al.48 propose a method based on stochastic syntactic analysis to extract the logical structure of a printed document. They use simple rules to label documents into three classes. Niyogi et al.49 use a rule-based system to label newspaper contents into thirteen labels such as headline, text paragraph, photograph, and so on. These labeling techniques rely mostly on rule-based algorithms, but other mechanisms such as artificial neural networks (ANN) and decision trees are also investigated. One drawback to ANN is that training is required as a pre-processing stage. That is, the algorithms need to be re-trained whenever a new document (in our case, a journal layout not seen previously) is encountered, and the training time is proportional to the number of journal titles to be processed. Not only is this time consuming, it also makes it difficult for exceptional situations to be handled quickly. In addition, these techniques pose difficulties in readily using geometric information, e.g., the geometry between zones. Rule-based algorithms, on the other hand, do not need re-training, can employ geometric information readily, and moreover, can accommodate exceptional cases (slight divergence from a known layout type) by the addition of new rules. Since the 4,300+ journal titles indexed in MEDLINE exhibit a wide range of layout types, such exceptional cases can occur frequently. An automated labeling system needs to handle a multiplicity of layout types and exceptional cases quickly, and without extensive pre-processing and training. In this section we report on our research focusing on three approaches: the rule-based algorithmic approach (Sections 5.1-4), an ANN method (Section 5.5), and a template-matching technique (Section 5.6). Our experiments and findings are also reported in the literature.50-52 Based on these experiments, we decided to implement our labeling system on rule-based algorithms since this approach delivered a high accuracy rate, high speed of execution, and furthermore was amenable to modification as new layout types were added. The structure of the AL module and its interaction with tables in the MARS database appear in our extended report.33 All three of our techniques rely on data from the OCR system which delivers information at the zone, line and character level:
The OCR output data is used to generate geometric and non-geometric features that, in turn, are used to create rules. Geometric features are based on a zone's location, order of appearance, and dimensions. For example, the article title zone is usually located in the top half of the page, followed by author, affiliation and abstract, in that order. Non-geometric features are derived from the text contents of a zone, aggregate statistics, and font characteristics. For example, some zones can be characterized by the words in them, and the frequency with which they occur. In such cases, word matching is an important technique to generate non-geometric features in the AL module. For example, a zone has a higher probability of being labeled as "affiliation" when it has words representing country, city and school names. Also, a zone positioned between the words "abstract" and "keywords" is more likely to be an abstract than any other bibliographic field. Fifteen database tables containing word lists have been assembled as shown in Table 5.1. Table 5.2 shows examples of geometric and non-geometric features. Word matching relies on search algorithms such as hash tables, binary search tree, digital search tree, ternary search tree, etc. We chose the ternary search tree53 on account of its ability to yield both the time efficiency of the digital search tree and the space efficiency of binary search trees, and its ability to perform advanced searches such as partial-matching and near-neighbor search.
5.1 Definition of layout types As noted, the MEDLINE database contains bibliographic records from over 4,300 journals. The physical layout of the first page of articles in these journals, and the order in which the five important zones (title, author, upper affiliation, lower affiliation, and abstract) appear on the first page may be used to categorize the zone labeling type for a given journal. Figure 5.2 shows examples of common layout types consisting of a single column, or a combination of single and multiple columns. The numbers in the gray blocks indicate block numbers to help with the definitions of the more common zone labeling types described in Table 5.3.
The five important zones frequently appear in "first regular" or "second regular" zone order. In the "first regular" zone order, the title is near the top of the page, followed by author, affiliation in the upper part of the page (upper affiliation), and abstract. In the "second regular" zone order, the title is followed by author and abstract, with the affiliation appearing in the lower part of the page. The zone labeling type for each journal is determined by the journal layout type and the zone order. For example, if the journal pages are of layout type 121 [Figure 5.2(d)] and the affiliation appears in block 4 (second regular), the zone labeling type is defined as Type 12006. Other labeling types are described in Table 5.3.
5.2 Rule-based algorithms While all contiguous text regions on a page image are zoned, the zones of interest are the article title, author, affiliation and abstract. Since affiliation information could reside in the top part of the page as well as at the bottom, for labeling purposes, we define an "upper affiliation" and a "lower affiliation" zone. Hence, we have five possible labels. The remaining zones are labeled as "other". For each label type, there are four types of rules as shown in Table 5.4: rule types 1, 2 and 3 that are different for each label classification, and rule type 4 that is the same for all. Our rule-based algorithm consists of four steps. In the first step, a probability of correct identification (PCI) is used in rule type 1. Every zone has five PCIs, one for each label. A PCI is equivalent to the probability of a zone possessing a particular label. The PCIs are derived empirically. For example, in the case of upper affiliation, when more than 30% of words in a zone belong to the affiliation word list, the PCI of upper affiliation is 100. Otherwise, PCI is equal to PercentOfAffiliation * 100/30. In case of author, when more than 28% of words in a zone belong to the list of middle names and academic degrees, the PCI of author is 100. Otherwise, PCI = (PercentOfAcademicDegree + PercentOfMiddleName) * 100/28. In this first step, when a zone has the highest PCI for a particular label, it is assigned that label. The PCI thresholds of 30% and 28% for affiliation and author respectively are established heuristically. In the case of author, we often find there are two authors in an author zone, each author name usually consists of three words, and "and" is located between the author names. We find that there are middle initials and academic degrees associated with author names. So, a zone is likely to be labeled as author when the ratio of the sum of academic degrees and middle initials to the total number of words in the zone exceeds 2/7 or 28.6%. In the case of affiliations, it has been determined that a zone is likely to be labeled as affiliation when 30% of the words belong to the affiliation word list. In the second step, the labeling results from step 1 are rechecked by rule type 4. For example, when two zones are both labeled as author but one of those zones is located between title and upper affiliation, and the other is located between upper affiliation and abstract, the latter is removed from the author category. In the third step, in addition to rule type 2, rule types 1 and 4 are applied again to make sure that at least one zone is labeled as title, author, abstract, upper affiliation or lower affiliation. For example, when a zone initially labeled as author does not contain information relevant to author (NumberOfMiddleName=0 and NumberOfAcademicDegree = 0), its location is then used to do the labeling. That is, its label as author is verified by the facts that (a) it does not contain information related to title or upper affiliation zones, and (b) it is located between title and upper affiliation zones. In the fourth step, problems caused by zoning errors such as a zone split into multiple zones are handled by all rules, and any remaining unlabeled zones are labeled. 120 rules were generated for zone labeling types 10000, 10006, 12000, 12006, and 12200, and an example of detailed rules to detect upper affiliation is shown Table 5.5.
5.3 Evaluation of rule-based automated labeling Currently the AL module can reliably process 2,027 journal titles from the 4,300+ titles indexed in MEDLINE. Since NLM receives bibliographic data for 580 of these directly from publishers, the actual number of titles that may be processed by MARS is 1,447. In Table 5.6 we show performance data for the month of February 2001 for 159 journal issues containing 2,524 articles processed by MARS. There were 101, 10, 37 and 11 journal issues in zone labeling types 10000, 10006, 12000, and 12200 respectively. The data shows that 0.4% of the labeling errors is due to incorrect OCR output and 0.63% is due to poor zoning (AZ). The error rate attributed to the AL module itself is 0.20% when OCR and AZ are correct. The reason for the high error rate in the affiliation field is that text in this field is small sized and are frequently italicized, both factors contributing to poor detection by the OCR system. In overall performance, the AL module delivers an accuracy of 98.77%.
5.4 Ongoing research in rule-based approach As mentioned earlier, we used empirical methods to derive thresholds for the probability of correct identification (PCI) for each label, such as 28% and 30% of special word lists for PCI thresholds for author and affiliation. We plan to refine these figures by using statistical data, i.e., create histograms of every word list collected from the journals processed by MARS for each label zone, and select thresholds based on these histograms. Our objective is to accommodate all journal titles indexed in MEDLINE, but we find that a number of these do not follow the relatively regular layout types that the system can process at present. Figure 5.3 shows examples of these irregular layouts. One approach to dealing with these irregular layouts is to develop a template matching algorithm based on the average font size and the average top-left and bottom-right coordinates of all important zones. These features will be stored in the database in a journal-specific manner. When a journal issue with irregular layout is processed, the AL module will read the zone coordinates and the font size of the text in the zone, and match them against the stored information.
5.5 Automated labeling: artificial neural network approach In this section, we describe research toward automated labeling using integrated image and OCR processing, and a back-propagation artificial neural network. Basically, features are taken or calculated from OCR output and, after normalization, fed into the input layer of a neural net for label classification. Experimental results on a sample size of several thousand images of medical journal pages show that the system is capable of labeling text zones at a classification accuracy of 98%. 5.5.1 Zone features Features for this labeling technique are based on page layout analysis for each journal, and generic typesetting knowledge for English text.54 Sixteen geometric and non-geometric features are considered here, as shown below:
5.5.2 Method To use a neural network model as a pattern classifier, its structure has to be designed, and the network trained. Here, we discuss the selection of training and testing data sets, a method to train and test the neural network, and the neural network structure design. A back-propagation ANN for each journal type is designed, trained, and tested with journal-specific data. For each journal type, ("type" based on page layout and style characteristics that differ from one journal to another), a group of several journal issues is selected to create the training and test sets. For purposes of generalization, the cross-validation technique is used to divide these data sets. The training data set is used to design the ANN while the testing data set is used to estimate the classification accuracy. In addition to the labels of interest to us (title, author, affiliation, and abstract) all other zones are labeled as "others." Training and testing data sets Since each journal has its own page layout and style setting, we create a neural network for each journal type. For each type, a neural network is designed, trained, and tested with its own data. A group of at least four journal issues is selected to create the training and data sets for each journal type. Twenty-five different journal types consisting of 107 issues were selected for the experiment for a total of 2,948 binary images. Cross-Validation method For purposes of generalization, the cross-validation (CV) technique55 is used by randomly dividing the training data set into five data groups of which four data groups constitute a "CV-train" set and the one remaining group is considered the "CV-test" set. As a result, there are five pairs of a CV-train set and a CV-test set that are used to train and test the back-propagation neural network. Each neural network is trained and tested with each pair of a CV-train set and a CV-test set. The modified weights corresponding to the winning pair of a CV-train set and a CV-test set, the one yielding the highest classification accuracy, are chosen to be the final weights for the neural network. Back-Propagation neural network Back-propagation (BP)54-56 is a multi-layer ANN using sigmoidal activation functions. The network consists of an input layer, hidden layers, and an output layer, and nodes in each layer are fully connected to those in the adjacent layers. Each connection is associated with a synaptic weight. The BP network is trained by supervised learning, using a gradient descent method, which is based on the least squared error between the desired and the actual response of the network. We implemented a two-layer BP network with an input layer of sixteen text zone features, a five output layer (title, author, affiliation, abstract, and others), and a single hidden layer in which there are 8 nodes. The two-layer BP network architecture may therefore be characterized as 16-8-5. Each input vector of the training data set is presented to the network multiple times and the weights are adjusted on each presentation to improve the network's performance until maximum performance is achieved. Two learning factors that significantly affect convergence speed, as well as avoid local minima, are the learning rate and the momentum. The learning rate determines the portion of weight that needs to be adjusted. Even though a small learning rate guarantees a true gradient descent,57 it slows down the network convergence process. The momentum determines the fraction of the previous weight adjustment that is added to the current weight adjustment. It accelerates the network convergence process. During the training process, the learning rate was adjusted to bring the network out of either its local minima (where the network has converged but its output error is still large) or its no-gain mode (the network mode in which its output error does not change, or changes very little over many cycles). The learning rate ranges from 0.001 to 0.1, and the momentum is 0.6. 5.5.3 Results The BP neural network was trained with all five pairs of a CV-train set and a CV-test set. The average training time for each pair was about 8 hours. The network configuration associated with the winning pair was evaluated on the testing data set. The result was that the average classification accuracy on the testing data set was about 98.0%. Errors were due to inaccurate segmentation: for example, a zone of interest (such as title zone) split into multiple zones, as well as several different zones (such as author and affiliation zones) merged into a single zone. 5.5.4 Summary and conclusions Automated labeling using a back-propagation neural network showed encouraging performance for 25 different journal types, and showed the possibility of extension to other journals. The label classification time is fast and the results are stable regardless of journal type. However, there are drawbacks as well. For example, it is difficult to use the geometric relations between labels as features, and it is time consuming to train the module and tune its learning parameters. It is also hard to analyze wrong labeling results. The most serious drawback is that the entire neural network must be retrained for new types of journals. We propose to continue this investigation with other ANN paradigms. 5.6 Automated labeling: template matching with page normalization Observing that all first pages of articles in any journal follow the same general layout, we consider a template matching approach to label the zones. Simple template matching is unlikely to be successful because of geometric variability, but matching combined with page normalization proves to be a viable approach. Preliminary evaluation results using a sample size of several hundred images of biomedical journal pages show that our approach is capable of a label classification accuracy exceeding 96%. In addition to geometric and content-based features derived from geometric zone information and zone contents respectively, we propose a new feature called "single and multiple column zone vertical area string pattern" to normalize the page images. After normalizing the pages, zone features are calculated, and then used to create several predefined types of vertical area string patterns for the entire page. Finally, zone features and vertical area string patterns are input into a template matching system for the final decision on label classification. 5.6.1 String patterns Basic definitions necessary to describe our algorithms are given here. Single and multiple column zone vertical areas A single column zone vertical area of a binary image is defined as a vertical area in which only one text zone exists. A multiple column zone vertical area of a binary image is a vertical area in which more than one zone exists, and these zones are "vertically overlapped". Two zones are considered to be vertically overlapped if the top and/or the bottom coordinates of one zone are within the top and the bottom coordinates of another zone. Figure 5.4 shows an example of the single and multiple column zone vertical areas. Vertical area string patterns Let "M", "S", and "*" be the vertical areas of multiple column zone, single column zone, and empty line spaces. Let "C", "L", "R", "l", and "r" be the zone location features: Center, Left, Right, Left of Center, and Right of Center. Let "N", "Y" and "+" be No, Yes, and "Don't Care" respectively. There are two types of patterns: geometry-based and content-based patterns that are defined as follows. Geometry-based "single and multiple column zone" vertical area string pattern The "single and multiple column zone" vertical area string pattern is the combination of characters "M", "S", and "*" that represent the top-to-bottom vertical areas of a binary image. An example of this type of string pattern is shown in Figure 5.4(1) as "M*S*S*S*S*S*S*M*S". Geometry-based "zone location" vertical area string pattern The "zone location" vertical area string pattern is the combination of characters "C", "L", "R", "l", "r", and "+" that represent the relative location of a zone against the vertical middle line of a page. The following logic is used to determine the zone location in a page.
The CENTER_THRESHOLD is selected to be about two 12-point characters and for 300 dot per inch document, its value is about 100 pixels. The "zone location" vertical area string pattern of an image shown in Figure 5.4(2) is "++C+C+C+C+C+L+++C". Content-based "single and multiple text lines zone" vertical area string pattern The "single and multiple text lines zone" vertical area string pattern is the combination of characters "Y", "N", and "+". "Y" characters are for zones having more than one text line and "N" characters are for one text line zones. The "single and multiple text lines zone" vertical area string pattern of an image shown in Figure 5.4(3) is "Y+Y+Y+Y+N+Y+N+Y+N". Content-based "Nth order font size zone" vertical area string pattern The "Nth order font size zone" vertical area string pattern is the combination of characters "Y", "N", and "+". The smaller the order, the larger the font size. "Y" characters are for zones of which font sizes are categorized as Nth order and "N" characters are for zones not having Nth order font size. Examples of the "1st and 2nd order font size zone" vertical area string patterns are shown in Figures 5.4(4) and (5) as "N+Y+N+N+N+N+N+N+N" and "N+N+Y+N+N+N+N+N+N" respectively. Content-based "Nth order percentage of capital characters zone" vertical area string pattern The definition of "Nth order percentage of capital characters zone" vertical area string patterns is similar to the definition presented in the previous section. The difference is that the percentage of capital characters compare to total characters of a zone is used instead of the font size. The smaller the order, the larger the percentage. Figures 5.4(6)and (7) show the "1st and 2nd order percentage of capital characters zone" vertical area string patterns as "N+Y+N+N+N+N+N+N+N" and "N+N+Y+N+N+N+N+N+N" respectively. 5.6.2 Zone features Features calculated for this technique are based on an analysis of the page layout for each journal. Here are 16 features for each zone and 2 features for the entire page:
The page content left/right frame coordinate is defined as the left-most/right-most coordinate of text zones in an image page. Both geometric and content-based zone features are used to create several predefined types of vertical area string patterns for the entire page. These zone features and vertical area string patterns are then input into a template matching system for label classification. The purpose of creating vertical area string patterns, especially the "single and multiple column zones" pattern, is to normalize the document image page. Generally, the number of text lines in a labeled zone such as title, author, affiliation, or abstract is different from one article to another in a journal issue and therefore the labeled zone coordinates of one article may not be the same as those of another article. As a result, using the same document style guide, the geometric page layout of one article may not be the same as that of another article in the same journal issue. In order to overcome this problem of irregularity, we propose a new feature called "single and multiple column zone vertical area string pattern" that will be used to normalize the page images. As defined in section 5.6.1, the "single and multiple column zone" vertical area string pattern consisting of characters "M", "S", and "*" can be created by identifying vertical areas having single or multiple column zones from the top of a binary image to the bottom. Using this feature, we could have the same vertical area string patterns for document pages that use the same document style guide. Figures 5.4 and 5.5 show an example of two binary images having zone contents with different number of text lines but sharing the same "single and multiple column zones" vertical area string patterns. 5.6.3 Template matching algorithm The template matching algorithm matches vertical area string patterns of a binary image to those of predefined layout document structures of a given journal type to derive two types of similarity classification features: degrees of geometry-based similarity and degrees of content-based similarity. If both similarity measures exceed a predefined weight threshold, the label classification of a predefined article page will be used to label zones of an arbitrary binary image. The following summarizes the template matching. Set the weight matching to 0 5.6.4 Predefined vertical area string patterns For each journal type, a small set of article image pages are used to generate predefined vertical area string patterns and each pattern is labeled as title, author, affiliation, abstract, or "other". Since all labeled zones consist of text only, it is reasonable to automate the generation of the predefined vertical area string patterns along with their label classifications by matching the content of zones labeled by the user against that of an image. Let "1", "2", "4", "8", "0" be title, author, affiliation, abstract, and others. An example of predefined vertical area string patterns of an image article shown in Figure 5.4 using up to two orders is as follows (the numbers in parentheses corresponding to the strings in the figure):
5.6.5 Results Experiments were conducted with binary images selected from several different medical journals. A test sample consisting of 524 article page images from four different journal types was used, and the algorithm correctly classified 503 image pages, giving an accuracy rate of 96%. As in Section 5.5, errors were due to inaccurate segmentation leading to split or merged zones. 5.6.6 Summary and conclusions The technique provides meaningful labels for article titles, authors, affiliations, and abstract with high accuracy, and showed the possibility of extension to other journals. Our approach using the proposed feature called "single and multiple column zones" pattern can successfully handle pages with the same document style guide but different geometric page layouts.
6. Automated reformatting In many DIAU-based systems the syntax of the zone contents need to be reformatted to comply with conventions. In our case, we find that the text in the zones labeled as title, author and affiliation rarely conform to the syntactic forms required by MEDLINE's conventions. The research described here is directed to automatically rearranging this text to the desired formats to eliminate later manual correction. The procedure relies on predefined rules. Rules for the title field retain the capital case for the first letter of the first word, and de-capitalize all the other words with the exception of acronyms. An example: "Medical Management of AIDS Patients" becomes "Medical management of AIDS patients," as required in MEDLINE. Rules for author field take into account characters that delimit authors in a multiple-author list; tokens to be eliminated, such as Ph.D., M.D.; tokens to be converted, such as II to 2nd; and "particles" to be retained, such as "van." For example, the author name appearing on the printed page as Eric S. van Bueron, Ph.D. becomes Van Bueron ES as required in MEDLINE. For author and title reformatting, our algorithms use a subset of rules from the inclusive set of all rules. The selected rule set and the OCR output text are passed to the reformatting algorithm, and as each rule is applied, the OCR string is modified. The reformatting strategy for the affiliation field is quite different from the above. The OCR data for an affiliation field could contain many affiliations, since each author may have a different affiliation. This data is often difficult to reformat. One reason is that only the affiliation of the first author is to be retained, in line with MEDLINE conventions. Another reason is that the desired data is spread out over the entire field and not contiguous. For example, in a 30 word affiliation zone, we may only want to retain words 1-8, 12-14, and word 30. Our method involves probability matching of the OCR output text to historical data of ~130,000 unique affiliations. In addition to this processing at the reformat stage, we attempt to improve the recognition of affiliations by lexicon-based methods described in Section 7. 6.1 Reformatting the Author field Reformatting the author field uses forward chaining rules-based deduction. The reformat module can have many rules defined for a particular field. Each rule has a number of requirements among which are that it must
The third column in Table 6.1 shows the complete reformatted field. Note that a single rule or category does not necessarily complete the reformatting, but may need to be combined to achieve correct reformatting of the author field. With the eight categories defined, the first step is to define which rules are appropriate for a particular ISSN (journal title), since the printed format varies widely among journals. As an example, in one journal the authors appear as: This can be difficult to parse with a default set of rules, such as ', and' and ',' so that other rules need to be defined. By defining, in the database, the rules for a specific journal title over a specific period of time we can customize the rules to work for unusual or specific cases. The above example fails in the default rule set that only has ',' and ', and' as the author delimiters because this would incorrectly identify 'MD' and 'PhD' as author names. To accommodate this journal (and others like it) a high priority rule trigger list was created for author delimiters such as ', MD', ', PhD', 'Mr.', 'Dr.', and other formal titles. To avoid conflict among rules, each word chain is passed through all the categories recursively until no more rules are triggered. As long as we have an antecedent with consequences we continue to process the word chain. Using the forwarding chaining method, when an "if statement" is observed to match an assertion, the antecedent (i.e., the if statement) is satisfied. When the entire set of if statements are satisfied, the rule is triggered. Each rule that is triggered establishes, in a working memory node, that it was executed. During conflict resolution the reformat module decides which rules take priority over others via specificity ordering. An example would be: Our conflict resolution method specifies that the convert category is more specific than the reduce category, thus keeping the word 'Smith' and '2nd'. In addition, the pre-word convert flag in this particular example signals the conflict resolution manager to keep 'Smith', initialize 'J', and append '2nd'. This is possible because we have retained our original text and the converted text. The text did not change and an integrated rule has informed us that the word 'Smith' has remained unchanged, and by examining all words, we deduce that this is the last name. Example Before/After: At the category level, the conflict resolution strategy is specificity ordering. There is also a conflict resolution strategy within a given category: priority list rule ordering. Rules within a given category are assigned a priority level to avoid conflicts. An example of this is the following list of authors appearing on the printed page: We have the following author delimiter rules defined: ',' and ', and' However, the ',' is assigned priority 1, and the ', and' is assigned a higher priority 2. If we did not give a higher priority to ', and' we could end up with 'and' as part of the author name or create a null value. In ground truth testing of the author reformat rules system we tested 1,857 authors from OCR data. Of that number, 41 were reformatted incorrectly, for a 97.29% correction rate. Of those 41, all 41 were missing rules defined for a given case. An example of a missing rule is given in the case of an author field that reads: By simply adding the rule [', Jr. ' author delimiter priority 2], and with no changes in code, we achieved 100% correct reformatting in the test set. 6.2 Reformatting the Article Title field The title field uses the same principles as in the author rules system, but requires fewer rules or categories. Of the eight rule categories required for reformatting authors, only three are needed to reformat titles: Uppercase, Lowercase and First Letter Upper. 6.3 Reformatting the Affiliation field Institutional affiliations of the authors are reformatted by finding the best match between the OCR text and a list of about 130,000 correctly formatted affiliations obtained from the current production version of MARS. Simple string matching is not promising because of the myriad arrangements in which affiliations can be expressed. Most journals show the affiliations of all authors, but by convention only the affiliation of the first author is entered into MEDLINE. However, the text string corresponding to the first affiliation may be scattered throughout the OCR text for the affiliation field. As an example, when multiple authors are affiliated with different departments within the same institution, the printed affiliation may be "Department A, Department B, Department C, Institution XYZ," while the correct MEDLINE entry is "Department A, Institution XYZ." The problem is further confounded by OCR errors, especially errors in detecting superscripts and subscripts. To find a match, the entire OCR text of the affiliation field is compared with every entry in the list of existing affiliations. A matching score for each of the existing affiliations is calculated on the basis of partial token matches, distance between token matches and customized soundex matching. Tests show that our current version of affiliation reformatting successfully identifies the correct affiliation over 80% of the time when the affiliation is represented in the list. This success rate is expected to improve with parallel efforts to reduce OCR errors and the expansion of the list of affiliations from ongoing production data. The first step is to read all these unique affiliations into memory and create a ternary search tree53 for each affiliation, after which we create a soundex word list58 for each affiliation. When a zone is identified at the labeling stage as an affiliation field, the OCR data is first processed through a partial-matching algorithm. Low confidence characters are replaced with wildcards. Example: Uniuersity. The 'u' is actually a 'v' but the OCR engine assigned it as a 'u' with a low confidence level. The partial match algorithm replaces the 'u' with a '.' signifying that this character is a wildcard, and that any word in our search tree that has the pattern Uni<any letter>ersity is considered to be a match. The first step is to determine if a word in the affiliation zone matches one in the affiliation list. Ignoring implemented performance optimizations we perform a partial word match for all the words in the OCR list and build up a chain of those words that do match. We also track distances between chains. Consider the example of trying to find the affiliation "Department of Computer Science, University of Maryland" in the affiliation list. The OCR input string might look like: "Department of Computer Science, Department of Engineering, University of Maryland, Department of Computer Science, Johns Hopkins University." Since only the first affiliation is to be retained, there is considerable data that is irrelevant. The problem is to retrieve just the data needed. By word chaining we can find chains of words that exist in both the OCR text and in an affiliation zone and then use these to derive weighted probabilities. In this example there is a chain of 4 words that match, followed by 3 that do not match, followed by 3 more that match, and finally 7 that do not. Our probability algorithms compute chain word matches and distances between chained words. The next step in our process reverses the partial word match. The ~130,000 affiliations are matched to the OCR affiliation. Using the same example, "Department of Computer Science, University of Maryland" has 7 words and all 7 occur in our OCR word list. It is likely there is another affiliation entry that looks like "Department of Computer Science, University of Delaware". This would give a high match of 6/7 words. By comparing and weighting word matches from OCR to Corrected Affiliation and Corrected Affiliation to OCR, and using information such as the number of words matched, total number of words, chain of words matched, and chain of words unmatched, we arrive at a probability between 0 and 1. Note that partial matching is used to help cover OCR errors that would ruin a literal string pattern matching as the affiliation field is often in a smaller font and is likely to incur higher than normal OCR error rates. In addition to a partial match search algorithm, a soundex algorithm is used with the addition of OCR substitution. For the example in which 'Uniuersity" has the 'u' as low confidence, a substitution table developed lists of common OCR errors where a u == v == y. All three letters are substituted in the low confidence 'u' position, and if a word matches with a soundex hash it counts as a match. In our ground truth testing with affiliation zones,23 we found that if the OCR affiliation exists in our affiliation list of 130,000 entries, the probability that the affiliation match is the correct one is 88%. The affiliation reformat module picks the top 5 candidates which are presented for final text verification.
7. Lexical analysis to improve recognition DIAU techniques generally rely on OCR conversion of the document image. Overcoming the limitations of OCR is an important stage in most of these techniques, and lexical analysis approaches are often used for this purpose. Our work in lexical analysis techniques is motivated by two problems observed in production. The first problem was the excessive number of highlighted characters (which were actually correct, but assigned a low confidence level by the OCR system, and hence highlighted on the screen.) The second problem was the large number of character errors in the detected affiliations field, a consequence of small font size and italic attribute in the printed text in that field. Both problems placed an additional burden on the reconcile operators to correct and verify the text. Two modules, developed to solve these problems and reduce the operator labor, exploit the specialized vocabulary found in biomedical journals. While the modules use different techniques, both employ specially selected lexicons to modify the OCR text that is presented to the reconcile operators. (A detailed description of our design of the workstations used by the reconcile operators to verify the bibliographic record appears in our extended report.33) 7.1 Lexical analysis to reduce highlighted words 7.1.1 Problem Statement The OCR system in MARS was selected for its high rate of correctly recognized characters (high detection accuracy) and the very low number of incorrectly recognized characters that were assigned a high confidence value (low false positives). Confidence levels lie in a range between 1 and 9. Trading off the low percentage of false positives, we found that over 90% of words containing low confidence characters are actually correct, and that these characters should have been assigned a value of 9 by the OCR system. To draw the reconcile operators' attention to characters that may need correction, all low confidence characters are highlighted in red on the reconcile workstation screen. When these are mostly correct, the operators are unnecessarily burdened by having to examine and tab through them. Figure 7.1.1 shows a portion of the reconcile screen, with characters highlighted incorrectly, i.e., with the original confidence values from the OCR system.
In this example, part of the bitmapped image of the abstract field is displayed at the top of the screen and the corresponding OCR output text is displayed at the bottom. Although all of the OCR text is correct in this example, many characters are highlighted in red. Our objective is to reduce this number of highlighted characters. 7.1.2 Approach We seek to reduce the number of (incorrectly) highlighted characters by automatically increasing the confidence level of characters detected correctly by the OCR system. Our approach is to locate each word in the title and abstract fields that contains any low confidence characters, check for the word in a lexicon and, if the word is found, change the confidence of all its characters to 9, the highest value. A study was undertaken to determine criteria (heuristic rules) for selecting words to be checked and a lexicon suitable for biomedical journal articles. The key element of the study was the creation of a ground truth dataset with which to compare lexicons and lookup criteria. The ground truth data consisted of 5,692 OCR output words containing low confidence characters extracted from journals already processed by the MARS system. Each of these words was compared to the corresponding word in the final, verified bibliographic record created by MARS to determine if the OCR word was correct or not. Candidate lexicons and lookup criteria were evaluated with the goal of removing low confidence values from ground truth words that were correct, while retaining the low confidence values for those words that were not correct. Removing low confidence values from correct words is the "benefit" of the module. Removing low confidence values from incorrect words is the potential "cost" of the module. 7.1.3 Experiments and results Four candidate lexicons were created from various word lists maintained by the National Library of Medicine with the expectation that these would contain a preponderance of the biomedical words found in journal articles indexed in MEDLINE. The four lexicons and their combinations were tested along with several lookup criteria involving word length and character confidence levels. As expected, there was a tradeoff between benefit and cost. A large lexicon and no lookup restrictions removed low confidence values from over 90% of the OCR correct words (a 90% benefit), but also removed low confidence values from over 60% of the OCR incorrect words (a 60% cost). To ensure the integrity of the final text, it was considered on balance more important to minimize cost than to maximize benefit. Three combinations of lexicons and lookup criteria resulted in acceptable costs of less than 0.5% and benefits greater than 40%. The final choice correctly removed low confidence values from 46% of the correct OCR words and incorrectly removed low confidence values from 0.4% of the incorrect OCR words. The selected lexicon consists of unique words derived from NLM's SPECIALIST Lexicon and UMLS Metathesaurus. There are two levels of lookup criteria: 1) Words less than four characters in length, or containing no alphabetic characters are not checked. 2) Words less than six characters in length are not checked if any of the confidence values are less than 7. All other words containing low confidence characters are compared to the lexicon. If the word is found, the confidence values for all the characters are changed to 9, the highest value. 7.1.4 Implementation The lexicon checking module (Confidence Edit) was implemented at two phases of the project, originally for the first generation production system (MARS-1) and later for the current system. In MARS-1 we found that lexicon checking reduced the highlighted words on average from approximately 14% of the words presented for verification at the reconcile workstation to approximately 6.5%. This 50% reduction in highlighted words resulted in a 4% increase in production rate, and was reported in the literature60. For the current (MARS-2) system we added 9,386 words to our original lexicon. These were obtained by extracting all the words found in the verified and corrected abstracts from over 27,000 journals (= 230,000 articles) processed from May 1997 to April 2001, and using the frequency of occurrence of each word during that period. New words occurring at a frequency of 50 or more were added to the lexicon. Remaining words that occurred at least twice were checked against nine electronic dictionaries that are available to NLM. If the word was found in Dorland's Medical Dictionary, in the Oxford Medical Dictionary, or in at least six other dictionaries, it was added to the lexicon. When the new lexicon was tested with the original ground truth data, a 4% improvement in benefit with no increase in cost was measured. Using the expanded lexicon, Confidence Edit has reduced the percentage of highlighted words in the abstract field to approximately 3.5%. Figure 7.1.2 illustrates the effect of processing by Confidence Edit on the same document shown in Figure 7.1.1. Most of the characters are no longer highlighted.
7.2 Lexical analysis to improve recognition of Affiliations 7.2.1 Problem Statement As noted previously, although the commercial OCR system used by MARS performs well in general, accuracy is often poor for small fonts or italic characters. In particular, authors' affiliations frequently appear as small and/or italic characters, resulting in many incorrect characters in the affiliation field. Consequently, the final check and correction of the affiliation field requires a disproportionate amount of human labor compared to other fields extracted by our automated system. We observed that for about one in five affiliations, there were so many highlighted words that the operators preferred to type the entire affiliation rather than examine and correct each word. Not only was this time consuming, but also represented a potential source of error in the completed bibliographic record. In this section we describe the experiments that led to the design and implementation of the PatternMatch module that corrects words in the affiliation field. 7.2.2 Approach Words that frequently appear in the affiliation field in biomedical journals are drawn from a relatively small vocabulary that denote institutions and their divisions, such as University and Department, the various branches of medicine and biology, such as Pathology and Biophysics, and names of cities, states and countries. A study was undertaken to determine if partial string matching or other matching techniques could exploit the limited vocabulary to reliably find the correct word from a lexicon of affiliation words given an OCR output word containing low confidence characters. As in the previously described problem, a key component of this study was a ground truth dataset to compare matching techniques and lexicons. Words containing low confidence characters and the confidence values for all characters were extracted from the OCR output text in the affiliations field of over five thousand journal articles processed by the MARS system. Human operators selected the corresponding correct word from the affiliation fields of the completed bibliographic records. After words that contained no alphabetic characters were removed, the ground truth data consisted of over 20 thousand triplets, where a triplet consists of an OCR output word, the confidence values for the characters in the word and the correct word. Over 60% of the OCR words in the ground truth set are correct. Correct words and a count of their occurrences were extracted from the final, corrected affiliation field of approximately 230,000 journal articles that had already been processed by the MARS system. This set of journal articles is different from the set used to create the ground truth data. There were 96,982 unique words of 2 or more characters that occurred one or more times in this historical data. This list of words is the basis for candidate lexicons of affiliation words. 7.2.3 Experiments Six matching techniques61 were investigated using the ground truth data and the complete lexicon of affiliation words. Our goal was to find a technique to achieve a high match rate, a low false positive rate and fast processing. Most of the techniques return more than one potential match from a lexicon. If the correct word is among the list of returned words, it is considered a match. If no words are returned, it is considered no match. If words are returned, but none of them are the correct word, it is considered a false positive. The six techniques tested are: Whole Word Matching. This technique compares the entire OCR output word with each word in the lexicon. Either one word is returned, or none is. Because over 60% of the OCR words in the ground truth set are correct, the match rate was reasonably high. However, when using the complete lexicon, the rate of false positives was also high because a single OCR error or omission can result in a word that is found in the lexicon. For example, several non-English variations of the word University are included in the lexicon, including Universit, Universita, Universitaire, Universitat, and Universite. If the OCR word is "Universit", whole word matching will find "Universit" even though the actual word was "University" or one of its other variations. Partial Matching with wild card letters. In this technique, a match is sought with the "wild card" character '.' (a period) substituted for one or more of the low confidence characters in the OCR output word. Thus Partial Matching can find words in the reference dictionary where the OCR word has one or more character errors, but whose length is correct. For example, if we have an OCR word, Deparlmemt, with confidence values, 9699878956, a match would be found for Depar.me.t or D.par.me.t, but not for D.parlme.t or D.par.memt. For the same reasons as for Whole Word Matching, the false positive rate was high. In addition, the method does not find a correct match if the number of characters in the OCR word is incorrect. Near-neighbor Matching. This technique finds all of the words in the lexicon that are within a given Hamming distance of the OCR word. Hamming distance is a measure of the difference between two finite strings of characters, expressed as the number of characters that need to be changed to obtain one from the other. For example, "Deparlmemt" and "Department" have a Hamming distance of two. A high match rate could be achieved by specifying a large Hamming distance, but this also resulted in a high false positive rate. Near-neighbor Matching also fared poorly when the OCR word length was incorrect. Soundex Matching. Soundex matching finds words in the lexicon that are phonetically similar to the OCR output word. This technique proved to have a number of difficulties in overcoming character substitutions caused by the OCR system. For example, the word Department would often be interpreted as Deparlment by the OCR engine. These two words are phonetically quite different since the letters 't' and 'l' do not have similar sounds, and the correct word, Department, would not be returned. Bi-gram Search. This technique was adapted from software developed inhouse to suggest spelling alternatives for online Library clients. A bi-gram is a pair of adjacent characters in the word being analyzed. For example, Department contains nine bi-grams: De, ep, pa, ar, rt, tm, me, en, nt. For bi-gram searches, each word in the lexicon is searched for all possible bi-grams in the OCR word. Lexicon words containing multiple bi-grams in the OCR word are possible matches. Long OCR words can result in a large number of possible matches. The bi-gram prunes out unlikely candidate words through an algorithm that considers lexicon word length, OCR word length and the number of matching bi-grams. Bi-gram searches were less sensitive to OCR word length than some other techniques and resulted in a relatively high match rate. The false positive rate was also high, because one or two incorrect characters in the OCR word could result in many incorrect possible matches. Probability Matching. Probability Matching62 was developed inhouse specifically to address the problem of poor OCR recognition of words in the affiliation field. The first step of this technique compares the OCR output word to every word in the lexicon using an edit distance based on OCR character substitution frequencies, and assigns a confidence value to each lexicon word based on the probability that the OCR word will be produced when the true word is the lexicon word. Each word receives a score that is the product of the calculated confidence and the frequency of occurrence in the lexicon. Words are then ranked according to this score, and a specified number of highest-ranking words are returned. Probability Matching achieved the highest match rate, lowest false positive rate, and slowest processing time of the techniques tested. However it is slow: because it compares the OCR word with every word in the lexicon, it can take up to 1 second per word depending on word length, computer speed and lexicon size, with larger lexicons producing better matching, but slower processing. Initial experiments suggested further study toward an acceptable multi-stage process in which "easy" words are matched reliably by a faster process, and the remaining words are matched using Probability Matching. Combinations of matching techniques and lexicon sizes were tested in an effort to reduce the false positive rate and the processing time while maintaining the high match rates that had been observed. 7.2.4 Results The final choice was a cascaded matching process that capitalizes on the fact that over half of the OCR words are correct. It was found that trimming the lexicon to include only the most frequently occurring words significantly reduced the false positive rate of Whole Word Matching. Probability Matching with a larger lexicon for words not found by Whole Word Matching then yielded acceptable overall results with less impact on average processing time. The first step in our cascade matching is Whole Word Matching with a lexicon of 1,948 affiliation words with an occurrence of 100 or more in the historical data. Step 1 correctly matches 45% of the ground truth data, with a false positive rate of 1.4%. The second step is Probability Matching with the entire lexicon of 43,030 affiliation words with an occurrence of 2 or more in the historical data. Step 2 correctly matches 77% of the remaining 55% of the ground truth data, with a false positive rate of 16.7%. The overall performance of cascade matching was a match rate of 86% (with the correct word ranked highest for 81% of the words), a false positive rate of 11% and an average processing time of approximately 250 ms on a 500 MHz Pentium III. The still high false positive rate has implications for the way that potential substitute words are presented to the reconcile operator. 7.2.5 Implementation Word matching for low confidence words in the affiliation field was implemented through two software development efforts61. A separate console program called PatternMatch was developed to automatically parse low confidence OCR words from the identified affiliation field, submit the words to the cascade matching algorithm for possible correct words, and, if any words are returned, insert those words into the affiliation text following the original OCR word and specially tagged for final text verification. PatternMatch was developed in C++ in the Microsoft Visual Studio development environment. In addition to the MARS database for input and output records, PatternMatch requires three files: the large and small lexicons and the character substitution frequency matrix used by Probability Matching. Additional software was developed for text verification to support the interpretation of the special tags placed in the affiliation text by PatternMatch and the display of the word choices to the reconcile operator. This is treated in detail in our extended document33. PatternMatch has enabled the operators to select first match choice 80.8% percent of the time, select one of the other match choices 8.8% of the time, no word from the list 6.7% of the time, and the original OCR word 3.7% of the time. 8. DIAU: advantage Our expectation that the DIAU-based automated processes described in this report would substantially improve productivity is borne out by performance data. These results are calculated from instrumentation data recorded while the production system is operating. In Section 11 of our extended report,33 we present performance evaluation results, e.g., average time taken by automated and manual processes on a per citation basis; time taken to perform different functions in the scanning and edit operations; workload distributions; and other data in tabular and graphic forms. Here we show the value of applying document image analysis and understanding to the problem of bibliographic data extraction from digitized documents. The bar chart below indicates the labor needed to produce bibliographic records by three methods, at increasing levels of automation: (a) the completely manual keyboarding operation (conventionally done by almost all database producers), (b) the first generation MARS-1 system that used scanning and OCR for converting the abstract, with all other fields entered manually, and (c) the current (MARS-2) system that employs DIAU-based modules for data extraction. Instrumentation data is used to compute the figures for the MARS systems, and data as provided in contractor reports for the keyboarding operation. In Figure 8.1, the data is presented as the labor-minutes taken to produce 600 records, one day's workload. It may be seen that the MARS-2 system, as a consequence of its automated processes, needs less than 25% of the labor of the keyboarding operation.
9. Next tasks In addition to the next steps outlined in individual sections, we seek to initiate new projects: creating a ground truth database for research in document image analysis and understanding; techniques to extract bibliographic data automatically from online journals; and an alternative method for text verification. 9.1 Ground truth data: PathFinder 9.1.1 Introduction and objective Research in document image analysis is greatly dependent on ground truth data for the design, training and testing of algorithms for data identification and extraction. However, ground truth datasets and their associated analysis and visualization tools are usually created to analyze problems in specific applications and datatypes: skewed document images (Okun et al.),63 handwritten documents (Cha and Srihari),64 video sequences (Doermann and Mihalcik),65 statistical data (Swayne et al.),66 and speech signals (Barras et al.).67 Moreover, apart from the domain-specific nature of these datasets and tools, they are usually limited as to operating platforms and data formats, as described in an excellent taxonomy on this subject by Kanungo et al.68 To our knowledge no ground truth dataset exists that represents the corpus of biomedical journals, and none with the goal of extracting the text representing the bibliographic fields descriptive of the articles within these journals. In meeting this challenge, the main objective of the PathFinder (Public Archive To Help Find New Designs for Expert Ratiocination) project is to exploit the vast amounts of document images and OCR-converted and operator-verified data, collected in the MARS project, to aid in the development of innovative and efficient algorithms for automatic zoning, labeling and reformatting by the computer science and medical informatics communities. This is to be done by developing a ground truth database accessible via the Web. This ground truth data will include page, zone, line, word, and character level information. In addition to providing a public site for researchers worldwide to develop and test their algorithms, we propose a tool to enable them to graphically visualize the ground truth data and employ an automated analysis assistant. Code-named Rover (gROundtruth Vs. Engineered Results), this automated analysis assistant will compare the results of a researcher's program to the ground truth data. The ground truth and results data will be in XML, and Rover will be written in Java. The overall website development will use MacroMedia Dreamweaver UltradDev 4 to provide a rich interface and extensive database connectivity. 9.1.2 Design considerations Page layout. Identifying geometric features to design algorithms for automated data extraction is a non-trivial task since (as discussed earlier in this report) there is a variety of layout geometries in the 4,300 journal titles indexed in MEDLINE, though most follow the reading order paradigm of article title-author names-author affiliation-abstract. Ground truth data must therefore include samples of all significant layout types described in Section 5. Data format. XML for images, OCR-converted data and operator-verified data since XML excels in adaptation (to accommodate changes in data), maintenance, linking (from one piece of data to another), simplicity, and portability over networks, operating systems and development environments.72-74 Modifying existing data. Despite its undeniable value, the existing data has certain deficiencies. For example, although OCR output characters in error are corrected, their attributes (e.g., italics or bold) are not. But these attributes can serve as features to create algorithmic rules to correctly identify zones or labels. These will be corrected before inclusion in the ground truth dataset. 9.1.3 Rover: a visualization and analysis tool Once the data is available in XML format along with the original TIFF image, researchers need the functionality to:
The first two features are provided by TrueViz, described in a survey of visualization tools68. TrueViz is a public domain tool developed at the University of Maryland for visualizing, creating and editing ground truth and metadata. It is implemented in Java and works on Windows and Unix platforms (specifically, it has been successfully tested in the Windows 2000 and Sun Solaris 2.6 environments). It reads and stores ground truth and metadata in XML format, and reads the corresponding TIFF images. It has the ability to allow the user to inspect ground truth data at many levels, viz., at the page, zone, line, word, and character levels, and provides pertinent information at each level. For example, at the character level, such information includes the character code, font type and style, and bounding box (x1,y1,x2,y2) coordinates. TrueViz is a suitable platform to initiate the design of Rover. To serve as an effective analysis assistant, Rover will extend TrueViz to provide researchers the capability to compare their XML data to ground truth XML data graphically (Figure 9.1.1), and thereby help them iteratively refine their algorithms. In addition, Rover will provide statistics and visual presentations on specified areas. For example, the user would be able to use Rover to compare all characters in the dataset that are bold, and to enumerate errors. Rover would visually locate the mistakes and report statistics on the query. In this example it would report the number of bold characters, the number detected correctly, and the number detected incorrectly, both as absolute numbers and as percentages. Rover would also export this information to a database or spreadsheet for further analysis.
9.1.4 Ground truth distribution A website will provide access to the ground truth data, though it will go beyond serving as a simple data repository. Our objective is to encourage researchers to share and exchange ideas, as well as provide feedback to our development team for new desirable features. Figure 9.1.2 shows the organization of the website and how the elements interconnect. This section presents an overview of the website layout and functionality.
Product introduction. A Macromedia Quick Flash "movie" is displayed to the users, though this may be bypassed. We have already developed this movie and a demonstration of it is available at www.geocities.com/egalthan/Movie1.html. The metaphor used is a Sherlock Holmes style magnifying glass moving over a rare biomedical document. User login. On the main page of the website, users are asked for a name and password. This data will be stored in a SQL Server 2000 database along with the website data. We intend to use Macromedia Ultradev as the design tool, because it readily allows for password protection and database connectivity. While we do not anticipate restricting anyone from accessing this site, we intend to require all users to register. When registered users log in, they will be taken to the main user page. Unregistered users will be presented with a registration page before being allowed access. New user registration. The system will require first time users to enter a few important items of information that will enable us to provide security to the system as well as the ability to contact users with important new data or features added to the website in the future. The registration page will be developed using Ultradev which allows the developer a graphical design environment and provides rapid page development. Data access. Once logged in or registered, the user would have access to all the ground truth data and tools. Since the data collection will be quite large, the system will allow users to download the entire data set or any subset they choose. For example, some users may only be interested in certain types of journal layouts, such as double column abstracts. Data analysis. The users will be provided a link to launch Rover. While the initial version of this tool will possess the current functionality of TrueViz, the complete analysis support discussed earlier will be designed in a later phase. Bulletin board. This is for the user community to report bugs, and use as a forum for discussions related to algorithmic development. Here the users may also upload and download files, such as technical papers written, algorithm source code, and new ground truth data. 9.2 Data extraction from online journals 9.2.1 Background With the rapid expansion and utilization of the Internet and Web technologies, there is an increasing number of online biomedical journals, some of which are being indexed in MEDLINE. Online journals offer new opportunities in data mining, content extraction, creation of database records and related applications. However, there are new challenges also, and DIAU research is needed to capture, classify, analyze, extract, modify, and reformat Web-based document information for computer storage, access, and processing. We propose research leading to the design and development of an automated system, temporarily code-named WebMARS for Web-Based Medical Article Records System, to create citation records for MEDLINE. This system69 will download and classify document articles on the Web, parse and label contents of each article, extract and reformat the bibliographic fields to comply with MEDLINE conventions. 9.2.2 Project objectives The objectives are to research DIAU techniques leading to the design, development, evaluation and implemention of a practical system to automatically extract bibliographic data from online journals for the MEDLINE database. 9.2.3 Project significance With the increasing role of online journals in biomedical publishing, it is imperative that methods be developed to automatically extract bibliographic data with a minimum of human labor. Two advantages of such a Web-based document analysis and content extraction approach compared to either manual keyboard entry or the scanning/OCR approach are saving labor costs and improving performance (speed and accuracy). 9.2.4 Approach The functions in the proposed WebMARS system would be implemented by automated processes hosted by seven servers, and operators at two types of workstations. A high-level schematic is shown in Figure 9.2.1. Briefly, the WebMARS system will work as follows. An operator downloads an online journal from a publisher's website, and the PDF or HTML files of the articles are sent to the file server. The PDF to HTML files conversion server performs text conversion. The article zones creation and labeling server extracts text zones from the HTML files, and labels these zones as belonging to "title", "author", "affiliation", "abstract", "pagination", "databank accession number", "grant number", etc. The article zone contents modification server extracts, modifies, and reformats the text according to MEDLINE database conventions. The MEDLINE article zones creation and Web-based reconciling workstation selects and combines the appropriate portions of zone contents to create records that are viewed by operators for validation and reconciling. The reconciling workstation operator then checks, proofreads, and corrects any remaining problems in the records including character and symbol validation, citation reformatting, field label selection and confirmation. The SGML-based MEDLINE citation records creation server generates text zones directly from journal publisher SGML information, if available, that can be used to further improve labeling process. Finally, the MEDLINE citation records uploading server uploads the completed citation records to the NLM's DCMS database for later indexing. 9.2.5. Subsystems Each subsystem of the WebMARS system is described below. Web files collection and classification workstation. This workstation downloads and classifies Web-based files as abstract, full text, PDF or image files70. It consists of two processes: downloading and classification. The first is based on WinInet software tool and a combination of the Breadth First Search algorithm and the Constraint Satisfaction method to traverse the Web page's links, recognize and generate the successors of the downloading pages. The second process relies on the contents of the hyperlinks (name and address) to classify the files. Details appear in our extended report.33 PDF to HTML files conversion server. Occasionally, publishers offer certain Web journal articles as PDF files. Since WebMARS requires the HTML file format, this conversion server subsystem is used to convert the PDF files into HTML files. Article zones creation and labeling server. To parse and create text zones for each journal article based on its HTML file, and then label these text zones. The subsystem consists of three modules: the automated zoning module, the automated labeling module, and the automated updating module, discussed below. The automated zoning module relies on HTML tags to parse and then to create text zones for each article of a journal issue. The automated labeling module labels these text zones using a combination of statistics and fuzzy rule-based technology. Statistics are used to generate membership functions for the fuzzy rules-based method. Features and fuzzy rules derived for the automated labeling module are based on an analysis of the HTML layouts of journals. Both geometric and non-geometric features are considered here. Geometric features of a zone are based on location and order of appearance. Non-geometric features are derived from contents of zone, statistics, and font characteristics. Since text zones are characterized by the words they contain, word matching is an important operation in this module. For example, a zone has a higher probability of being labeled as "affiliation" when it contains words related to country, city, and school names. Fourteen word lists have been created, and the ternary search tree algorithm53 is used as a search engine for the word matching. The automated updating algorithm performs a difference analysis between text zones generated by this subsystem and corresponding text zones corrected by operators during the reconcile stage. Statistical labeled text zone information obtained from this analysis can be used to improve the labeling accuracy. Article zone contents modification server. This subsystem extracts, modifies, and reformats text in labeled zones according to MEDLINE database conventions. It consists of two modules: the label cleanup module and the cleanup coach module. The first prepares bibliographic record information for the reconcile operator, while the second operates on post-reconcile data to collect statistics in order to develop new rules to improve the performance of the first module. Details appear in our extended report. MEDLINE article zones creation and Web-based reconciling workstation. This workstation selects and combines the appropriate portions of zone contents to create MEDLINE citation records for text verification. MEDLINE citation records uploading server. This subsystem creates an XML file based on NLMCommon DTD71 for completed citation records of a journal issue and uploads the file to the NLM DCMS database for later indexing. SGML-based MEDLINE citation records creation server. To exploit publisher supplied information for zoning and to improve the labeling process. 9.2.6 Summary We propose research here to design a system that will automatically extract bibliographic records directly from online journals. A prototype has been evaluated using a sample of 28 medical journal Web sites. Preliminary results show the feasibility of our design to create MEDLINE citations directly and effectively using Web document pages.
9.3 Alternative method for text verification The conventional approach to verifying the text output of any OCR system, or a DIAU-based system such as MARS, is to present the text in the same sequence as it appears on the printed page, and to highlight the low confidence characters (in color) in the text words. The operator then "tabs" quickly from one suspected character to the next and make the necessary corrections. This conventional approach has some drawbacks. For example, the operator must detect the suspected character surrounded by a mass of correct text. Also, the text must be corrected as encountered, thereby breaking the rhythm of identifying incorrect characters. An alternative method is proposed that may prove to improve operator productivity. Called Carpet Character Certification and Correction, or the "Carpet" method, it involves grouping like characters (drawn from a number of pages or journal issues at the same time) and displaying them together in groups in a single window, as shown in Figure 9.3.1. Each character appears in its "edit box." Only low confidence A - Z and 0 - 9 characters, of the same type, would be displayed in groups. The example shows a set of characters in the edit boxes, numeric 1's. A "5" in one of the bitmapped images above the boxes has been misread as a "1" as shown. Since context is important to detect poorly captured character shapes, the system must provide the display of the image fragment (a word or phrase) that provides the context in which the (presumably) incorrect character appears, as is shown. Such context will particularly help distinguish letters or numbers that appear similar, e.g., 1, I or 0,O. The GUI for the Carpet system must possess the following functions:
We propose to develop the Carpet system using Visual C++ with the Kodak Image libraries. Then we will conduct a performance study using this system, and measure the residual error rate and the time taken for correcting and verifying all the characters from a complete journal issue at a time. Comparisons will be made to conventional text verification methods.
10. References 1. Thoma GR, Walker FL, Hauser SE. Biomedical document preservation by electronic imaging. In: Converting Information for WORM Optical Storage, Roth JP, ed., Meckler, 1990; 92-131. 2. Hauser SE, Thoma GR, Gass SI. Factors affecting the conversion rate of bound volumes to electronic form. Proc. Electronic Imaging '89, Boston MA, 1989; 1041-6. 3. Thoma GR, Hauser SE, Walker FL. Managing an archive of electronic document images. Proc. ASIS, vol. 26, 1989; 59-65. 4. Thoma GR, Cookson JP, Walker FL, Hauser SE. Interfacing optical disks to a document image storage and retrieval system. J. Imaging Technology, vol. 12, no. 5, October 1986; 288-92. 5. Thoma GR, Long LR, Berman LE. Design issues for a digital x-ray archive accessed over Internet. Proc. SPIE: Storage and Retrieval for Image and Video Databases II, vol. 2185; San Jose CA: 1994; 129-38. 6. Hauser SE, Hsu W, Thoma GR. Request routing with a back error propagation network. Proc. SPIE: Applications of Artificial Neural Networks IV, vol. 1965; Orlando FL: 1993; 689-95. 7. Thoma GR, Hauser SE, Le DX, Muller D, Walker FL. WILL: Advances in the management of interlibrary loan. Vol. 1. Internal technical report. Bethesda, MD: National Library of Medicine, Lister Hill National Center for Biomedical Communications, Communications Engineering Branch, September 1995; 26 pp. 8. Thoma GR, Hauser SE, Le DX, Muller D, Walker FL. WILL: Design of a standalone WILL unit. Vol. 2. Internal technical report. Bethesda, MD: National Library of Medicine, Lister Hill National Center for Biomedical Communications, Communications Engineering Branch, September 1995; 27 pp. 9. Walker F, Thoma GR. DocView: Providing access to printed literature through the Internet. Proc. 10th Integrated Online Library Systems Meeting, New York: Learned Information, Inc. May 1995; 165-73. 10. Thoma GR, Walker FL. Access to document images over the Internet. In: Rada R, Ghaoui C, eds. Medical Multimedia; Oxford UK: Intellect, 1995; 179-93. 11. Walker FL, Thoma GR. Multimode delivery of multimedia information. Proc. 14th National Conference on Integrated Online Library Systems, Medford NJ: Information Today Inc.; May 1999; 141-52. 12. Walker FL, Thoma GR. Access to document images over the Internet. Proc. 9th Integrated Online Library Systems Meeting, New York. May 1994; 185-197. 13. Le DX, Thoma GR, Wechsler H. Document image analysis using integrated image and neural processing. Proc. 3rd ICDAR'95, Montreal, Canada; Aug 1995; Vol. I, 327-30. 14. Le DX, Thoma GR, Wechsler H. Classification of binary document images into textual or non-textual data blocks using neural network models. Machine Vision and Applications, Issue 8, 1995, 289-304. 15. Le DX, Thoma GR, Wechsler H. Automated page orientation and skew angle detection for binary document images. Pattern Recognition, Vol. 27, No. 10, October 1994; 1325-44. 16. Le DX, Thoma GR. Automated portrait/landscape mode detection on a binary image. Proc. SPIE Symposium on Aerospace and Remote Sensing - Visual Information Processing II, Vol. 1961, Orlando FL, April 1993, 202-12. 17. Hauser SE, Cookson TJ, Thoma GR. Using back error propagation networks for automatic document image classification. Proc. SPIE Applications of Artificial Neural Networks IV, Vol. 1965, Orlando FL, April 1993, 142-50. 18. Le DX, Thoma GR. Document skew angle detection algorithm. Proc. SPIE Symposium on Aerospace and Remote Sensing - Visual Information Processing II, Vol. 1961, Orlando FL, April 1993, 251-62. 19. Le DX, Thoma GR, Wechsler H. Document classification using connectionist models. Proc. IEEE International Conference on Neural Networks, Orlando FL, June 1994, vol. 5; 3009-14. 20. Handley JC. Table analysis for multi-line cell identification. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 34-43. 21. Hu J, Kashi R, Lopresti D, Wilfong G. Table structure recognition and its evaluation. Proc. SPIE: Document Recognition and Retrieval VIII, Vol. 4307, San Jose CA, January 2001, 44-55. 11. Questions for the Board 1. For continued research toward automated extraction of fields not handled by our current techniques, especially in journals that exhibit irregular layouts, on which of these paradigms would the Board suggest we focus: artificial neural networks (back propagation, radial basis functions, or others), advanced template matching, decision trees, Bayesian networks or others? 2. To the Board's knowledge are there commercial OCR systems, classifiers or OCR toolkits that have been reported to reliably capture (correction detection > 95%, false alarm rate < 0.1%) biomedical symbols and Greek letters? 3. Is the ground truth data we propose to make available to the computer science and informatics research communities likely to be useful? 4. Does the Board believe there are other applications of document image analysis and understanding techniques useful to NLM or the biomedical community? | |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
CEB Home | CEB Projects | Related Work | Publications |
Repositories | NHANES | Site Index
URL: http://archive.nlm.nih.gov/pubs/reports/mars2001-DIAU/mars2001-DIAU.php
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||