Automating the production of bibliographic records for MEDLINE

5. Automated zoning

The first step after the OCR system converts the bitmapped image is to apply image analysis techniques to block out ("zone") the regions of contiguous text, in particular the text groups corresponding to the bibliographic fields of highest interest: article title, authors, affiliation, and abstract. Our survey of the research literature on applying image analysis to automated zoning appears in Section 5.1.

The commercial OCR system used in the MARS system includes a package to perform automatic zoning. However, our experiments showed that this function does not segment the page images into zones containing the bibliographic fields of interest with sufficient accuracy. The most common error made by the commercial automatic zoning function is that zones are too large and include more than one significant text group. Figure 5.1 illustrates a typical case where the title, author, abstract and affiliation are all in one zone, along with extraneous publication identification. Figure 5.2 illustrates another case where, in addition to the previous problem, a two-column abstract is grouped inappropriately into a single zone. In this example, the text lines in the two columns are joined, disrupting the proper reading order. For example, the middle text of the first line of the zone is incorrectly read as "..models have opment of..."

Correct zones are critical to downstream processes in MARS-2. The stage that follows automated zoning is the automated labeling of the zones as title, authors, affiliation and abstract [22]. This complex labeling process uses several items of information in each zone to determine its identity. The information used to label a zone includes the absolute and relative location of the zone and key words within the zone. Clearly, the zone region must be correct if it is to provide useful information to the labeling program.
Downstream from automatic zoning and labeling, the title, author and affiliation fields are automatically reformatted to comply with MEDLINE conventions [23]. This process also depends on correctly sized and labeled zones to be effective. Incorrect zones confound reformatting and ultimately require time-consuming manual intervention at the reconcile stage, offsetting the advantage expected from an automated system.

An alternative to automatic zoning is to require operators to manually draw correct zones onto the bitmapped images, using special software and the mouse, prior to the OCR process. This was done in the MARS-1 system to identify the title and abstract zones. It took operators about 14 seconds per image to draw these two zones. For the four zones needed in MARS-2, we estimate that manual zoning would require about 28 seconds of operator time per image, a considerable burden. For example, at a production rate of 1,000 records a day, manual zoning would add over 7 person-hours of labor, approximately equivalent to an additional full-time worker.

Since we cannot depend on the commercial OCR system to zone images correctly, and we seek to eliminate manual zoning, we developed our own automatic zoning capability. With our own process, we no longer depend on the commercial OCR system for automatic zoning, and we can tailor the zoning program's design and operating parameters to images from the specific biomedical journals relevant to MEDLINE. However, rather than starting from scratch, we combine the automated zoning capability of the OCR system with our added functionality for zone correction.
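The manual-zoning labor estimate quoted above can be reproduced with a few lines of arithmetic. This is only a sketch: it assumes that drawing time scales linearly with the number of zones and that each record corresponds to one page image, neither of which is stated explicitly in the text. The variable names are illustrative.

```python
# Figures taken from the text: about 14 seconds to draw two zones (MARS-1),
# four zones needed in MARS-2, and 1,000 records per day.
SECONDS_FOR_TWO_ZONES = 14
ZONES_NEEDED = 4
RECORDS_PER_DAY = 1000

# Assumption: drawing time is proportional to the number of zones.
seconds_per_zone = SECONDS_FOR_TWO_ZONES / 2
seconds_per_image = seconds_per_zone * ZONES_NEEDED

# Assumption: one page image is zoned per record.
hours_per_day = seconds_per_image * RECORDS_PER_DAY / 3600

print(f"{seconds_per_image:.0f} s per image, {hours_per_day:.2f} person-hours per day")
```

Under these assumptions the result is 28 seconds per image and a little under 8 person-hours per day, consistent with the "over 7 person-hours" figure.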
5.1 Methods and procedures

Much of the research reported in the literature employs methods analogous to those used to isolate and separate characters (symbol isolation) to segment page images into zones. A brief survey of activity in automatic zoning methods is given in Jain [24]. Approaches include "top-down" methods [25], which segment a page by x-cuts and y-cuts into smaller regions; "bottom-up" methods [26,27], which recursively grow homogeneous regions from small components; and combinations of both [24,28]. Tradeoff factors include granularity (finding small enough zones), computation time, and sensitivity to input parameters such as noise, skew and page orientation [29-31]. Top-down methods tend to be faster and less sensitive to input parameters and page orientation, but require pages to have a "Manhattan layout", i.e., one whose blocks can be separated by vertical and horizontal lines. Bottom-up and combination methods often achieve greater accuracy at the expense of computational complexity and sensitivity to input parameters. All of these methods zone the page using image data alone, prior to OCR conversion.

Because the reported performance is variable and rich secondary data is available from our OCR system, our approach, in contrast, is to exploit the output data of the OCR system to implement automatic zoning. As noted earlier, in addition to ASCII text, the OCR system provides information about each of the converted characters in the output file. This information includes the level of confidence that the character was correctly recognized, character attributes such as italic or bold, character point size, and the x and y coordinates of the rectangle that bounds the character (the bounding box) [32]. Thus we have both geometric and non-geometric feature information available for each converted character. Our approach is to draw upon these features to group text into correct zones.
For example, we use the bounding box coordinates to determine which characters are grouped closely in the same region on the page. Information on character size and attributes provides additional clues for keeping groups of adjacent characters together or placing them in separate zones.

Our zone correction method applies both top-down and bottom-up design strategies [41], which other investigators have used on image data, to our OCR output (non-image) data. The overall procedure is outlined in Table 5.1, and an example is given in Figure 5.3.
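As a rough illustration of the per-character data described above, the following Python sketch models one OCR output record and derives a bounding box and two summary features for a group of characters. The class and field names (`Box`, `OcrChar`, `confidence`, and so on) are hypothetical and do not reflect the OCR system's actual output format; the coordinate convention (y increasing downward, as in image coordinates) is also an assumption.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box:
    """Axis-aligned rectangle; y is assumed to increase downward."""
    left: float
    top: float
    right: float
    bottom: float

    @property
    def height(self) -> float:
        return self.bottom - self.top

    def union(self, other: "Box") -> "Box":
        """Smallest rectangle that covers both boxes."""
        return Box(min(self.left, other.left), min(self.top, other.top),
                   max(self.right, other.right), max(self.bottom, other.bottom))


@dataclass(frozen=True)
class OcrChar:
    """One converted character with the kinds of data the text lists (names hypothetical)."""
    text: str
    confidence: float   # recognition confidence
    italic: bool
    bold: bool
    point_size: float
    box: Box


def bounding_box(chars: List[OcrChar]) -> Box:
    """Geometric feature: the rectangle enclosing a non-empty group of characters."""
    result = chars[0].box
    for ch in chars[1:]:
        result = result.union(ch.box)
    return result


def group_features(chars: List[OcrChar]) -> dict:
    """Non-geometric and derived features for a non-empty group of characters."""
    n = len(chars)
    return {
        "percent_italic": 100.0 * sum(ch.italic for ch in chars) / n,
        "avg_height": sum(ch.box.height for ch in chars) / n,
    }


chars = [
    OcrChar("A", 0.9, True, False, 10.0, Box(0, 0, 5, 10)),
    OcrChar("b", 0.8, False, False, 10.0, Box(6, 2, 10, 10)),
]
print(bounding_box(chars), group_features(chars))
```

Keeping the geometry in a small immutable `Box` type makes it straightforward to compare gaps and alignments between groups of characters without touching the image itself.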
The first step in creating new zones is to disassemble the original zones from the OCR system. Each zone is divided into individual text lines.

In step 2, lines are further split horizontally into multiple lines wherever the space between words exceeds an empirically determined distance threshold. This occasionally splits lines unnecessarily, but it is needed in order to split lines that originally span two closely spaced columns, as shown in Figure 5.3. Some of these lines will be rejoined in later steps. The bounding box enclosing each line is computed, as are several features such as the percentage of italic characters and the average character height. Some character features, such as bold or italic, are available directly from the OCR output data. Others, such as character height or case (upper or lower), are computed from the OCR output data.

In step 3, we combine the lines vertically into initial zones. The criteria for combining are that (a) the vertical distance between lines must be less than a threshold (again, empirically determined); (b) either the left edge, right edge or midpoint must be horizontally aligned; and (c) the features computed in the previous step must be similar. When a line is added to a zone, the zone's rectangular boundary is expanded to include the new line. Then all remaining lines are checked to see if they fall within the new zone. If so, they are added to the zone. Many of the horizontally split lines are recombined in this way.

On rare occasions, some zones created in step 3 are too narrow. In this event, the fourth and last step is to combine such zones horizontally using criteria similar to those in the previous step. Here, the initial zones are combined if (a) the horizontal distance between the zones is less than a threshold; (b) either the top or bottom edges of the zones are vertically aligned; and (c) the computed features of the two zones are similar.
When zones are thus merged, a new zone boundary rectangle is created to include both zones. Any smaller zones that fall within the rectangle are subsumed within this zone.

Figures 5.4 and 5.5 show the results of these steps applied to the two images used as examples in Figures 5.1 and 5.2. In both of these images, the title, author, affiliation and abstract are enclosed in separate zones, as required. In addition, in Figure 5.5, the two columns of the abstract are in separate zones. These two zones will be identified as abstract by the automated labeling process, which follows the zone correction process, and the enclosed text will be organized in the proper reading order.
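The split, combine, and merge steps described in this section can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the MARS-2 implementation: the `Region` type, every threshold and tolerance value, and the feature-similarity test are hypothetical; a split line simply inherits the features of its first word; and the horizontal merge is applied to all zones rather than only to those judged too narrow.

```python
from dataclasses import dataclass, replace


@dataclass
class Region:
    """Rectangle (y grows downward) plus two features; all names are hypothetical."""
    left: float
    top: float
    right: float
    bottom: float
    italic_pct: float = 0.0
    char_height: float = 10.0

    def grow(self, other):
        """Expand this rectangle in place so that it also covers `other`."""
        self.left, self.top = min(self.left, other.left), min(self.top, other.top)
        self.right, self.bottom = max(self.right, other.right), max(self.bottom, other.bottom)

    def contains(self, other):
        return (self.left <= other.left and self.top <= other.top
                and self.right >= other.right and self.bottom >= other.bottom)


def similar(a, b, italic_tol=30.0, height_tol=2.0):
    """Criterion (c): features must be similar (tolerances are assumed values)."""
    return (abs(a.italic_pct - b.italic_pct) <= italic_tol
            and abs(a.char_height - b.char_height) <= height_tol)


def close(x, y, tol=5.0):
    return abs(x - y) <= tol


def split_line(words, gap=15.0):
    """Step 2: cut a line wherever the space between adjacent words exceeds `gap`."""
    parts = []
    for word in sorted(words, key=lambda r: r.left):
        if parts and word.left - parts[-1].right <= gap:
            parts[-1].grow(word)
        else:
            parts.append(replace(word))  # copy, so the input is not mutated
    return parts


def combine_vertically(lines, v_gap=8.0):
    """Step 3: grow zones downward from lines that are near, aligned and similar."""
    pending = sorted((replace(l) for l in lines), key=lambda r: (r.top, r.left))
    zones = []
    while pending:
        zone = pending.pop(0)
        grew = True
        while grew:  # re-scan after each expansion of the zone
            grew = False
            for line in list(pending):
                near = abs(line.top - zone.bottom) < v_gap
                aligned = (close(zone.left, line.left) or close(zone.right, line.right)
                           or close((zone.left + zone.right) / 2,
                                    (line.left + line.right) / 2))
                if zone.contains(line) or (near and aligned and similar(zone, line)):
                    zone.grow(line)
                    pending.remove(line)
                    grew = True
        zones.append(zone)
    return zones


def merge_horizontally(zones, h_gap=15.0):
    """Step 4: merge side-by-side zones and drop zones subsumed by the result."""
    zones = [replace(z) for z in zones]
    merged = True
    while merged:
        merged = False
        for a in zones:
            for b in zones:
                near = b is not a and 0 <= b.left - a.right < h_gap
                aligned = close(a.top, b.top) or close(a.bottom, b.bottom)
                if near and aligned and similar(a, b):
                    a.grow(b)
                    zones = [z for z in zones if z is a or not a.contains(z)]
                    merged = True
                    break
            if merged:
                break
    return zones


# Three text lines that each span two columns separated by a 20-unit gutter.
def demo_line(top):
    return [Region(0, top, 40, top + 10), Region(45, top, 90, top + 10),
            Region(110, top, 150, top + 10), Region(155, top, 200, top + 10)]

parts = [p for top in (0, 12, 24) for p in split_line(demo_line(top))]
zones = combine_vertically(parts)
print(zones)
```

In this toy layout the gutter is wider than the split threshold, so each line is cut in two and the halves are regrouped column by column. Calling `merge_horizontally(zones)` with the default threshold leaves the two columns separate, while a larger `h_gap` would join them, which shows how sensitive the outcome is to the empirically chosen thresholds.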
URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_5.php