Dharitri Misra, Siyuan Chen, George R. Thoma
National Library of Medicine, Bethesda, Maryland
One of the most expensive aspects of archiving digital documents is the manual acquisition of context-sensitive metadata useful for the subsequent discovery of, and access to, the archived items. For certain types of textual documents, such as journal articles, pamphlets, and official government records, where the metadata is contained within the body of the documents, a cost-effective method is to identify and extract the metadata in an automated way, applying machine learning and string pattern search techniques.
At the U. S. National Library of Medicine (NLM) we have developed an automated metadata extraction (AME) system that employs layout classification and recognition models with a metadata pattern search model for a text corpus with structured or semi-structured information. A combination of Support Vector Machine and Hidden Markov Model is used to create the layout recognition models from a training set of the corpus, following which a rule-based metadata search model is used to extract the embedded metadata by analyzing the string patterns within and surrounding each field in the recognized layouts.
In this paper, we describe the design of our AME system, with focus on the metadata search model. We present the extraction results for a historic collection from the Food and Drug Administration, and outline how the system may be adapted for similar collections. Finally, we discuss some ongoing enhancements to our AME system.
An effective technique for extraction of metadata from homogeneous digitized collections, or heterogeneous collections with a small number of text layouts, is to automate the process through machine learning techniques by developing classification models for individual layouts. From the contents of a classified and segmented document, metadata is extracted by searching for designated string patterns using different techniques [1][2]. This metadata may then be used for discovering records of interest through standard text search, or by browsing/searching individual metadata fields, after the collection is archived.
Several well-known learning models, such as Naïve Bayes (NB), the Support Vector Machine (SVM), and the Hidden Markov Model (HMM) [3][4], are used to recognize document layouts by classifying the text lines in individual pages of a collection. The model parameters are trained on a set in which the individual lines have been classified manually, and the procedure can be applied to different collections by providing corresponding training sets.
The procedure to retrieve individual metadata fields from the segmented text, generated using the layout models, is not always simple. Often, search patterns need to be identified and applied using programmed instructions to locate and extract the metadata fields for each collection [5]. For complex cases, this is not only cumbersome but also error-prone, requiring a high degree of manual intervention and frequent program updates. Search procedures for individual collections are difficult to generalize across collections, and programmatic approaches may not easily accommodate new search criteria. A generalized rule-based search technique, adaptable for individual collections, offers a promising alternative for classifiable documents.
As part of a Digital Preservation project, we have developed an Automated Metadata Extraction (AME) system to extract metadata from digitized collections by combining the automated layout classification technique with a rule-based search technique. We generate recognition models for different document layouts based on a combination of Support Vector Machine and Hidden Markov Model [6]; and then use a metadata search model with encoded search/extraction rules to retrieve the field values from various text segments (page header, item header, item body, etc.) identified by the recognition models.
The metadata search model is initially created by incorporating a set of search rules determined by manual analysis of sample document pages. This model is improved iteratively by applying it to extract metadata from sample batches and then observing the outputs, which include a list of the metadata fields that could not be identified for each item in a batch.
In order to accommodate complex cases that cannot be easily expressed using our search rules, the AME system allows incorporating specialized logic as a supplement to rule-based extraction. In addition, it allows for collection-level postprocessing of metadata. Finally, a GUI is provided for review and manual correction of extracted metadata.
In the following sections of this paper, we provide a description of our AME system, its workflow, and details of the metadata search model. We then present the results of metadata extraction for a collection of documents from the Food and Drug Administration. Finally, we discuss the customization of the model for different collections, and ongoing enhancements to the AME system.
Our Java-based Automated Metadata Extraction system may be used in a stand-alone mode to extract, review, and store metadata in XML format from the OCR'ed text of a scanned document collection. It may also be used as a library (JAR file) integrated with an application, in which case the metadata for extracted items is provided in binary format through Java API calls. In the following sections, we discuss the stand-alone AME system.
The "OCR output" mentioned in the following sections refers to character-level features (such as geometric coordinates and font attributes) generated by the FineReader 8.0 OCR engine [7] from the scanned TIFF images of a corpus, and later formatted as a text file by a separate AME application. An "item" refers to an entity, such as an article or an official record, which may be archived and accessed as a unit.
The AME System consists of four main components, whose functions are described below: the Metadata Extractor, which identifies the individual items in a submitted batch and coordinates their processing; the Metadata Search Engine, which applies the metadata search model to retrieve field values from the segmented text; the Text Editor, which corrects frequently misinterpreted OCR output using collection-specific substitution patterns; and the metadata review GUI, through which an operator validates and corrects the extracted metadata.
Metadata extraction for a corpus starts after the metadata recognition models are generated and an optimal metadata search model is created for that corpus. The OCR'ed pages, along with the original TIFF files, are submitted to the Metadata Extractor to identify individual items in the set, and then to extract their metadata. Each submission may consist of a few to several hundred sequential pages. The metadata extraction workflow for a submitted batch is illustrated in Figure 1.
Figure 1. Metadata Extraction Processing Flow
Individual items within the batch are identified by classifying the text lines using the metadata recognition models. The textual data corresponding to each identified item is first corrected automatically for OCR errors, and then submitted to the Metadata Search Engine. A post-processing step is performed to apply any collection-specific logic or other cleanup for the extracted values. Associated information, such as the item's text (ranging from part of a page to several pages), and the extraction statistics for the batch are also output at this point. Metadata review/validation may be performed as an optional manual step after all items are extracted, or the data may be stored on the disk for later review.
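A minimal sketch of this workflow is given below; all type and method names are illustrative stand-ins rather than the actual AME classes.

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch of the Figure 1 workflow; the class and method
// names here are assumptions, not the actual AME implementation.
public class BatchWorkflowSketch {

    /** Processes one submitted batch of OCR'ed pages plus TIFF images. */
    List<Map<String, List<String>>> processBatch(List<String> ocrPages) {
        List<Map<String, List<String>>> results = new ArrayList<>();
        // 1. Classify text lines with the layout recognition models to
        //    delimit the individual items within the batch.
        for (String itemText : identifyItems(ocrPages)) {
            // 2. Automatically correct frequent OCR misinterpretations.
            String corrected = correctOcrErrors(itemText);
            // 3. Apply the metadata search model to retrieve field values.
            Map<String, List<String>> metadata = searchMetadata(corrected);
            // 4. Apply collection-specific post-processing and cleanup.
            postProcess(metadata);
            results.add(metadata);
        }
        return results;  // reviewed manually, or stored on disk for later review
    }

    List<String> identifyItems(List<String> pages) { return pages; }  // stub
    String correctOcrErrors(String text) { return text; }             // stub
    Map<String, List<String>> searchMetadata(String text) { return new LinkedHashMap<>(); }  // stub
    void postProcess(Map<String, List<String>> metadata) { }          // stub
}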
The structure of the Metadata Search model is shown in Figure 2. The components, in reverse hierarchy, are: SearchPattern, SearchRule, and ExtractionRule, each of which is specified as a node in the XML document and instantiated as a Java object during processing. Each ExtractionRule applies to a specific metadata field in one or more layouts of the collection. A SearchPattern is an abstract class (shown as a dotted box); its derived classes are TaggedSearchPattern, LineClassSearchPattern, TextSearchPattern, and DelimitedSearchPattern, each of which corresponds to a specific "search type" attribute.
It may be noted that each SearchRule encapsulates only one SearchPattern, whereas an ExtractionRule may contain several SearchRules, which are processed according to their specified search order in the model.
Figure 2. Metadata Search Model
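The model components map naturally onto a small Java class hierarchy. The sketch below follows the names used in the paper; the fields shown are illustrative assumptions rather than the exact AME implementation.

import java.util.List;

// One concrete subclass per "search type" attribute value.
abstract class SearchPattern {
    String section;                                      // text segment to search (page header, item body, ...)
    List<String> cueWords, beginPatterns, endPatterns;   // Regular Expressions (Java String patterns)
}
class TaggedSearchPattern    extends SearchPattern { }
class LineClassSearchPattern extends SearchPattern { }
class TextSearchPattern      extends SearchPattern { }
class DelimitedSearchPattern extends SearchPattern { }

// A SearchRule encapsulates exactly one SearchPattern.
class SearchRule {
    int searchOrder;          // position in the rule's search sequence
    SearchPattern pattern;
}

// An ExtractionRule applies to one metadata field in one or more layouts.
class ExtractionRule {
    String field;
    List<String> layouts;
    List<SearchRule> searchRules;  // tried in searchOrder until one matches
}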
The XML structure of an ExtractionRule (with a single SearchRule and SearchPattern) is shown in the boxed text below. Uppercase notations of the attribute values indicate string or numeric constants, which are used by the Metadata Search Engine. The value "FIELD", for example, refers to one of the known metadata fields for the corpus.
Table 1 shows the general attributes of a SearchPattern (shown as node attributes or child nodes in the XML representation) that are applied in locating and retrieving the metadata field in the specified section of the text. Note that the pattern assigned to a string refers to a Regular Expression represented as a Java String pattern [8].
Extraction Rule Structure
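A minimal illustrative sketch of this structure is given below; element and attribute names beyond those discussed in the text (the field, search order, search type, and section attributes, and the CueWord, BeginPatterns, and EndPatterns child nodes) are assumptions rather than the exact AME schema.

<ExtractionRule field="FIELD" layout="LAYOUT">
  <SearchRule searchOrder="1">
    <SearchPattern searchType="TEXT_SEARCH" section="SECTION">
      <CueWord>PATTERN</CueWord>
      <BeginPatterns>PATTERN</BeginPatterns>
      <EndPatterns>PATTERN</EndPatterns>
    </SearchPattern>
  </SearchRule>
</ExtractionRule>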
Table 1 – General Attributes of a SearchPattern
It may not always be practical to determine all the search rules and patterns for each metadata field before metadata extraction starts. Hence, the AME system enables an operator to update the search model through analysis of the missing fields in the extracted items using a graphical display window. A visual comparison of the scanned image text, the OCR-interpreted text, and the metadata values of an item, as shown in Figure 3, reveals missing search patterns that should be added to the model, as well as additional OCR correction patterns. The manually updated model may be re-used for the same set of documents until most errors are removed; final corrections to the metadata may be applied manually if necessary.
Figure 3. Metadata display and correction screen
The validated metadata records are then stored as the ground truth for those pages for regression testing, and may also be made available to archive the corresponding items. This iterative process is shown graphically in Figure 4.
Figure 4. Search model update from analysis of missing field values
Characters generated by OCR from scanned TIFF images are often recognized incorrectly, especially for older documents. If such an error occurs within text targeted by a search pattern, the search either fails or produces unreliable results. The AME system presently addresses this with a collection-specific Text Editor, which replaces frequently misinterpreted search words and patterns with their actual values, using built-in substitution patterns. The corrections may be customized to the level of individual metadata fields. (For example, the OCR output "III." may represent the abbreviation for the state of Illinois, the number 111, or an error, depending on whether it occurs in a location field, a numerical field, or neither.)
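As a minimal sketch, a field-aware correction step might be implemented with a substitution table of compiled patterns, as below; the substitutions shown are illustrative, not the actual collection-specific correction set.

import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

// Illustrative field-aware OCR correction by pattern substitution.
public class OcrCorrectionSketch {
    // Substitutions applied only when the text feeds a location field.
    private static final Map<Pattern, String> LOCATION_FIXES = new LinkedHashMap<>();
    static {
        LOCATION_FIXES.put(Pattern.compile("\\bIII\\."), "Ill.");  // misread Illinois abbreviation
        LOCATION_FIXES.put(Pattern.compile("\\bI11\\b"), "Ill");   // digit/letter confusion
    }

    static String correctLocationText(String text) {
        for (Map.Entry<Pattern, String> fix : LOCATION_FIXES.entrySet()) {
            text = fix.getKey().matcher(text).replaceAll(fix.getValue());
        }
        return text;
    }

    public static void main(String[] args) {
        System.out.println(correctLocationText("Seized at Chicago, III."));  // -> Seized at Chicago, Ill.
    }
}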
NLM has acquired a collection of medico-legal documents from the Food and Drug Administration, referred to as FDA Notices of Judgment (or FDANJ for short). It consists of about 70,000 published notices of judgment (NJs) from court cases involving products seized under authority of the 1906 Pure Food and Drug Act. The NJs are resources in themselves, but also lead users to the over 2,000-linear-foot collection of evidence files used to prosecute each case. Our goal is to create a digital library for browsing the collection as well as searching the collection's metadata and full text.
The FDANJ collection comprises more than 41,000 pages, grouped under four categories: Foods and Drugs (FDNJ), Drugs and Devices (DDNJ), Foods (FFNJ), and Cosmetics (CSNJ), with approximately 15,000, 22,000, 4,600, and 150 document pages, respectively. These documents, published between 1906 and 1964, vary not only in their layouts but also in their style and level of detail within each category; for example, an NJ may span four or five lines in one set but tens of pages in another. Figure 5 shows the four typical layout styles exhibited in these documents.
Figure 5. Typical layout styles of FDANJ documents
There are eleven metadata fields, shown in column 2 of Table 2, that are to be extracted from each NJ. Fields 5 and 7-11 are multi-valued; that is, they may occur more than once within a text segment. Depending upon the style, certain metadata fields may not be present in an NJ.
Table 2 also shows the number of extraction rules and search rules identified for individual fields of this collection.
Table 2 – Metadata Fields with Number of Extraction and Search Rules
Note that a single ExtractionRule may apply to more than one layout; also, the effective number of search patterns in a SearchRule may be more than one, since a SearchPattern object may specify multiple Regular Expressions in its CueWord, BeginPatterns, and EndPatterns attributes.
A search for a metadata field is conducted in the sequential order of its SearchRules, until a match is found by a rule's SearchPattern. For multivalued fields, all values are extracted using the same search pattern.
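In outline, the procedure resembles the following sketch, in which each SearchRule is reduced to a single compiled Regular Expression for simplicity; the actual SearchPatterns combine the CueWord, BeginPatterns, and EndPatterns attributes.

import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the sequential search: rules are tried in their specified
// order, and multivalued fields collect every match of the first rule
// whose pattern succeeds.
public class FieldSearchSketch {

    static List<String> extractField(String text, List<Pattern> orderedRules,
                                     boolean multiValued) {
        List<String> values = new ArrayList<>();
        for (Pattern rule : orderedRules) {        // sequential order of SearchRules
            Matcher m = rule.matcher(text);
            while (m.find()) {
                values.add(m.group(1).trim());     // group 1 holds the field value
                if (!multiValued) return values;   // single-valued: stop at first match
            }
            if (!values.isEmpty()) return values;  // first matching rule wins
        }
        return values;                             // empty list: field reported as missing
    }
}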
The following are typical examples of begin and end patterns used in finding the Defendant Names (where patternOfDate and monthPattern are String constants for date and month, and "\\W" refers to any non-word character in a Regular Expression).
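Since the exact patterns are collection-specific, the constants below are illustrative stand-ins that show only their general shape; the definitions of monthPattern and patternOfDate given here are assumptions, not the AME originals.

// Illustrative stand-ins for the Defendant Names begin/end patterns.
public class DefendantNamePatterns {
    static final String monthPattern =
            "(?:January|February|March|April|May|June|July|August"
          + "|September|October|November|December)";
    static final String patternOfDate =
            monthPattern + "\\s+\\d{1,2},\\s*\\d{4}";        // e.g., June 3, 1931

    // A begin pattern anchoring the name after the filing date, and an
    // end pattern stopping at a non-word character before a known cue:
    static final String beginPattern = patternOfDate + ",?\\W+against\\W+";
    static final String endPattern   = "\\W+(?:alias|trading\\s+as)\\b";
}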
We processed more than 1,200 document pages exhibiting all four layouts from the FFNJ category, grouped into 12 batches, each containing up to 250 pages. The metadata extraction model was initially developed by creating search patterns in collaboration with the curator of the collection, and was improved iteratively using the techniques discussed earlier until an optimal stage was reached. After a set of batches was processed, a summary file was generated for each batch, indicating the total number of NJs identified and processed, along with the percentage of misses for each field within the batch. These results and other findings for four test batches are shown in Table 3 and discussed below.
Table 3 – Percentage of Metadata Extraction Errors per Field in Test Batches
Table 3 presents metadata extraction results from four sample batches in the FFNJ category, consisting of approximately 300 pages and 900 cases. The last column of the table shows the results averaged over these four batches.
The value in the left hand cell for each Batch indicates the percentage of cases for which a field could not be found by the metadata search engine. (A '-' indicates that the field is present in every case.) The value in the right hand cell (with grey background) shows the actual percentage of erroneous or missing values for those fields. The discrepancy arises because a field reported as not found may be legitimately absent from the NJ itself, as permitted by certain layout styles, rather than missed by the search; conversely, a field that is found may still carry an incorrectly extracted value. The right hand figures reflect manual verification against the scanned documents.
The test cases in Table 3 show that the actual error in extracting metadata for any field using the current model remains below 10 percent, which is also confirmed by the results of other FFNJ test batches. We regard this as satisfactory, although the results should improve with better OCR correction algorithms.
Figure 6 illustrates the Web-based retrieval of an FDNJ record from an archive for the product "A1-Salve", by browsing the "Product Keywords" metadata field.
Figure 6. Access to an FDANJ document by browsing a metadata field
The Metadata Extraction system has been designed with collection-level portability in mind. To customize it for a particular collection, the following are needed: a training set of manually classified pages from which the layout recognition models are generated; a metadata search model encoding the extraction and search rules for the collection's metadata fields; collection-specific OCR correction (substitution) patterns for the Text Editor; and, for complex cases, any specialized extraction logic and collection-level post-processing.
There are several areas in which further work is ongoing or planned to make the AME system a robust tool to extract metadata from other semi-structured documents with relative ease.
We have developed an automated metadata extraction system using layout recognition and metadata search models. The rule-based search model, incorporating search rules as string patterns (represented as Regular Expressions), has been applied to a semi-structured text corpus from the Food and Drug Administration. It was successful in extracting embedded metadata with more than 90 percent accuracy, and in indicating where a search failed. The system is designed for easy customization to other collections with similar characteristics.
The authors acknowledge the collaboration of John Rees, the curator of the Archives and Modern Manuscript program at NLM's History of Medicine Division, in helping with the development of the layout models and the metadata search model for the FDANJ collection, and on other aspects of this work.
This research was supported by the Intramural Research Program of the National Institutes of Health, National Library of Medicine, and Lister Hill National Center for Biomedical Communications.
Dharitri Misra is Lead Consultant at Aquilent, Inc. and a researcher at the U.S. National Library of Medicine. Her work involves developing experiments and tools to aid the preservation of digital resources, including the automated extraction of metadata from text documents. She earned her M.S. and Ph.D. degrees in Physics from the University of Maryland.
Siyuan Chen is a postdoctoral fellow at the U.S. National Library of Medicine. He earned his M.S. and Ph.D. degrees in Electrical Engineering from the State University of New York at Buffalo. His research interests include handwriting recognition, optical character recognition, pattern recognition, and machine learning.
George R. Thoma is a Branch Chief at an R&D division of the U.S. National Library of Medicine. He directs R&D programs in document image analysis, biomedical image processing, animated virtual books, and related areas. He earned a B.S. from Swarthmore College, and the M.S. and Ph.D. from the University of Pennsylvania, all in Electrical Engineering. Dr. Thoma is a Fellow of the SPIE, the International Society for Optical Engineering.