Automating the production of bibliographic records for MEDLINE

12 Next tasks

In individual sections we have outlined tasks that will improve the automatic and manual processes in MARS. In addition, we seek to initiate new projects that go beyond such incremental improvements. These include: creating a ground truth database for research in document image analysis and understanding; a system to extract bibliographic data automatically from online journals; and an alternative method that could improve the productivity of the reconcile (final verification) stage in the system.

12.1 Ground truth data: PathFinder

12.1.1 Introduction and objective

Successful document image analysis is greatly dependent on ground truth data for the design, training and testing of algorithms for data identification and extraction. However, ground truth datasets and their associated analysis and visualization tools are usually created to analyze problems in specific datatypes: skewed document images (Okun et al.) [48], handwritten documents (Cha and Srihari) [49], video sequences (Doermann and Mihalcik) [50], statistical data (Swayne et al.) [51], and speech signals (Barras et al.) [52]. Moreover, apart from the domain-specific nature of these datasets and tools, they are usually limited as to operating platforms and data formats, as described in an excellent taxonomy on this subject by Kanungo et al. [53] To our knowledge no ground truth dataset exists that represents the corpus of biomedical journals, and none with the goal of extracting the text representing the bibliographic fields descriptive of the articles within these journals.
In meeting this challenge, the main objective of the PathFinder (Public Archive To Help Find New Designs for Expert Ratiocination) project is to exploit the large volume of document images and OCR-converted, operator-verified data already collected in the MARS project, to aid the computer science and medical informatics communities in developing innovative and efficient algorithms for automatic zoning, labeling and reformatting. This is to be done by developing a ground truth database accessible via the Web. The ground truth data will include page, zone, line, word, and character level information. In addition to providing a public site for researchers worldwide to develop and test their algorithms, this system will enable them to graphically visualize the ground truth data and employ an automated analysis assistant. Code-named Rover (gROundtruth Vs. Engineered Results), this automated analysis assistant will compare the results of a researcher's program to the ground truth data. The ground truth and results data will be in XML, and MARS Rover will be written in Java. The overall website development will use Macromedia Dreamweaver UltraDev 4 to provide a rich interface and extensive database connectivity.

12.1.2 Design considerations

Page layout. Identifying geometric features to design rule-based algorithms for automated data extraction is a non-trivial task since (as discussed earlier in this report) there is a variety of layout geometries in the 4,300 journal titles indexed in MEDLINE, though most follow the reading order paradigm of article title, author names, author affiliation, abstract. Ground truth data must include samples of all major layout types classified as follows:
Data format. The data format of choice is XML for images, OCR-converted data and operator-verified data since XML excels in adaptation (to accommodate changes in data), maintenance, linking (from one piece of data to another), simplicity, and portability over networks, operating systems and development environments. Moreover, XML is growing in popularity and its strengths have been proven in existing MARS modules such as CheckIn and Upload. One of the first tasks in the PathFinder project will be to select and convert the MARS ground truth data into XML.

Modifying existing data. While rich in the information it provides, the existing data has certain deficiencies. For example, although OCR output characters in error are corrected, their attributes (e.g., italics or bold) are not. But these attributes can serve as features to create algorithmic rules to correctly identify zones or labels. Therefore, some effort will be made to correct these before including them in the ground truth dataset.

12.1.3 Rover: a visualization and analysis tool

Once the data is available in XML format along with the original TIFF image, researchers need the functionality to:
A survey of visualization tools [53] identified TrueViz as a suitable platform for the design of Rover, since it provides the first two features noted above. TrueViz is a public domain tool developed at the University of Maryland for visualizing, creating and editing ground truth and metadata. It is implemented in Java and works on Windows and Unix platforms (specifically, it has been successfully tested in the Windows 2000 and Sun Solaris 2.6 environments). It reads and stores ground truth and metadata in XML format, and reads the corresponding TIFF images. It allows the user to inspect ground truth data at many levels, viz., at the page, zone, line, word, and character levels, and provides pertinent information at each level. For example, at the character level, such information includes the character code, font type and style, and bounding box (x1,y1,x2,y2) coordinates.

To serve as an effective analysis assistant, Rover will extend TrueViz to provide researchers the capability to compare their XML data to MARS ground truth XML data graphically, and thereby serve as a powerful tool in helping them iteratively refine their algorithms. In addition, it will provide numerous statistics and visual presentations on specified areas. For example, the user would be able to use Rover to compare all characters in the dataset that are bold, and to enumerate errors. Rover would visually locate the mistakes and report statistics on the query. In this example it would report the number of bold characters, the number detected correctly, and the number detected incorrectly, both as absolute numbers and as percentages. Rover would also export this information to a database or spreadsheet for further analysis.
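The bold-character query described above can be illustrated with a minimal sketch. It is written in Python for brevity rather than in Java, the language planned for Rover, and it is not the Rover or TrueViz implementation. The XML element and attribute names (`Page`, `Zone`, `Line`, `Word`, `Character`, `id`, `code`, `bold`), the sample data, and the function names are all hypothetical; the actual MARS and TrueViz schemas may differ. The sketch assumes that characters in the ground truth and in a researcher's results share identifiers, counts the ground truth bold characters, and reports how many the results flag correctly and incorrectly, as absolute numbers and as percentages.

```python
import xml.etree.ElementTree as ET

# Hypothetical ground truth fragment: the element and attribute names are
# assumptions for illustration, not the actual TrueViz or MARS schema.
GROUND_TRUTH = """
<Page id="p1"><Zone id="z1"><Line id="l1"><Word id="w1">
  <Character id="c1" code="M" bold="true"/>
  <Character id="c2" code="A" bold="true"/>
  <Character id="c3" code="R" bold="false"/>
  <Character id="c4" code="S" bold="true"/>
</Word></Line></Zone></Page>
"""

# Hypothetical output of a researcher's algorithm for the same characters.
RESULTS = """
<Page id="p1"><Zone id="z1"><Line id="l1"><Word id="w1">
  <Character id="c1" code="M" bold="true"/>
  <Character id="c2" code="A" bold="false"/>
  <Character id="c3" code="R" bold="true"/>
  <Character id="c4" code="S" bold="true"/>
</Word></Line></Zone></Page>
"""


def bold_flags(xml_text):
    """Map each character id to True if it is marked bold."""
    root = ET.fromstring(xml_text.strip())
    return {ch.get("id"): ch.get("bold") == "true"
            for ch in root.iter("Character")}


def compare_bold(truth_xml, results_xml):
    """Compare bold detection in the results against the ground truth."""
    truth = bold_flags(truth_xml)
    results = bold_flags(results_xml)

    bold_ids = [cid for cid, is_bold in truth.items() if is_bold]
    # Ground truth bold characters that the results failed to flag as bold.
    missed_ids = [cid for cid in bold_ids if not results.get(cid, False)]
    # Characters flagged bold in the results but not bold in the ground truth.
    false_positive_ids = [cid for cid, is_bold in results.items()
                          if is_bold and not truth.get(cid, False)]

    total = len(bold_ids)
    incorrect = len(missed_ids)
    correct = total - incorrect

    def pct(n):
        return 100.0 * n / total if total else 0.0

    return {
        "bold_total": total,
        "correct": correct,
        "incorrect": incorrect,
        "correct_pct": pct(correct),
        "incorrect_pct": pct(incorrect),
        "missed_ids": missed_ids,
        "false_positive_ids": false_positive_ids,
    }


if __name__ == "__main__":
    for key, value in compare_bold(GROUND_TRUTH, RESULTS).items():
        print(f"{key}: {value}")
```

The lists of mismatched identifiers stand in for the step where Rover would visually locate the mistakes; a real tool would use each character's bounding box to highlight them on the page image.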
12.1.4 Ground truth distribution

A website will provide access to the ground truth data, though it will go beyond serving as a simple data repository. Our objective is to encourage researchers to share and exchange ideas, as well as provide feedback to our development team for new desirable features. Figure 12.1.2 shows the organization of the website and how the elements interconnect. This section presents an overview of the website layout and functionality.
Product introduction. A Macromedia Quick Flash "movie" is displayed to the users, though this may be bypassed. We have already developed this movie and a demonstration of it is available at www.geocities.com/egalthan/Movie1.html. The metaphor used is a Sherlock Holmes style magnifying glass moving over a rare biomedical document. Figure 12.1.3 is a screen snapshot of the final frame of the movie.
User login. On the main page of the website, users are asked for a name and password. This data will be stored in a SQL Server 2000 database along with the website data. We intend to use Macromedia UltraDev as the design tool, because it readily allows for password protection and database connectivity. While we do not anticipate restricting anyone from accessing this site, we intend to require all users to register. When registered users log in, they will be taken to the main user page. Unregistered users will be presented with a registration page before being allowed access.

New user registration. The system will require first-time users to enter a few important items of information. This information will enable us to secure the system and to contact users about important new data or features added to the website in the future. The registration page will be developed using UltraDev, which gives the developer a graphical design environment and supports rapid page development.

Data access. Once logged in or registered, the user will have access to all the ground truth data and tools. Since the data collection will be quite large, the system will allow users to download the entire data set or any subset they choose. For example, some users may only be interested in certain types of journal layouts, such as double column abstracts. The ground truth data will be organized in a hierarchical fashion as shown below.
Data analysis. The users will be provided a link to launch Rover. While the initial version of this tool will possess the current functionality of TrueViz, the complete analysis support discussed earlier will be designed in a later phase. At present, TrueViz may be used to view ground truth data and research data (e.g., results of algorithmic processing) following the XML defined structure. Rover will provide the users the ability to automatically compare ground truth data against research data, and perform statistical analysis tasks.

Bulletin board. This section of the website will be available to the user community to report bugs, and to use as a forum for discussions related to algorithmic development.

Upload and download area. This section of the website is for the users to upload and download files, such as technical papers written, algorithm source code, and new ground truth data.
URL: http://archive.nlm.nih.gov/pubs/thoma/mars2001_16.php