DocView: Providing Access to Printed Literature through the Internet
Walker FL, Thoma GR.
Proceedings IOLS'95. Medford NJ: Learned Information, 1995; 165-173.
Abstract:
DocView is a prototype Windows application that provides an end user two ways of accessing and using printed literature through the Internet. First, DocView permits printed documents to be received over the Internet from remotely located Ariel stations. Ariel provides document access in a manner similar to facsimile transmission, except that the communications medium is the Internet rather than the telephone system. Second, DocView serves as a viewer for World Wide Web (WWW) client applications such as NCSA Mosaic. DocView provides viewing capability for document literature stored in the form of multi-page TIFF images. Printed literature stored in this format on World Wide Web, Gopher or FTP servers can be accessed and downloaded by Mosaic, then viewed through DocView. Once DocView has received documents either from an Ariel system or an Internet client application, it provides the user with tools for using the electronic images. These tools include displaying the pages on the screen, manipulating the images (zoom, scroll, pan), copying portions of pages of interest, electronically marking desired pages, and printing the pages. This paper describes DocView and details recent developments.
1. INTRODUCTION AND BACKGROUND
Electronic access to bibliographic and full-text databases has been routine for many years, but the electronic retrieval of literature already printed is rare even today. A major problem hindering electronic delivery of the printed literature is that very little published literature is available in electronic form. In most cases, only new information is becoming available through electronic media such as the Internet, rather than previously published literature. However, some pilot projects involve accessing images of the printed literature over Local Area Networks (LANs). Examples are the Red Sage project at the University of California, San Francisco, and the TULIP projects conducted at a dozen universities throughout the country. The question of how best to distribute already-printed literature electronically over the Internet has to be addressed. A major technical issue is rendering document pages in a manner suitable for display to an end user, particularly for compound documents containing both text and graphics on the same page. Documents can be represented in different formats: in bitmapped form, in a page description language format (e.g., Adobe's PostScript or Portable Document Format), and in a hypertext form. If the documents are not already in electronic form, the paper can be converted either to bitmapped images (using a scanner) or to machine-readable textual data (by OCR or manual keying). Access to either text or images is made possible by the advent of the World Wide Web server and suitable clients such as NCSA's Mosaic. This client/server interaction requires documents or other objects to be accessible in a HyperText Markup Language (HTML) form that permits a hypertext link from the "home page" of the WWW server to objects, and from a word or region in an object to other objects that may be machine-readable documents or bitmapped images. Document viewers might be required to display the documents on the user's screen.
If the literature is delivered in PostScript, a PostScript viewer may accurately render the pages on a video monitor, or the pages may be printed on a variety of printers. Documents may also be formatted in Portable Document Format (PDF) and displayed by viewers such as Adobe Acrobat, which is available for several computer platforms. Documents may also be formatted in any of the major word processing formats, such as WordPerfect or Microsoft Word, which can accurately render pages both on-screen and upon printing. Finally, the printed literature may be scanned and distributed as bitmapped images in popular formats such as Tag Image File Format (TIFF).[1]
The task of converting the existing base of printed literature to electronic form is daunting: there have been trillions of pages printed. Two methods of conversion are widely used for producing HTML, PostScript or a portable document format: retyping and Optical Character Recognition (OCR). Retyping the literature is very slow and expensive. Recent improvements in OCR make it possible to rapidly convert faxes or scanned pages to ASCII text. Unfortunately, OCR suffers from reliability problems: accuracy degrades as print-to-background contrast worsens, as print becomes smaller, and as the number of fonts increases. Even with advertised OCR accuracies of 97 or 98 percent, errors must be manually corrected if the information content is to be delivered accurately. This leads to significantly slower conversion throughput and significant labor costs. Finally, OCR usually does a poor job of capturing font information and page composition. These must be restored manually after OCR if both text and graphics are to be rendered accurately. In place of retyping and OCR, producing compressed bitmapped images upon scanning the document remains a reasonable option. This conversion technique has the advantage over OCR and retyping of being relatively fast and of producing an accurate rendition of the printed page. Its main disadvantage is the size of the resulting bitmapped image. However, standard compression schemes can reduce the size of the image, making this electronic document format attractive. One of the National Library of Medicine's two R&D divisions, the Lister Hill National Center for Biomedical Communications, has undertaken several pilot projects to address the need to both preserve and deliver biomedical literature.
One project, called SAIL (System for Automated Interlibrary Loan)[2], automates delivery of documents requested by users by using a preselected digital image store for a small collection of journals. It consists of a network of PC-based workstations that retrieve document requests from the Library's mainframe computer, parse the requests into requester and document information, use the document information to retrieve images from an optical disk archive, and use the requester information (e.g., fax number, user address) to automatically fax the images to the requester. For users who want the articles by mail, SAIL prints the page images automatically, along with the cover of the journal issue and a printout of the request. On a small scale, SAIL helps to minimize human intervention in the delivery of the articles.
As designed, SAIL is limited to delivering documents by mail and fax. Though adequate for the intended purpose, both have well-known drawbacks. Mail tends to be slow and unreliable, and items can get lost. Facsimile transmission is faster than mail, but tends to be more expensive. Fax can also be unreliable: pages can be garbled in transmission. In addition, conventional fax is limited to 200 dots per inch resolution, which compares unfavorably with the best photocopy. By using the Internet for document delivery, it should be possible to have a faster, cheaper, more reliable and more convenient method than either fax or mail. This is because the Internet offers higher speed, higher image resolution and lower transmission cost. These advantages promise to become even more pronounced as the backbone speed of the Internet, currently at T3 (about 45 Mbps), moves up gradually to OC-3, OC-12 and eventually to gigabit-per-second speeds. These favorable characteristics of the Internet inspired the DocView R&D project. DocView is a software program intended for the end user. It currently uses bitmapped images produced from scanning printed literature. DocView helps the end user obtain printed literature electronically through the Internet, and use it electronically. It is currently in prototype form and will be beta tested in 1995. This paper covers the features of DocView, and details recent DocView developments.
2. OVERALL DESCRIPTION
Intended to provide document images to end users via the Internet, DocView is a Microsoft Windows application. The minimal hardware and software requirements for running DocView are listed below. In addition to the files associated with DocView, the computer running DocView requires:
- a PC having an Intel 80386 or better processor, with a speed of at least 33 megahertz
- a minimum of 8 megabytes of memory in the computer
- a VGA monitor (640 x 480 resolution) or monitor of higher resolution
- an Internet connection, either provided directly by an Ethernet or Token Ring board, or a dial up connection provided through Serial Line Internet Protocol (SLIP) or Point-to-Point Protocol (PPP)
- a printer to print received documents (laser printer preferred)
- a mouse for manipulating images
- Microsoft Windows 3.1, Windows for Workgroups 3.11 or Windows NT 3.5
- TCP/IP Stack with Windows Sockets Dynamic Link Library (DLL)
- Gopher or World Wide Web Client (not required for reception from an Ariel station)
A major goal for DocView is the capability of receiving document images sent over the Internet from remotely located Ariel stations. Ariel is a software package developed by Research Libraries Group for a workstation comprising a PC, a scanner and printer.[3] Many libraries are beginning to use Ariel for interlibrary loan in a "fax-like" manner, but via the Internet. DocView provides Ariel-compatible communication, recognizes Ariel documents, and allows image viewing, manipulation and printing after document reception. To receive a document such as a journal article, the DocView user may contact a library through email, telephone, fax or other electronic means, and ask for the article to be sent directly to the user's computer. The library then scans the article using an Ariel station, which sends it to the user's computer. A blinking DocView icon alerts the user to the arrival of the document.
In addition to providing compatibility with Ariel systems, DocView can serve as a viewer for multi-page TIFF images. When used in conjunction with another Internet client application running on the same computer, DocView provides document viewing capability for that client. An example of a compatible client is NCSA Mosaic.[4] This client can access documents on FTP servers, Gopher servers and World Wide Web (WWW) servers. For instance, an information provider such as a library can provide documents on demand by using a WWW server. The WWW server may be configured to allow a user to browse through a list of available documents such as journal articles, to select one and to receive it immediately. After Mosaic downloads the multi-page TIFF document file to the computer, it spawns DocView to provide image viewing, manipulation and printing capabilities.
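The multi-page TIFF layout that a viewer must navigate can be illustrated with a short sketch. Under the TIFF 6.0 specification [1], a file begins with an 8-byte header (byte-order mark, the number 42, and the offset of the first image file directory, or IFD); each IFD describes one image and ends with the offset of the next IFD, with zero marking the last page. The Python below is only an illustration of that structure and is not taken from DocView; the function names `count_tiff_pages` and `make_skeleton_tiff` are hypothetical, and the skeleton file it builds carries no pixel data.

```python
import struct


def count_tiff_pages(data: bytes) -> int:
    """Count the pages of a TIFF by following its chain of IFDs."""
    order = {b"II": "<", b"MM": ">"}.get(data[:2])
    if order is None:
        raise ValueError("not a TIFF file: bad byte-order mark")
    magic, offset = struct.unpack(order + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a TIFF file: bad magic number")

    pages = 0
    seen = set()
    while offset != 0:
        if offset in seen:
            raise ValueError("corrupt TIFF: IFD chain loops")
        seen.add(offset)
        # Each IFD starts with a 2-byte entry count, followed by
        # 12-byte entries, followed by the 4-byte offset of the next IFD.
        (entry_count,) = struct.unpack_from(order + "H", data, offset)
        (offset,) = struct.unpack_from(
            order + "I", data, offset + 2 + 12 * entry_count
        )
        pages += 1
    return pages


def make_skeleton_tiff(num_pages: int) -> bytes:
    """Build a little-endian TIFF skeleton for demonstration only.

    Each IFD holds a single ImageWidth entry and no image data, so the
    result exercises the page chain but is not a displayable image.
    """
    data = bytearray(struct.pack("<2sHI", b"II", 42, 8 if num_pages else 0))
    ifd_size = 2 + 12 + 4
    for page in range(num_pages):
        last = page == num_pages - 1
        next_offset = 0 if last else len(data) + ifd_size
        data += struct.pack("<H", 1)                # one entry in this IFD
        data += struct.pack("<HHII", 256, 3, 1, 1)  # ImageWidth, SHORT, count 1
        data += struct.pack("<I", next_offset)      # link to the next page
    return bytes(data)


if __name__ == "__main__":
    article = make_skeleton_tiff(10)
    print("pages in file:", count_tiff_pages(article))
```

Because every page is reached through the previous page's IFD, a viewer can learn the page count and locate any page without decompressing image data.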
Figure 1 illustrates the five sources from which a DocView user may obtain documents. Besides Ariel systems, the other four sources are servers, which are useful for repeated access to a large document collection: DocView Server, FTP Server, Gopher Server and World Wide Web Server. Early in the project a DocView Server was created to run on a UNIX platform. This provided a means of experimenting with document delivery over the Internet. DocView accesses the DocView server directly through its built-in communications subsystem. It also acts as a viewer for documents from FTP Servers, Gopher Servers and World Wide Web Servers obtained through an Internet client application such as NCSA Mosaic, which handles the communications. Regardless of the document access method, DocView facilitates document delivery over the Internet. Although the computer running DocView may be physically separated from the source of documents by thousands of miles, document delivery happens in a matter of seconds or minutes.
The features and facilities of DocView's user interface include:
- Receive and view documents sent from Ariel Workstations
- Receive and view TIFF pages via an Internet client application
- Image expansion software for compressed images
- Display multiple documents on the screen simultaneously
- Manipulate document images by zooming, scrolling, and panning
- Create electronic bookmarks to keep track of important pages
- Print pages of one or more documents
- Copy function for creating new documents from those received over the Internet
- Built-in tutorial for new users
- Send comments over the Internet to DocView's developers
The images DocView handles may be in CCITT Group III or IV, PackBits or uncompressed format. Of these compression techniques, Group IV generally achieves the best compression ratio. For a typical biomedical journal page scanned at 300 dots per inch, the uncompressed size is about 1 megabyte, while the Group IV compressed size ranges from 70 to 100 kilobytes, a compression ratio better than 10 to 1. For each journal article, it is convenient to store the pages as a group of images. Multi-page TIFF format is well suited to this because it permits several images to be stored in a single file. A WWW server can provide pointers to the location of the multi-page TIFF files. Figure 2 illustrates how a user might browse through an electronic journal. Here, the user starts with a list of available volumes, chooses a volume, sees a list of issues available for that volume, chooses an issue, and then sees a table of contents for that issue. Upon choosing an article from the table of contents, the multi-page TIFF file corresponding to that article is downloaded and made available for viewing. While a WWW client such as Mosaic handles the transmission of the file across the Internet, DocView provides image display. In addition to image display, DocView permits the user to manipulate the images (zoom, scroll, pan), copy portions of pages of interest to the clipboard, electronically "bookmark" desired pages, and print only the pages needed. DocView also provides a convenient function for quickly returning control to the Internet client application when the user has finished viewing the article.
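The page-size figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes an 8.5 x 11 inch page scanned at one bit per pixel, takes a kilobyte as 1024 bytes, and uses a 14.4 kbit/s dial-up modem as an example of a slow link; none of these assumptions is stated in the text, and the transfer times ignore protocol overhead. The function names are illustrative.

```python
KILOBYTE = 1024  # assumption: "kilobytes" taken as 1024 bytes


def bitonal_page_bytes(width_in: float, height_in: float, dpi: int) -> int:
    """Uncompressed size of a 1-bit-per-pixel scan, rows padded to whole bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    bytes_per_row = (width_px + 7) // 8
    return bytes_per_row * height_px


def transfer_seconds(num_bytes: int, bits_per_second: float) -> float:
    """Ideal transfer time, ignoring protocol overhead and congestion."""
    return num_bytes * 8 / bits_per_second


if __name__ == "__main__":
    page = bitonal_page_bytes(8.5, 11, 300)  # assumed page dimensions
    print(f"uncompressed page: {page} bytes")
    for compressed_kb in (70, 100):
        ratio = page / (compressed_kb * KILOBYTE)
        print(f"{compressed_kb} KB compressed: about {ratio:.1f} to 1")

    article = 10 * 100 * KILOBYTE  # ten pages at the upper size bound
    links = (("14.4 kbit/s modem (assumed)", 14_400),
             ("T3 backbone, 45 Mbps", 45_000_000))
    for name, bps in links:
        print(f"{name}: {transfer_seconds(article, bps):.1f} s")
```

Under these assumptions the uncompressed page comes to roughly 1.05 million bytes, so both ends of the 70 to 100 kilobyte range give a ratio above 10 to 1. A ten-page article takes several minutes over the assumed modem link and well under a second at T3 speed.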
3. RECENT DOCVIEW DESIGN DEVELOPMENTS
Details of DocView's design are given in [5]. Figure 3 shows the structure of the DocView software when configured to receive Ariel documents and to provide viewing capability for TIFF images. In the center, DOCVIEW.EXE provides the user with functions for manipulating document images. The two modules that communicate with Windows Sockets are the Internet client and ARIEL2.EXE. The latter module is designed specifically for compatibility with remote Ariel machines. The original Ariel software version 1.12 used a modified form of TFTP[6] for document transmission. DocView was first made compatible with this communication. In September 1994 Research Libraries Group released a new version of Ariel that uses a modified form of FTP[7] for document transmission. DocView has been redesigned to be able to receive documents using this new method. In preliminary testing, it has been found that Ariel's FTP transmission of documents is faster than its TFTP transmission. Both methods have been shown to be equally reliable. When an Ariel document arrives at the DocView client computer, the ARIEL2.EXE module informs DOCVIEW.EXE using Dynamic Data Exchange. DOCVIEW.EXE activates an icon to notify the user that a new Ariel document has arrived and is ready for viewing.
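The text above does not say why the FTP-based transfer proved faster. One plausible contributor can be sketched from the standard protocol [6]: TFTP sends data in 512-byte blocks and waits for an acknowledgment of each block before sending the next, so transfer time cannot fall below one network round trip per block. The Python below assumes unmodified RFC 1350 behavior, which Ariel's modified protocol may not follow, and the 50 ms round-trip time is an arbitrary illustrative value.

```python
TFTP_BLOCK_SIZE = 512  # bytes of data per DATA packet in RFC 1350


def tftp_block_count(file_size: int) -> int:
    """Number of DATA packets needed to send file_size bytes.

    A block shorter than 512 bytes ends the transfer, so a file whose
    size is an exact multiple of 512 needs one extra, empty block.
    """
    return file_size // TFTP_BLOCK_SIZE + 1


def lockstep_floor_seconds(file_size: int, round_trip_s: float) -> float:
    """Lower bound on transfer time when every block awaits its ACK."""
    return tftp_block_count(file_size) * round_trip_s


if __name__ == "__main__":
    page_bytes = 100 * 1024   # one compressed page at the 100 KB bound
    round_trip = 0.050        # assumed round-trip time in seconds
    blocks = tftp_block_count(page_bytes)
    floor = lockstep_floor_seconds(page_bytes, round_trip)
    print(f"{blocks} blocks, at least {floor:.2f} s per page")
```

FTP carries the file over a TCP connection, which can keep many packets in flight at once, so it is not bound by this per-block round-trip floor.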
In its early implementation, the prototype DocView software used a compression board that handled image expansion. Manufactured by Kofax Image Products, this board provided rapid image display and fast printing, but had the serious drawback of cost: about $2500. This would have put DocView beyond the reach of most potential users. A second generation DocView used software from Accusoft Corporation for image expansion, rotation and scaling, to replace the Kofax board. Its image expansion times were faster than the board's, but printing became slower. While the board could print a page in ten seconds, printing with Accusoft took about 45 seconds. However, this may not be a significant disadvantage, since printing takes place in the background, and most users can wait a few minutes to print a typical document. We also experimented with Kofax software in place of the Accusoft software, to provide image expansion, scaling and rotation.
The current beta version of DocView uses software designed in-house that handles all image processing functions. It replaces the Kofax and Accusoft products. The main advantage of designing this software in-house is that bugs can be easily fixed. Also, with all DocView software designed in-house, distribution of DocView is simplified. The current image processing software handles not only the expansion of Group III, Group IV, and PackBits images, but also provides image compression for Group IV. Image compression will be used for image editing and document transmission (to facilitate document sharing among users), when those features are later integrated into DocView. A comparison of image expansion times for the software designed in-house, Kofax and Accusoft is given in Table 1. To measure these times, a typical ten-page biomedical journal article was scanned at 300 dots per inch, compressed and stored on disk. The times shown are the average image expansion times. The measurements were made on a 66 megahertz 486 computer. The time taken for image expansion is important since it is the major component of the total time needed for displaying an image. System response time, and ultimately user satisfaction, are dependent on the image expansion time. The software designed in-house has an image expansion speed slightly slower than that of Kofax software. The Accusoft software is considerably slower, with the Kofax hardware being the slowest. Image rotation times for Kofax software and that designed in-house are comparable; both take less than one second to rotate an image, while the Accusoft software takes eight to ten seconds to rotate a typical biomedical journal image.
Table 1. Average image expansion times.

| Image expansion method | Average expansion time |
|---|---|
| Kofax Software | 0.40 sec |
| In-house Software | 0.55 sec |
| Accusoft | 1.02 sec |
| Kofax Hardware | 1.73 sec |
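To make concrete what image expansion involves, the simplest of the formats DocView handles, PackBits, can be decoded in a few lines. As described in the TIFF 6.0 specification [1], each run starts with a header byte: values 0 to 127 mean "copy the next n + 1 bytes literally", values 129 to 255 mean "repeat the next byte 257 - n times", and 128 is a no-op. The sketch below is an illustration only and is not DocView's in-house code; the Group III and Group IV decoders, which are considerably more involved, are not shown. The function name `packbits_decode` is hypothetical.

```python
def packbits_decode(data: bytes) -> bytes:
    """Expand a PackBits-compressed byte string."""
    out = bytearray()
    i = 0
    while i < len(data):
        header = data[i]
        i += 1
        if header < 128:
            # Literal run: copy the next header + 1 bytes unchanged.
            run = data[i:i + header + 1]
            if len(run) != header + 1:
                raise ValueError("truncated literal run")
            out += run
            i += header + 1
        elif header > 128:
            # Replicate run: repeat the next byte 257 - header times.
            if i >= len(data):
                raise ValueError("truncated replicate run")
            out += bytes([data[i]]) * (257 - header)
            i += 1
        # header == 128 is defined as a no-op and is skipped.
    return bytes(out)


if __name__ == "__main__":
    packed = bytes.fromhex("FEAA0280002AFDAA0380002A22F7AA")
    unpacked = packbits_decode(packed)
    print(f"{len(packed)} bytes expand to {len(unpacked)} bytes")
    print(unpacked.hex(" "))
```

In a TIFF file, PackBits is applied to each image row separately, so a viewer would call a routine like this once per row or strip rather than once per page.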
Other recent DocView design developments include a built-in tutorial that is part of DocView's help facility. New users should find this to be particularly helpful when using DocView for the first time. Also, a DocView User Manual was created to aid first-time users, and a Setup disk was created for easily installing DocView.
4. DOCVIEW EVALUATION
DocView is being beta tested beginning in 1995. Several beta test sites will be selected to provide feedback on DocView. Two testing scenarios are possible: one for receiving Ariel documents, and one for accessing on-line literature. Libraries are expected to provide the majority of test environments. Some test sites may want to use Ariel to send documents to users equipped with DocView. Others may want to experiment with on-line collections placed on an Internet server. Users will be able to access those collections through an Internet client and DocView. Among the questions to be addressed by the evaluation are the following:
- Does direct Internet access to an on-line collection of documents result in time and cost savings? By how much?
- Does document delivery directly to the desktop using Ariel with DocView result in time and cost savings over conventional means of document delivery? By how much?
- What functions/features of the user interface contribute to easy document usage, speed of task completion, and overall user satisfaction?
Answers to these and other questions will direct the future development of DocView. User feedback will be collected in two ways: a paper questionnaire and a built-in comments function. First, a user satisfaction questionnaire will be distributed with DocView to each beta tester. After using DocView for a fixed amount of time, the user will answer the paper questionnaire and return it to the developers. The questionnaire will elicit information on whether DocView improves the manner in which people use their libraries. It will gather the user's impressions of DocView, and find out whether DocView increases the number of documents that users request from their libraries. The questionnaire will solicit user impressions on the legibility of DocView images. It will gather information about whether people think DocView is easy to learn. The questionnaire will elicit opinions on the practicality and usefulness of DocView's on-line tutorial and printed user's manual. Finally, it will find out what people think about DocView's dependability, its capabilities and its various features.
In addition to the paper questionnaire, beta testers will be offered a comments function built into DocView. Appearing as an item under DocView's help menu, the comments function allows testers to send comments, questions and criticisms electronically to DocView's developers. The users may optionally enter information identifying themselves so that DocView's developers can contact them if necessary. After a user fills in the on-screen form, the comments are sent over the Internet to a server at the Lister Hill Center that is specially designed for collecting the comments and runs continuously, day and night.
5. FUTURE DEVELOPMENTS
Future DocView developments will hinge on results from the beta testing. With favorable feedback, a number of steps are planned. In the immediate future DocView's image processing software will be optimized for performance. This will increase the speed of image expansion, display and rotation. DocView will also be modified to run as a 32-bit application under both Windows NT and Windows 95. A major design goal is to provide a document request function that empowers the end user to electronically send a request to document suppliers having Ariel stations. This will supplement or eliminate telephone calls, trips to libraries, and other means that people have been using to request documents. A second goal is to enable a user to send documents via DocView in addition to receiving them. This will make it easy for researchers to share documents with their colleagues. A third goal is to add character recognition to convert the received bitmapped images to text, enabling users to search the document for relevant information. A fourth goal is to extend the DocView client software to other computer platforms such as the Macintosh and UNIX computers. As a fifth goal, other methods of document transmission will also be investigated for possible integration into DocView. An example is the Multipurpose Internet Mail Extension (MIME)[8], which will allow images to be sent via Internet email. MIME could prove to be a viable alternative to systems such as Ariel, since it is not tied to a specific hardware platform; any computer could provide the source of images. Finally, image editing functions are being considered for DocView. These will allow a user to add or delete pages, and change the images as they appear when displayed or printed.
6. REFERENCES
1. TIFF Revision 6.0, Aldus Corporation, June 3, 1992.
2. System for Automated Interlibrary Loan: System and Operations Description. Internal Technical Report of the Communications Engineering Branch, Lister Hill National Center for Biomedical Communications, National Library of Medicine. November 1992.
3. Bharadwaj R. The Ariel project. Proc. ASIS, vol. 28 (1991); p.339.
4. NCSA Mosaic for Microsoft Windows, National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign.
5. Walker FL, Thoma GR. Access to Document Images over the Internet. Proceedings IOLS'94. Medford NJ: Learned Information, 1994; 185-97.
6. Sollins, K. The TFTP Protocol, Request for Comments #1350, July 1992, available through the Internet.
7. Postel, J. and Reynolds, J. File Transfer Protocol, Request for Comments #959, October 1985, available through the Internet.
8. Borenstein, N. and Freed, N. MIME (Multipurpose Internet Mail Extensions), Request for Comments #1341, June 1992, available through the Internet.









Figure 1.
Figure 2.
Figure 3.