Internet Document Access and Delivery
Walker, F.L., Thoma, G.R.
"Internet Document Access and Delivery,"
Proc. IOLS '96. Medford N.J: Information Today, 1996; 107 - 116.
Keywords: DocView, Internet, Document Delivery, TIFF, MIME
Abstract: Computer and communications technologies continue to present libraries with new opportunities to serve their users' information needs. Not so long ago a library was a place where people went to read or borrow materials to learn. Recent technological advances have enabled libraries to become electronic distributors of knowledge to patrons who remain at home, in offices and in other places of business. Probably more than anything else, the Internet and the software being designed for use on it are bringing about the changes in the way people use libraries. Internet software is making it possible for libraries to allow access to text, picture, audio and video information, whether stored locally or remotely on the Internet. New technologies also allow libraries to convert paper-based collections to electronic formats for Internet transmission to remote patrons. This paper briefly describes server-based techniques that libraries may use for allowing access to library information. Then it details developments that allow libraries to send paper-based documents over the Internet. It examines the tradeoffs between point-to-point transmission of documents, such as that used by Ariel systems, and transmission through email using Multipurpose Internet Mail Extensions (MIME). This paper also describes DocView, an Internet software program being developed at the National Library of Medicine. DocView permits users to receive documents on their desktops sent from Ariel systems, and it serves as a Tagged Image File Format (TIFF) viewer for documents received from Internet-based servers. Recent DocView developments allow users to request documents directly from a library using a built-in messaging facility. Finally, this paper describes ongoing DocView developments, and shows how they will aid Internet document delivery.
1. INTRODUCTION
The use of libraries has been changing rapidly over the past twenty years due mainly to the introduction of new computer and communications technologies. Just twenty years ago most library patrons searched card catalogs prior to retrieving materials from shelves. In many libraries the computer has now replaced card catalogs. With the aid of keyword searching, computers deliver citations to books or journals considerably faster than possible through card catalogs. Many libraries have also introduced communications connections to their electronic card catalogs, enabling patrons to remotely search the library holdings databases. In addition to the electronic card catalog, two other developments are changing today's library. These are electronic documents and electronic document delivery. Recent developments in computers, software and communications make it possible to create what has been called the 'virtual library.' The virtual library is an entity that a patron may electronically visit, search its holdings, and access electronic documents, pictures, audio and video information stored at the library. All this may be done from the patron's home, school or office, no matter where it may be.
2. SERVER-BASED SOLUTIONS
The Internet is becoming an important medium for communicating library information. Libraries and information service providers have two main ways of providing information to patrons over the Internet. The first way employs a server that allows patrons to visit on-line collections. The second technique employs communication techniques for sending information upon request directly to the patron. Server-based solutions for providing access to on-line library information involve the use of File Transfer Protocol (FTP), Gopher, Wide Area Information Server (WAIS), and World Wide Web (WWW) servers. Library patrons need client software on their computers to access the appropriate server.
FTP servers are the oldest of the four types of servers; they have been around since the early days of the Internet. FTP servers allow access to information stored in files on the computer. An FTP client program allows the patron to see a list of directories and names of files stored in those directories on the server. The files may contain text, binary data, images, video or sound. Unfortunately, names given to files stored on FTP servers often fail to describe the nature of the files. Other server techniques provide better ways of describing the files. One is the Gopher server, which presents information as a series of descriptive nested menus (resembling the organization of a directory with many subdirectories and files). Gopher servers can store the same types of files as FTP servers, and Gopher client software can access information on either Gopher servers or FTP servers. In contrast to using an FTP client program, the user of a Gopher client does not need to know the location of the subdirectories or files. This is because the Gopher menu provides links to other menus or files.
A third type of server is WAIS, which is a database containing mostly text-based documents (although they may contain sound, pictures or video as well). WAIS databases may be organized in different ways, using various database systems, but the user is not required to learn the query languages of the different databases. The WAIS client uses natural language queries to find relevant documents related to the query. After a user enters a query, the WAIS server produces a list of documents, which the user may choose to view. The documents provided have no additional links to other documents.
A fourth type of server, and the most popular, is the World Wide Web server. This is a hypertext-based information system. Any word or symbol in a hypertext document can be specified as a pointer to a different hypertext document (at any other Web site, in general), where other pertinent information is located. The second document may also contain links to additional documents. As with Gopher, the user does not need to know where the referenced documents are, because they will be retrieved and displayed when needed. WWW client software programs (e.g., Netscape, Mosaic, etc.) are capable of accessing FTP, Gopher or WAIS servers, in addition to WWW servers. This versatility has made the WWW client the program of choice for potential patrons of virtual libraries.
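To make the idea of hypertext pointers concrete, the following minimal Python sketch collects the link targets found in a small HTML fragment, using only the standard library. The sample markup and its example.org addresses are made up for illustration and do not refer to any real server.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collects the href target of every <a> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def extract_links(html_text):
    """Return the link targets in html_text, in document order."""
    collector = LinkCollector()
    collector.feed(html_text)
    return collector.links


# Hypothetical page: one link to another hypertext document and one
# link to a file held on an FTP server.
SAMPLE_PAGE = """
<html><body>
<p>See the <a href="http://example.org/catalog.html">catalog</a> or the
<a href="ftp://example.org/pub/article.tif">scanned article</a>.</p>
</body></html>
"""

if __name__ == "__main__":
    for target in extract_links(SAMPLE_PAGE):
        print(target)
```

The second link illustrates the point made above: a single hypertext document can point at resources held on other kinds of servers, which the WWW client then retrieves on the user's behalf.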
3. DOCUMENTS FOR INTERNET SERVERS
A virtual library is not complete with just a server. Electronic documents or other information need to reside on the server so that patrons may access them. Most new information being put on the Internet, especially WWW servers, consists of hypertext documents. Hypertext documents use HyperText Markup Language (HTML), a quickly evolving standard. [1] There are several alternatives to HTML. These alternatives use other file formats, most of which cannot be directly displayed by WWW client software. External viewers are needed in these cases. One example is the so-called 'portable document format.' Examples of portable document formats are those used by Acrobat from Adobe Systems, Common Ground from Common Ground Software, Envoy from Word Perfect, and Replica from Farallon Computing. Portable document formats, as the name suggests, allow a document to be viewed on several computer platforms. For example, viewers for Adobe Acrobat are available on the three most widely used computer platforms: Windows, Macintosh and UNIX. These viewers render the document in a manner very similar to its appearance in published paper form.
In addition to portable document formats, other document format alternatives include simple text, word processed documents, and bitmapped images. Each offers advantages. If a library wants to provide an on-line collection derived from a print collection, then scanned documents (bitmapped images) offer the closest resemblance to the original documents. If resemblance to the original is not important, simple text or word processed documents may suffice. The disadvantages of using bitmapped images are the large size of the resulting files and the potentially lengthy time required to deliver the documents over the Internet. A problem with using word processed documents on a server is that client software packages may not be able to display documents from more than one word processor. The problem with using text is that graphics information is lost.
If conversion from a paper-based collection is required, the effort involved may be an important consideration in determining the document format to be used on the server. To produce text, word processing, hypertext or portable document formats, it may be necessary to manually key in the text or to use optical character recognition (OCR) to aid in conversion. OCR, however, is error-prone, depending on the printed font and quality of the original material, and often requires extensive manual editing. Bit-mapped images produced from scanning documents are highly accurate and are created considerably faster than typing or using OCR. However, the resulting file size, even if image compression is used, may preclude providing large collections of bitmapped images on-line.
It is worthwhile to know the relative numbers of document types available through the Internet. This is useful for anybody considering Internet distribution of documents. To get an estimate of the numbers of documents, searches can be made using WWW search engines. There are several WWW search engines that are available for use, each providing indexes to WWW servers for thousands of terms. A simple search was made with three different WWW search engines to find the quantities of documents available through the WWW. Searches for three types of documents were made: HTML, PDF and TIFF. Portable Document Format (PDF) documents are Adobe Acrobat format.[2] Tagged Image File Format (TIFF) documents are bitmapped images produced through a scanning process.[3]
The three search engines used are:
Inktomi, located at inktomi
Lycos, located at http://www.lycos.com/
WebCrawler, located at http://webcrawler.com/
Number of Documents Found
| | .html | .pdf | .tif |
|---|---|---|---|
| Inktomi | 1,674,426 | 7,398 | 4,362 |
| Lycos | 772,163 | 21,126 | 7,356 |
| WebCrawler | 32,318 | 983 | 169 |
The numbers in each column represent the number of documents of each type found by each of the search engines corresponding to the three search terms. From this table we can derive relative ratios of the types of documents:
Ratios of Document Types
| | html/pdf | html/tif | pdf/tif |
|---|---|---|---|
| Inktomi | 226:1 | 383:1 | 1.6:1 |
| Lycos | 36:1 | 104:1 | 2.8:1 |
| WebCrawler | 32:1 | 191:1 | 5.8:1 |
From these tables we can conclude that HTML documents are by far the most common of the three types. PDF and TIFF documents are relatively few in number, with TIFF documents falling in last place. PDF documents outnumber TIFF documents by ratios of 1.6:1 to 5.8:1, depending on the search engine. A plausible explanation is that HTML documents are relatively easy to produce, so they would be expected to be the most numerous. Bitmapped images consume far greater disk resources, so they would be expected to be few in number. Falling in between are Portable Document Format files. Though they consume about the same amount of disk resources as HTML documents, they are usually more difficult to produce.
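The ratios can be rechecked from the raw counts with a short Python sketch. It assumes that the three count columns are, in order, .html, .pdf and .tif (an assignment inferred from the published ratios), and that the published figures were truncated rather than rounded, which is the convention that reproduces them.

```python
# Counts from the "Number of Documents Found" table.
# The column order (html, pdf, tif) is an assumption inferred from the ratios.
COUNTS = {
    "Inktomi": {"html": 1_674_426, "pdf": 7_398, "tif": 4_362},
    "Lycos": {"html": 772_163, "pdf": 21_126, "tif": 7_356},
    "WebCrawler": {"html": 32_318, "pdf": 983, "tif": 169},
}


def truncated_ratio(numerator, denominator, decimals=0):
    """Format numerator/denominator as 'x:1', truncated (not rounded)."""
    scale = 10 ** decimals
    # Integer arithmetic keeps the truncation exact.
    scaled = (numerator * scale) // denominator
    if decimals == 0:
        return f"{scaled}:1"
    whole, frac = divmod(scaled, scale)
    return f"{whole}.{frac:0{decimals}d}:1"


def ratio_table(counts):
    """Build the three ratios for each search engine."""
    table = {}
    for engine, c in counts.items():
        table[engine] = {
            "html/pdf": truncated_ratio(c["html"], c["pdf"]),
            "html/tif": truncated_ratio(c["html"], c["tif"]),
            "pdf/tif": truncated_ratio(c["pdf"], c["tif"], decimals=1),
        }
    return table


if __name__ == "__main__":
    for engine, ratios in ratio_table(COUNTS).items():
        print(engine, ratios)
```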
For new electronic publications, information providers may find it easiest to make the documents available using HTML, the native document format of the WWW. For those publications that have already been published in paper form, it is a toss-up as to the best method of document format. The effort to convert the printed material to HTML or PDF formats is expensive, while that for TIFF format is relatively cheap. TIFF provides the best rendering of the original document, but at the expense of disk space. Bitmapped image TIFF documents also take longer to send over the Internet. This is a consideration if most users access the document database using modem dialup connections, which are considerably slower than direct Internet connections.
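The disk space and transmission time tradeoff can be estimated with simple arithmetic. The sketch below is a rough estimate only. It assumes a black and white page of 8.5 by 11 inches scanned at 1 bit per pixel, and the compression ratio of 10:1 and the modem speed of 28,800 bits per second used in the example are illustrative assumptions, not measurements from this paper.

```python
def page_bytes(width_in=8.5, height_in=11.0, dpi=300, bits_per_pixel=1):
    """Uncompressed size in bytes of one scanned page."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


def transfer_seconds(num_bytes, link_bits_per_second):
    """Ideal transfer time, ignoring protocol overhead and retransmission."""
    return num_bytes * 8 / link_bits_per_second


def document_transfer_seconds(pages, compression_ratio, link_bits_per_second, dpi=300):
    """Estimated time to send a compressed multi-page scanned document."""
    compressed_bytes = pages * page_bytes(dpi=dpi) / compression_ratio
    return transfer_seconds(compressed_bytes, link_bits_per_second)


if __name__ == "__main__":
    # Assumed figures: 10 pages, 10:1 compression, 28,800 bit/s modem.
    print(page_bytes(), "bytes per uncompressed page")
    print(document_transfer_seconds(10, 10, 28_800), "seconds for the document")
```

Under these assumed figures, a ten-page article comes to about one megabyte after compression and takes a little under five minutes on the assumed modem link. Actual compression ratios vary widely with page content.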
4. SENDING ELECTRONIC DOCUMENTS
Libraries and information providers may allow access to their document collections by systems other than servers, especially for paper documents that are not likely to be repeatedly requested. In a manner similar to facsimile transmission, paper-based documents may be scanned, converted to electronic form, and sent over the Internet to patrons. Systems for doing this are the Workstation for Interlibrary Loan (WILL)[4],[5], a prototype being developed at NLM, and Ariel, which is distributed by Research Libraries Group (RLG).[6] An Ariel system consists of a personal computer running Windows and the Ariel software. The computer has a scanner and printer attached. An operator scans a paper document using the scanner and views the images on the screen. After all pages have been scanned, the document is sent across the Internet to a remote Ariel system. The remote system prints the document much as a facsimile machine prints a received document. Ariel has advantages over facsimile, because the documents may be scanned at 300 dots per inch resolution (fax is 200 dpi), and the Internet provides faster transmission and lower error rates than possible through standard fax.
Ariel uses a communications protocol similar to FTP.[7] It requires that the sending and receiving workstations both be powered up for the document transmission to occur. Ariel sending stations have a feature called 'Store and Forward,' allowing the document to be stored at the sending station if it cannot contact the receiving Ariel system. Such is the case if the receiving station is powered down. When the receiving station is first powered up, it contacts the sending station, which then checks its queue for any documents to be sent to the querying station. At that time the sending station sends its documents.
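The Store and Forward behavior described above can be modeled with a small queue per receiving station. The Python sketch below is a toy model of the idea only. The class and method names are hypothetical, and it does not represent Ariel's actual protocol or implementation.

```python
from collections import defaultdict, deque


class SendingStation:
    """Toy sender that holds documents for receivers it cannot reach."""

    def __init__(self):
        self.pending = defaultdict(deque)  # receiver -> queued documents
        self.online = set()                # receivers currently reachable
        self.delivered = []                # (receiver, document), in send order

    def send(self, receiver, document):
        """Deliver now if the receiver is up; otherwise store the document."""
        if receiver in self.online:
            self.delivered.append((receiver, document))
            return True
        self.pending[receiver].append(document)
        return False

    def receiver_powered_up(self, receiver):
        """Handle a receiver that contacts this station after starting up.

        Any documents stored for it are forwarded in their original order.
        Returns the number of documents forwarded.
        """
        self.online.add(receiver)
        queue = self.pending.pop(receiver, deque())
        forwarded = len(queue)
        while queue:
            self.delivered.append((receiver, queue.popleft()))
        return forwarded

    def receiver_powered_down(self, receiver):
        self.online.discard(receiver)


if __name__ == "__main__":
    station = SendingStation()
    station.send("library-b", "article-1")       # stored, receiver is down
    print(station.receiver_powered_up("library-b"))
    print(station.delivered)
```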
An alternative to Ariel document transmission is email. Email has traditionally been used to communicate short text messages from one computer to another. It has also been one of the main reasons that so many people use the Internet. A few years ago the specifications for Internet email were revised to permit files to be attached to email messages. This is the Multipurpose Internet Mail Extensions (MIME) specification, which allows files of arbitrary type to be sent through email.[8] Given the proper MIME email client software, any file may be attached to an email message and sent to a user with MIME-compliant software. The files may consist of text, binary executables, images, audio or video information.
There are two ways in which MIME email can be used for document delivery. In the first technique the document is attached to the email message, and sent by means of Simple Mail Transfer Protocol (SMTP)[9] to a mail server. The mail server then sends it over the Internet to the email server that services the intended recipient. The document remains in the email server until the patron logs on and retrieves it. A second technique for sending email documents avoids attaching the document to the email message. Instead, the document is stored on an FTP server (perhaps on the computer where the document originates). The email message to the recipient will contain an address pointer to the FTP server, and the name of the document. The email message produced is small, since it does not contain an attached document. When the intended recipient receives the email and reads the message, he or she must initiate the FTP connection to the remote FTP server and retrieve the document. It is possible to build this FTP connection into an email package, for automated retrieval, but since there is no standard requiring this, the document may have to be manually retrieved.
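The two techniques can be sketched with Python's standard `email` package. This is a minimal illustration, not a description of any particular delivery system: the addresses, host name, file name and subject lines are hypothetical, and the messages are only constructed, not sent.

```python
from email.message import EmailMessage


def build_attached_message(sender, recipient, tiff_bytes, filename):
    """First technique: the document travels inside the message."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Requested document"
    msg.set_content("The requested document is attached as a TIFF file.")
    # Adding an attachment turns the message into multipart/mixed.
    msg.add_attachment(tiff_bytes, maintype="image", subtype="tiff",
                       filename=filename)
    return msg


def build_pointer_message(sender, recipient, ftp_host, path):
    """Second technique: the message only points at a file on an FTP server."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "Requested document available by FTP"
    msg.set_content(
        "The requested document is available at:\n"
        f"ftp://{ftp_host}/{path}\n"
    )
    return msg


if __name__ == "__main__":
    fake_tiff = b"II*\x00placeholder image data"  # stand-in, not a real TIFF
    attached = build_attached_message(
        "library@example.org", "patron@example.org", fake_tiff, "article.tif")
    pointer = build_pointer_message(
        "library@example.org", "patron@example.org",
        "ftp.example.org", "pub/article.tif")
    print(len(attached.as_bytes()), "bytes with attachment")
    print(len(pointer.as_bytes()), "bytes with pointer only")
```

Either message could then be handed to an SMTP server, for example with `smtplib.SMTP.send_message`. The pointer message stays small regardless of document size, which is the tradeoff described above.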
A comparison of Ariel and the two types of MIME email delivery may be made, as follows. Email's advantages are first listed.
- MIME email allows the use of document formats other than CCITT Group 4 black and white images, as required by Ariel. "Documents" may be color or gray scale images, and audio or video files.
- MIME email is rapidly gaining popularity, and packages for most computer platforms are becoming available. This will make it easier to deliver documents to end users, who are unlikely to have Ariel systems at their sites, since libraries and document providers are the main users of Ariel, not end users.
- With MIME email, the sending and receiving computers do not both need to be powered up simultaneously for document transmission to occur, as required by Ariel, though its Store and Forward feature partially alleviates this problem. A server will hold the document until the receiving computer requests its mail.
- Internet Protocol (IP) addresses, used by Ariel systems, have an ephemeral nature. With modem dial-up to Serial Line Internet Protocol (SLIP) or Point-to-Point Protocol (PPP) providers, the IP address tends to change dynamically each time the call is made. This makes it difficult for Ariel system operators to send documents to patrons having dial-up connections. The same is true in other environments where the IP address may change from session to session. One example is a Windows NT Server that uses Dynamic Host Configuration Protocol (DHCP). DHCP allows the NT Server to allocate IP addresses on the fly. As computers disconnect from the network, addresses are freed up and become available for new computers when they connect to the network. The assignment of the IP addresses is done automatically by a DHCP server running on the NT Server. This makes Ariel document transmission awkward, since it may be difficult or impossible to keep track of changes in IP addresses. The one thing in the world of the Internet that is fairly slow to change is a patron's email address. For email addresses, there is nothing equivalent to the ever-changing environment of DHCP. This gives email a distinct advantage over Ariel document delivery, especially for end users.
- Firewalls allow email delivery, but usually prevent Ariel document delivery. Firewalls are security measures placed around some networks that prevent unauthorized entry from outside the network. While they may allow Ariel documents originating from within the firewall to exit the firewall, the reverse is usually prevented. MIME email delivery is allowed in both directions.
On the other hand, Ariel has some advantages over MIME email for document delivery:
- Document transmission time could be shorter for Ariel document delivery. One problem with the first MIME email technique (attaching a document to the message) is that some gateways on the Internet do not have large enough buffers to handle very large files. For this reason, it is necessary to split the email message into smaller components at the sending end. The receiving end must reassemble the components to get the original document. Due to this slicing and gluing, some extra time will be needed for document delivery. The time is dependent on the speed of the sending and receiving computers. For fast computers, this delay may not be noticeable.
- The recipient of MIME email must have a suitable document viewer. TIFF viewers are available for most computer platforms, but the user must usually take the extra steps in obtaining them. They are not normally included with MIME email packages.
- Depending on the sophistication of the MIME email package, the user may have to manually take extra steps in first saving an attached document, and then launching a suitable viewer to see the document. For the second MIME email delivery technique, the user may have to manually retrieve the document from an FTP server. These extra steps are not required for Ariel delivery.
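The splitting and reassembly mentioned in the first item above can be illustrated with a short Python sketch. It shows only the bare idea of numbering the pieces so that the receiver can put them back in order. The MIME specification [8] defines a message/partial type for this purpose, but the tuple layout and function names used here are hypothetical and do not follow that format.

```python
def split_document(data, max_part_size):
    """Split data into numbered parts of at most max_part_size bytes.

    Each part is a tuple (part_number, total_parts, chunk), numbered from 1.
    """
    if max_part_size <= 0:
        raise ValueError("max_part_size must be positive")
    total = max(1, -(-len(data) // max_part_size))  # ceiling division
    return [
        (i + 1, total, data[i * max_part_size:(i + 1) * max_part_size])
        for i in range(total)
    ]


def reassemble(parts):
    """Rebuild the original data from parts received in any order."""
    ordered = sorted(parts, key=lambda part: part[0])
    total = ordered[0][1]
    numbers = [number for number, _, _ in ordered]
    if numbers != list(range(1, total + 1)):
        raise ValueError("missing or duplicate parts")
    return b"".join(chunk for _, _, chunk in ordered)


if __name__ == "__main__":
    document = bytes(range(256)) * 10            # 2,560 bytes of sample data
    parts = split_document(document, 1000)
    print(len(parts), "parts")
    print(reassemble(list(reversed(parts))) == document)
```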
Document delivery over the Internet via MIME email can be accomplished today, although manually. A paper document may be scanned, the resulting file then saved, and sent over the Internet by means of an appropriate email package. Developments are under way to integrate the process of scanning and document transmission using MIME email, with products expected to be introduced in 1996. One software tool that will take advantage of this new development is DocView, described in the next section.
5. DOCVIEW
DocView is developed by the Lister Hill National Center for Biomedical Communications, an R&D division of the National Library of Medicine, which researches and develops new methods of delivering information to the biomedical community. DocView is a prototype Windows application for the end user's desktop, providing two ways of accessing and using printed literature through the Internet.[10],[11] First, DocView permits printed documents to be received over the Internet from remotely located Ariel stations. Just as Ariel systems can send documents to one another, they can also send documents to computers running DocView. Second, DocView serves as a viewer for World Wide Web client applications such as Netscape. DocView provides viewing capability for documents stored as multi-page TIFF images. Printed literature stored in this format on World Wide Web, Gopher or FTP servers can be accessed and downloaded by Netscape, then viewed through DocView. Once DocView has received documents either from an Ariel system or an Internet client application, it provides the user with tools for using the electronic images. These tools include displaying the pages on the screen, manipulating the images (zoom, scroll, pan, rotate), copying portions of pages of interest, electronically marking desired pages, and printing the pages. With these functions DocView provides a library patron an important capability for accessing and using virtual libraries on the Internet.
In recent DocView development, a messaging facility has been added to aid in requesting documents from libraries. DocView's message facility allows a user to send text messages to Ariel systems or to other DocView users. The user interface is similar to that of an email package: it lets the user create messages, address and send them, and keep track of transmission status. Since the message needs to be Ariel-compatible, DocView converts the text to Group 4 compressed images before it sends the message using Ariel-compatible communication protocols. This facility gives the library patron another way to request library documents. Document requests come to libraries in various formats: paper request forms sent by mail, telephone calls, email or facsimile. Some libraries also offer on-line ordering systems such as DOCLINE, a request and routing system used by the U.S. medical library community. The DocView-Ariel alternative is yet another way of requesting documents. It may prove to be convenient for the librarian, since the request is automatically printed at the same Ariel computer at which the document would be scanned and sent to the requester. It is also convenient for the library patron, since the request will generally get to the library faster than through some of the other request methods.
DocView beta testing started in the second half of 1995 at several libraries. Among some early reporting of minor bugs was an interesting problem reported by one patron. Over a period of six weeks that individual had received thirty documents from his library's Ariel system, but additional documents ordered were not being received. The reason was that DocView had an internal limitation of thirty documents, where each document was a complete journal article. This was a limit originally imposed by design since it was assumed that no user would ever want to have many documents occupying the local hard disk drive. It was expected that most DocView users would receive a document, perhaps view it, print it, and then delete it, rather than storing it electronically for an indefinite period.
A recent study provides insight into why people save documents received over the Internet.[12] Although this study pertains to HTML documents, it could apply to other types of Internet documents, such as those received by DocView users. The foremost reason for saving documents was to use the information in the document off-line (59.7% of respondents). Other reasons included a need to read the document off-line (50.9%), distribute to others not on-line (45.2%), and archiving the document content (30.9%). It is noteworthy that 18.8% of users saved documents in fear that the item would no longer be available. A future user satisfaction questionnaire will try to elicit such information from DocView beta testers.
DocView's developers have recognized the need to address the problem of managing large numbers of documents. The typical personal computer is currently sold with a 1 gigabyte hard disk drive, a capacity that makes it possible to retain dozens of bitmapped image documents. A recent DocView development is the addition of an aid for managing collections of documents. This is done through "folders" and "file cabinets." As in the physical world, DocView's folders permit documents to be grouped together in logical categories. For example, a user may want to group documents of similar topics or documents having the same author. Grouping documents will tend to reduce the number of documents visible on the screen at any given time, and hence make viewing easier. The user may create new folders, rename them, delete them, and move documents from one folder to another. DocView's file cabinets are the locations where folders are kept. File cabinets can be located on different hard disk drives or on different servers on a local area network. As with folders, the user may create, rename or delete file cabinets. The user may also move folders from one file cabinet to another.
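The folder and file cabinet organization can be pictured as a two-level data structure. The Python sketch below is one possible in-memory model of the operations listed above. The class, method and location names are hypothetical and are not taken from DocView itself.

```python
class FileCabinet:
    """A named location (such as a drive or network share) that holds folders."""

    def __init__(self, name, location):
        self.name = name
        self.location = location   # hypothetical, e.g. a drive letter or share
        self.folders = {}          # folder name -> list of document titles

    def create_folder(self, folder):
        if folder in self.folders:
            raise ValueError(f"folder already exists: {folder}")
        self.folders[folder] = []

    def rename_folder(self, old, new):
        if new in self.folders:
            raise ValueError(f"folder already exists: {new}")
        self.folders[new] = self.folders.pop(old)

    def delete_folder(self, folder):
        del self.folders[folder]

    def add_document(self, folder, title):
        self.folders[folder].append(title)

    def move_document(self, title, source, target):
        # Look up the target first so a bad name leaves the source untouched.
        target_documents = self.folders[target]
        self.folders[source].remove(title)
        target_documents.append(title)


def move_folder(folder, source_cabinet, target_cabinet):
    """Move a folder and its documents from one file cabinet to another."""
    if folder in target_cabinet.folders:
        raise ValueError(f"folder already exists in target: {folder}")
    target_cabinet.folders[folder] = source_cabinet.folders.pop(folder)


if __name__ == "__main__":
    local = FileCabinet("Local", "C:")
    shared = FileCabinet("Shared", "server share")
    local.create_folder("Imaging")
    local.add_document("Imaging", "Article on TIFF compression")
    move_folder("Imaging", local, shared)
    print(shared.folders)
```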
In addition to the development of document management techniques for DocView, a new method of receiving documents is planned: through MIME email. As stated in the previous section, MIME email offers several advantages over Ariel for document delivery over the Internet. DocView will have its own MIME email built into the DocView messaging facility, to make the document reception and viewing seamless. DocView can also serve as a TIFF viewer for documents received through external email packages. The user will have the option of using the built-in facility or an external email package. The ability to receive documents sent through either Ariel or email communications protocols is expected to make DocView an even more useful tool for patrons of virtual libraries.
6. SUMMARY
Recent innovations in computer software and communications technology are making it possible for libraries to expand the manner in which they provide information to their patrons. Remote access to library information through the Internet is now possible. Two general techniques are available for delivering documents over the Internet. One is server-based, which allows repeated access to an on-line document collection. The second approach is point-to-point transmission such as that available through Ariel systems. DocView, which is a prototype Windows application developed at NLM, helps the library patron take advantage of both document delivery techniques. It is capable of receiving documents sent from Ariel systems, while providing TIFF viewing capability for bitmapped image documents retrieved from Internet servers. It is in the Internet delivery and handling of bitmapped image documents, which render a highly accurate electronic representation of the paper version, that DocView finds a niche. DocView is currently being beta tested in the field, and plans are being made to enhance it to take advantage of new Internet document delivery techniques such as MIME email.
7. REFERENCES
1. Specifications for HTML, available through the World Wide Web Consortium. Available over the Internet at http://www.w3.org/MarkUp/
2. Bienz, T. and Cohn, R. Portable Document Format Reference Manual. Adobe Systems Incorporated, Addison-Wesley Publishing Company, 1993.
3. TIFF Revision 6.0, Aldus Corporation, June 3, 1992.
4. Thoma GR, Hauser SE, Le DX, Muller D, Walker FL. Advances in the Management of Interlibrary Loan, Vol. 1. Internal technical report. Bethesda, MD: National Library of Medicine, Lister Hill National Center for Biomedical Communications, Communications Engineering Branch, September 1995.
5. Thoma GR, Hauser SE, Le DX, Muller D, Walker FL. WILL: Design of a Standalone WILL Unit, Vol. 2. Internal technical report. Bethesda, MD: National Library of Medicine, Lister Hill National Center for Biomedical Communications, Communications Engineering Branch, September 1995.
6. Bharadwaj, R. The Ariel project. Proceedings of ASIS, Vol. 28 (1991); p.339.
7. Postel, J. and Reynolds, J. File Transfer Protocol, Request for Comments #959, October 1985.
8. Borenstein, N. and Freed, N. MIME (Multipurpose Internet Mail Extensions), Request for Comments #1341, June 1992.
9. Postel, JB. Simple Mail Transfer Protocol, Request for Comments #821, August 1982.
10. Walker FL, Thoma GR. Access to document images over the Internet. Proceedings IOLS'94. Medford NJ: Learned Information, 1994; 185-97.
11. Walker FL, Thoma GR. DocView: Providing Access to Printed Literature through the Internet. Proceedings IOLS'95. Medford NJ: Learned Information, 1995; 165-173.
12. GVU's Fourth WWW User Survey, Georgia Institute of Technology's Graphic, Visualization & Usability Center, 1995, available over the Internet at URL: http://www.cc.gatech.edu/gvu/user_surveys/.








