System for Preservation of Electronic Resources (SPER)

Project Member(s): George Thoma, Dharitri Misra, Siyuan Chen

The preservation of digital resources raises important and, in many cases, still unanswered questions, offering a rich field for active research in computer science and engineering. Our Digital Preservation Research (DPR) project focuses on some of the key functions of an economical and robust digital preservation system through the development of a System for Preservation of Electronic Resources (SPER). SPER consists of three basic functions: ingest, metadata extraction, and file migration. It is OAIS-compliant and designed in a modular and cost-effective manner. Its basic infrastructure functions (e.g., ingest, low-level bit moving) are implemented through MIT's open-source DSpace, while other key functions are implemented by in-house modules. Two of these in-house functions are automated metadata extraction and file migration.

Extracting metadata automatically from the contents of material that needs to be preserved, rather than relying on manual entry, is probably the only way large collections can be economically preserved. Our research seeks to develop document image analysis and machine learning techniques to automatically extract descriptive and technical metadata from a variety of document formats that exist in NLM's collections, including scanned document images, online journal articles, and our Web pages that carry a permanent rating. We have developed modules based on Support Vector Machines, Hidden Markov Models, and pattern matching to implement automated metadata extraction.
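As a rough illustration of the pattern-matching approach mentioned above, the following minimal Python sketch pulls a few descriptive fields out of plain text such as OCR output. The field names, the regular expressions, and the sample text are all assumptions made up for this example; they are not the rules or data used in SPER, and the Support Vector Machine and Hidden Markov Model modules are not represented here.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)

# Hypothetical field patterns (assumptions for illustration only,
# not the extraction rules used by SPER).
FIELD_PATTERNS = {
    "case_number": re.compile(r"Case\s+No\.?\s*(\d+)", re.IGNORECASE),
    "date": re.compile(r"\b((?:" + MONTHS + r")\s+\d{1,2},\s+\d{4})\b"),
    "defendant": re.compile(r"U\.\s?S\.\s+v\.\s+(.+?)(?:\.|$)", re.MULTILINE),
}


def extract_metadata(text):
    """Return a dict of the fields whose pattern matches somewhere in text."""
    metadata = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match:
            # Keep only the captured value, not the surrounding label.
            metadata[field] = match.group(1).strip()
    return metadata


# Invented sample text standing in for OCR output of a scanned page.
SAMPLE_TEXT = """Case No. 1234
U. S. v. Example Drug Company.
Judgment entered March 5, 1912
"""

if __name__ == "__main__":
    print(extract_metadata(SAMPLE_TEXT))
```

Fields that do not match are simply left out of the result, so a downstream step could route incomplete records to manual review. Rules of this kind only work for layouts that are regular enough to describe with fixed patterns.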

Tools such as DocMorph and MyMorph are also being developed by CEB to automate the migration of files in bulk. Bulk migration is important for converting files in formats that face obsolescence, largely because newer software and modern computers no longer support them; such files will become inaccessible as time passes. DocMorph is a service that SPER accesses for file migration.
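The selection step of a bulk migration can be sketched in Python as follows: given a list of file names, pick out those in formats considered at risk and pair each with a target name. The format mapping and the function name are hypothetical, chosen only for this example. The sketch does not call DocMorph or MyMorph, whose interfaces are not described here, and the conversion itself is left to whatever service performs it.

```python
from pathlib import PurePosixPath

# Hypothetical mapping from at-risk source formats to a target format.
# These choices are assumptions for illustration, not SPER policy.
MIGRATION_TARGETS = {
    ".tif": ".pdf",
    ".tiff": ".pdf",
    ".wpd": ".pdf",
}


def plan_migration(filenames):
    """Return (source, target) name pairs for files that need migration."""
    plan = []
    for name in filenames:
        path = PurePosixPath(name)
        # Compare extensions case-insensitively (".TIF" and ".tif" match).
        target_suffix = MIGRATION_TARGETS.get(path.suffix.lower())
        if target_suffix is not None:
            plan.append((name, str(path.with_suffix(target_suffix))))
    return plan


if __name__ == "__main__":
    files = ["scans/page1.TIF", "notes.txt", "brief.wpd"]
    for source, target in plan_migration(files):
        print(source, "->", target)
```

Separating the planning step from the conversion keeps the list of at-risk formats in one place, so it can be updated as formats age without touching the conversion code.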

SPER is currently used to preserve an early 20th-century medico-legal collection of court cases from the Food and Drug Administration, acquired by NLM's History of Medicine Division.

These intramural R&D activities, and the use of our SPER prototype in a real-world application, are essential for the optimal design of operational systems we can employ to fulfill our mandate to preserve the biomedical literature.