A Framework of Guidance for Building Good Digital Collections
Objects Principle 1: A good object exists in a format that supports its intended current and future use.
Consequently, a good object is exchangeable across platforms, broadly accessible, and formatted according to a recognized standard or best practice.
There is a direct correlation between the production quality of a digitized object and the readiness and flexibility with which that object may be used, reused, and migrated across platforms. As a result, the creation of digital objects at the appropriate level of quality can pay off in the long run as the objects are rendered more useful and accessible over the longer term. An object intended to have long-term value should be formatted to render it exchangeable across platforms and broadly accessible. Not all objects, of course, will have long-term value. A project needs to assess the value of the digital objects in its collections and make appropriate decisions about persistence and interoperability.
When speaking of digital content, the word “format” carries multiple meanings. Several of these are discussed in an introductory essay on the Sustainability of Digital Formats: Planning for Library of Congress Collections website (http://www.digitalpreservation.gov/formats/). In the context of this document, two of the most important meanings pertain to file formats and bitstream encodings. File formats are generally identified by file extensions (e.g., .mp3) or MIME types (e.g., text/html). Bitstream encodings underlie certain file formats, e.g., the linear pulse code modulated (LPCM) waveforms that may be found in WAVE or AIFF files, or the H.264 video that may be found in QuickTime or MPEG-4 files. Those two encodings are specific to a content category (in these cases, audio and video), while others are generic, e.g., LZW (Lempel-Ziv-Welch compression encoding). There are few strict correlations between file formats and encodings.
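The extension-to-MIME-type mapping mentioned above can be illustrated with Python's standard-library `mimetypes` module. This is a minimal sketch only: real identification tools such as DROID examine internal file signatures rather than trusting the extension.

```python
import mimetypes

def identify_format(filename):
    """Guess a MIME type from a file extension; returns None when unknown.

    Extension-based guessing is fallible: a renamed file keeps its old
    extension, which is why signature-based tools like DROID exist.
    """
    mime_type, _encoding = mimetypes.guess_type(filename)
    return mime_type

print(identify_format("page.html"))  # text/html
print(identify_format("song.mp3"))   # audio/mpeg
```
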
A variety of international efforts have been launched to document digital formats and to provide tools to help manage them. Important examples include the following:
- Global Digital Format Registry website http://hul.harvard.edu/gdfr/. A model and implementation of interoperating distributed format registries, initiated in the United States. See also Stephen L. Abrams and David Seaman, Towards a Global Digital Format Registry (2004) http://www.ifla.org/IV/ifla69/papers/128e-Abrams_Seaman.pdf.
- PRONOM and DROID website http://www.nationalarchives.gov.uk/pronom/. Two tools developed by The National Archives (United Kingdom). PRONOM is an online registry of technical information about file formats, software products, and related topics. DROID is an automatic file format identification tool.
- Automatic Obsolescence Notification System (AONS) website http://pilot.apsr.edu.au/wiki/index.php/AONS_II. Now under development by the National Library of Australia (NLA) and the Australian Partnership for Sustainable Repositories (APSR), AONS will be a platform-independent, downloadable tool that automatically provides information from authoritative international registries, informing users when file formats in their repositories are obsolete or at risk of becoming obsolete.
Information about file formats and encodings is provided in the two tables below: one for reformatting activities and one for the acquisition of born-digital content. The understanding and application of digital technology to library and archive content and its preservation has moved unevenly across the various content types (and sometimes subtypes) that are listed in the tables below. The resulting variation is reflected in the number and quality of references cited, and in the confidence that the compilers of this document bring to each content type.
|Content Category||Target Formats||References and Comments|
|Printed matter and manuscripts, not oversize (images of pages)||Master images as uncompressed TIFF files (well established) or lossless compressed JPEG2000 (emerging practice). Service-image formats for access vary according to delivery system, generally favoring formats supported natively in browsers or via free, widely available plug-ins, e.g., JPEG and PDF.||
Quality factors include bit depth and spatial resolution; these vary from project to project, with most selecting grayscale or color images in the 300-600 ppi range, at 8 or 24 bits per pixel, for master images, and few selecting bitonal. Special use cases can motivate a project to scan at higher levels of resolution; certain classes of old manuscripts, for example, have been digitized at levels as high as 2,400 ppi for study by scholars.
The most thorough general treatment of raster imaging is the 2004 document from the National Archives and Records Administration: Technical Guidelines for Digitizing Archival Materials for Electronic Access: Creation of Production Master Files — Raster Images (http://www.archives.gov/research_room/arc/arc_info/techguide_raster_june2004.pdf). Another example of general coverage of raster imaging is the California Digital Library: Digital Image Format Standards (http://www.cdlib.org/news/pdf/CDLImageStd-2001.pdf). A useful document limited to printed matter is Benchmark for Faithful Digital Reproductions of Monographs and Serials (http://www.diglib.org/standards/bmarkfin.htm), Version 1 (DLF, 2002). Meetings of the JPEG 2000 in Archives and Libraries Interest Group (http://j2karclib.info/) have highlighted growing interest in employing JPEG 2000 images as masters or archival formats in reformatting projects. Although particular to newspapers (and related to the scanning of microfilm rather than paper), guidelines potentially of broad utility are offered by the Library of Congress National Digital Newspaper Program (NDNP) Technical Guidelines for Applicants (http://www.loc.gov/ndnp/pdf/ndnp_techguide.pdf – click link to Newspaper Digitization). Meanwhile, a new activity within the NDIIPP project at the Library of Congress is bringing together several federal agencies, including NARA and the Government Printing Office, to develop guidelines and standards for use within the federal government. A public website is expected before the end of 2007.
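The storage implications of the quality factors above are easy to estimate: an uncompressed master's size is simply pixel count times bytes per pixel. A minimal sketch (ignoring TIFF header and tag overhead, which is small relative to the image data):

```python
def master_file_size_bytes(width_in, height_in, ppi, bits_per_pixel):
    """Approximate size of an uncompressed raster master.

    Pixel dimensions come from physical size times sampling rate;
    header/tag overhead in the container is ignored.
    """
    width_px = int(width_in * ppi)
    height_px = int(height_in * ppi)
    return width_px * height_px * bits_per_pixel // 8

# A letter-size (8.5 x 11 inch) page scanned at 400 ppi, 24-bit color:
size = master_file_size_bytes(8.5, 11, 400, 24)
print(f"{size / 1e6:.1f} MB")  # about 44.9 MB per uncompressed master
```

Doubling the resolution quadruples the file size, which is why the 2,400 ppi manuscript scans mentioned above are reserved for special use cases.
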
|Printed matter and manuscripts (machine-readable texts)||Master files as marked-up texts, generally within an established XML schema or DTD, e.g., TEI or TEI-lite. Formats for access vary according to the requirements of indexing and/or delivery systems.||Most online presentations depend upon images for the authoritative representation of content, with text accuracy considered satisfactory if sufficient to support search and retrieval. Quality factors generally focus on the degree and correctness of markup. General information on markup is provided by Creating and Documenting Electronic Texts; the Text Encoding Initiative (TEI) (http://www.tei-c.org/) is an important initiative; see TEI Text Encoding in Libraries: Guidelines for Best Encoding Practices (http://www.diglib.org/standards/tei.htm) (DLF, 1999). Some additional links (including ALTO, the Analyzed Layout and Text Object) are provided by the National Digital Newspaper Project (http://www.loc.gov/ndnp/metadatalinks.html). Many online presentations of manuscripts do not include machine-readable texts.|
|Pictorial materials (reflected light)||Master images as uncompressed TIFF files (well established) or lossless compressed JPEG2000 (future practice). Derivative formats for access vary according to delivery system, often formats supported natively in browsers, most often JPEG.||Quality factors include bit depth and spatial resolution; these vary from project to project, with most selecting grayscale or color images in the 300-600 ppi range, at 8 or 24 bits per pixel, for master images. Some projects may scan at higher levels because their designated community wishes to examine very small or subtle features of the original. The most thorough general treatment of raster imaging is the 2004 document from the National Archives and Records Administration: Technical Guidelines for Digitizing Archival Materials for Electronic Access: Creation of Production Master Files - Raster Images (http://www.archives.gov/research_room/arc/arc_info/techguide_raster_june2004.pdf) (June 2004). Another example of general coverage of raster imaging is the California Digital Library Digital Image Format Standards (http://www.cdlib.org/news/pdf/CDLImageStd-2001.pdf).
New approaches are under consideration; see recent papers by NARA’s Steve Puglia and Erin Rhodes (examples: http://www.rlg.org/en/page.php?Page_ID=21033 - article2 and http://www.imaging.org/conferences/archiving2007/details.cfm?pass=50). Meetings of the JPEG 2000 in Archives and Libraries Interest Group (http://j2karclib.info/) have highlighted growing interest in employing JPEG 2000 images as masters or archival formats in reformatting projects.
|Pictorial materials (negatives, other transmitted light)||Same as above today; practice may vary in the future.||Today’s practices are generally very similar to the preceding, except for an emerging preference for higher bit depth (“extended data range”) to accommodate all of the information in a photographic negative. The Library of Congress frequently scans black-and-white negatives as 16-bit grayscale images (for example, see http://memory.loc.gov/ammem/collections/anseladams/aambuild.html). Spatial resolution at the level of a negative (often much smaller than a print) is often expressed in terms of overall pixel count, e.g., “5,000 pixels on the long side.” Some discussion of emerging practices will be found in recent papers by NARA’s Steve Puglia and Erin Rhodes (examples: http://www.rlg.org/en/page.php?Page_ID=21033 - article2 and http://www.imaging.org/conferences/archiving2007/details.cfm?pass=50).|
|Oversize typographic or pictorial materials||Master images generally employ the same formats as the preceding. Delivery to users, however, often exploits the scaling or tiling functionality of formats like JPEG 2000, MrSID, or DjVu, with on-the-fly creation of GIF or JPEG images for end users.||Scant information is provided by pages like http://memory.loc.gov/ammem/help/mrsid.html at the Library of Congress and http://www.delamare.unr.edu/maps/digitalcollections/nvmaps/siteinfo.html at the University of Nevada, Reno.|
|Sound recordings, no synchronized transcription (music or speech)||Masters should consist of a linear PCM bitstream, which may be wrapped in a WAVE, AIFF, or Broadcast WAVE file; the Broadcast WAVE wrapper can carry additional embedded metadata. End-user delivery formats are typically MP3, QuickTime, WindowsMedia, and RealAudio.||INTRODUCTORY DISCUSSIONS:
Useful information pertaining to the digitization of speech is offered by the National Gallery of the Spoken Word (NGSW) projects, based at Michigan State University (http://www.historicalvoices.org/papers/audio_digitization.pdf and http://www.historicalvoices.org/papers/sounds.rtf). The transcription and translation of spoken word content (as of 2002) is described in this report from a working group supported in the US by the NSF and in the EU by DELOS (http://www.dcs.shef.ac.uk/spandh/projects/swag/). See also the next table row for information about the synchronization of sound content and transcriptions (which may be musical as well as textual).
|Sound recordings with synchronized transcriptions (music or speech, e.g., oral histories)||For sound formats, see the preceding table row. Synchronized music notation and text formats represent emerging practices; see the examples cited in this row.||INTRODUCTORY DISCUSSIONS:
The Spoken Word Project (http://www.at.northwestern.edu/spoken/) at Northwestern University features information on synchronizing transcripts and sound recordings.
See the Northwestern University resources cited above. Examples of projects with synchronization: the OYEZ multimedia archive devoted to the Supreme Court of the United States (http://www.oyez.org/); see examples like the William O. Douglas interviews (http://www.oyez.org/justices/william_o_douglas/interview-tapes/), which uses Adobe Flash to present synchronized content to end users.
|Moving images, video recordings on conventional tangible media (analog and digital videotapes, DVDs)||See comments at right and list of resources in next table row.||CURRENT PRACTICE, HYBRID APPROACH:
For the reformatting of videotapes, most archives continue to produce a new videotape as a preservation master, typically Digital Betacam (DigiBeta); some archives may use the more expensive D1, D5, or other types. All of these magnetic tape formats are obsolete, however, and may require re-reformatting within a decade. Service copies are generally digital files: in a high-bandwidth LAN, high-bit-rate MPEG-2 or MPEG-4 files in larger picture sizes; for lower-bandwidth applications and the Web, lower-rate MPEG-4, RealVideo, or QuickTime formats with smaller picture sizes. A good introduction is provided by the Association of Moving Image Archivists (AMIA) in Reformatting for Preservation: Understanding Tape Formats and Other Conversion Issues (http://www.amianet.org/resources/guides/storage_standards.pdf).
EXPLORING FILE-BASED MASTERS:
Little in the way of fully realized, experience-based documentation exists for this approach; much must be gleaned from e-mail discussion lists and personal communication. One useful guideline for making files containing uncompressed video streams is Standards Analysis for Video Objects: Recommended minimum requirements for preservation sampling of moving image objects, by Isaiah Beard for the Rutgers University RUcore project (http://rucore.libraries.rutgers.edu/collab/ref/dos_avwg_video_obj_standard.pdf). Meanwhile, several experts advocate preservation masters that employ a “frame-by-frame” approach; individual frame images may be uncompressed or encoded as JPEG 2000 (lossless or lossy), within a suitable wrapper (MXF, Motion JPEG 2000, AVI, others); or as MPEG-2 or MPEG-4 “all I frame” encodings; or even as DV. For the MPEG and DV lossy encodings, higher data rates (e.g., 50 Mbps) are preferred to lower. Reformatting (to tapes as well as files) often requires transcoding, e.g., from composite to component color space and, for compressed formats, to compress the signal. In contrast, it is possible to extract the native digital signal from formats like DVDs (MPEG-2) or DV/DVC/DVCPRO videotapes (DV), but there seems to be no established practice for this. Making a file entails placing the encoded digital essence in a wrapper, e.g., MXF, Motion JPEG 2000, AVI, QuickTime, MPEG-4, but again, the community has not yet established practices.
Sound may be interleaved with the video in the “stream,” or may be managed as a separate element within several wrapper formats (e.g., MXF, Motion JPEG 2000, AVI). Audio encoding may be uncompressed linear PCM or compressed (usually lossy) in an encoding that is accepted by the wrapper.
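The storage burden of the higher data rates preferred above is straightforward to estimate; a minimal sketch, assuming a constant bit rate and decimal (10^6-bit) megabits:

```python
def video_storage_bytes(data_rate_mbps, duration_seconds):
    """Storage needed for a constant-bit-rate video stream.

    Converts megabits per second to bytes (divide by 8) and
    multiplies by duration; audio and wrapper overhead are ignored.
    """
    return int(data_rate_mbps * 1_000_000 / 8 * duration_seconds)

# One hour of 50 Mbps "all I-frame" material:
size = video_storage_bytes(50, 3600)
print(f"{size / 1e9:.1f} GB")  # 22.5 GB per hour
```

Figures like this explain why file-based video masters remained an exploratory practice: an uncompressed stream runs an order of magnitude higher still.
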
|Moving images, video recordings on conventional tangible media (analog and digital videotapes, DVDs). Continued from preceding row.||List of resources at right.||As the preceding row indicates, a number of players with a variety of ideas (conventional and cutting-edge) are exploring the conservation of older videotapes and best practices for reformatting them.
|Moving image (film)||See comments at right||CURRENT PRACTICES:
Virtually all archives today employ the well understood and well established approach of traditional photochemical reproduction. The original film is printed onto an appropriate film stock, which is developed in the conventional way, yielding archival masters. Depending upon the configuration of the starting material (i.e., negative or positive, sound or silent) the nature and number of the archival preservation masters varies. One or two additional generations may be printed and developed, yielding duplicating copies and a print for projection. Prints may also be made directly from the original materials when appropriate. Guideline documents include the 2004 Film Preservation Guide: The Basics for Archives, Libraries, and Museums (http://www.filmpreservation.org/preservation/film_guide.html) and the Film Preservation Handbook from Australia’s National Film and Sound Archive (http://www.nfsa.afc.gov.au/preservation/film_handbook/).
The extensive use of digital technology by commercial filmmakers will lead to changes in reformatting in archives during the next few years. Here are two likely approaches:
The information in Table 2 is tentative. Many of the preferred formats listed are those suggested by the Library of Congress Sustainability of Digital Formats website (http://www.digitalpreservation.gov/formats/index.shtml), which emphasizes that its recommendations are provisional. This table, like that website, is written from the perspective of institutions that are likely to receive content from creators not under their control. The Library of Congress, for example, receives content in many ways, ranging from copyright deposit to the donation of personal papers with boxes of floppy disks. Thus the table below allows for the possibility that incoming content may take the form of, say, PDF files or even word-processing files. However, where an institution has any control over born-digital content, it is highly desirable to encourage authors to create their works in specified formats. For example, in some scholarly projects, it may be possible to insist that authors create their documents in XML, following guidelines like those from the Text Encoding Initiative (http://www.tei-c.org/). Similarly, the graduate schools of many universities compel students to submit electronic theses and dissertations in library-approved formats.
Three topics have general applicability and are not articulated row-by-row in the table below. The first is the strong preference of libraries and archives to receive content that includes metadata, whether embedded in a file or as an associated “sidecar.” The second concerns digital works that arrive at an archive in a hard-to-sustain format, prompting the archive to transcode the content into an easier-to-sustain format. Several archivists argue that such content should be kept in both the original form (even though it may be very hard to read in the future) and in the migrated, easier-to-sustain form.
The third general topic has to do with technological protections, the “locking” of content associated with Digital Rights Management (DRM) regimes. Some formats have embedded capabilities to restrict use, say, by time period, to a particular computer or other hardware device, or by requiring a password or active network connection. Since the exploitation of the technical protection mechanisms for a given format is usually optional, this consideration arises when a format is used in a particular business context, e.g., the sale of downloadable music from entities like Apple iTunes. To preserve digital content and provide service to users and designated communities decades hence, custodians must be able to replicate the content on new media, migrate and normalize it in the face of changing technology, and disseminate it to users at a resolution consistent with network bandwidth constraints.
The preceding paragraph offers a problem statement but does not cite examples of library or archive practice that address the matter. In fact, the compilers of this document believe that libraries and archives have only slight experience with DRM or any other aspects of born digital content. Few of us have direct experience with many of the formats listed below. Therefore, we strongly encourage our readers to enrich this table with fresh information or links to relevant resources.
|Content Category||Target Formats||Preferred Formats||References and Comments|
|Textual content, monographic (emphasizing layout or typography)||PDF, PDF/A, various word-processing formats, other||Text formats should represent the underlying text in a way that is accessible to search engines. Preferred are PDF/A or other PDF subtypes created from machine-readable text (as opposed to page images). HTML (a hierarchy or network of linked pages) is acceptable if published/disseminated only in this form. Proprietary binary formats used by word-processing and desktop-publishing software are not a good choice for long-term management; text documents in such formats should be printed to PDF (preferably PDF/A) and/or converted to a transparent, non-proprietary format such as the XML-based OpenDocument format used by OpenOffice.||Guidelines for Creating Archival Quality PDF Files, Florida Center for Library Automation (http://www.fcla.edu/digitalArchive/pdfs/PDFGuideline.pdf). PDF/A Format: Status and Practical Experiences, presentation at the European Document Lifecycle Management (DLM) Forum, 2006.|
|Textual content, monographic (marked up)||XML, SGML, HTML||Preferred: XML or SGML using standard or well-known DTD or schema appropriate to a particular textual genre.||Open eBook Forum (http://www.openebook.org/). Supporting Documentation for ANSI/NISO Z39.86, Specifications for the Digital Talking Book|
Last updated: 09/03/2008