The importance of long-term geospatial data storage
Editor’s note: Punch cards, tape drives, ZIP drives: If you have been involved with computers for any substantial amount of time, these media formats represent stops along the march of technological progress. They also represent just one of the many challenges associated with long-term digital data storage. In addition to media formats that get left behind, file format obsolescence, operating system migration, device death, antiquated applications, and organizational restructuring all threaten the long-term preservation of the digital datasets that can be invaluable for research and historical purposes. Geospatial digital datasets, which are typically large and often poorly documented, are a special case of this larger issue.
Esri writer Jim Baumann interviewed Stanford University’s Dr. Julie Sweetkind-Singer about her work in preserving geospatial datasets. Sweetkind-Singer currently serves as both the assistant director of Geospatial, Cartographic and Scientific Data and Services and the head of the Branner Earth Sciences Library and Map Collections at the school. She is the former librarian for the David Rumsey Historical Map Collection at Cartography Associates and is recognized by the Library of Congress as a digital preservation pioneer.
Baumann: As a recognized authority, please discuss the primary considerations for archiving and preserving digital information over the long term.
Sweetkind-Singer: From a librarian’s point of view, digital data is very different and much more difficult to preserve for extended periods of time than paper-based data. For example, a book on acid-free paper can be kept on a shelf in a cool, dark place for 100 years, and if it is well taken care of, one would expect it to remain in pretty good shape.
With digital information, you have to implement a process from the very beginning that will allow you to preserve it well into the future. This includes making sure that the data is well managed technically: that metadata exists to ensure someone in the future will understand what the data represents and how it has been stored and that legal documents are in place indicating how the data may be used in the future.
It’s important for digital archivists to develop long-term preservation plans that include both technical and legal stipulations. Unless digital files are correctly preserved and documented, we run the risk of losing the information, which is then unavailable to future generations.
Baumann: From an educator’s perspective, what are some of the key reasons to preserve geospatial data?
Sweetkind-Singer: For both educational and research purposes, it is critical that we preserve data for the long term. For example, the opportunity to trace the development of a region using historical maps is useful to researchers who are studying population growth or the change from an agriculture-based to an industry-based economy. A historian may want to know when the railroad first reached the study area, what effect the railroad had on it, what agricultural crops formerly grew there, in which direction the area began its expansion, when major roadways were built through it, and which cities they connected. You can analyze all this over time by studying geospatial data, but only if you have the content to do so. Preserving historic data and continually adding to that collection is a critical part of change detection research.
Baumann: How did the National Geospatial Digital Archive [NGDA] come about, and what role does it play in preserving geospatial data?
Sweetkind-Singer: The NGDA is a collaborative research effort between Stanford University and the University of California at Santa Barbara, with funding from the Library of Congress to examine the issues surrounding the long-term preservation of geospatial data. The program, funded by the Library of Congress, is called the National Digital Information Infrastructure and Preservation Program [NDIIPP].
One of the goals of the NGDA was to set up the structure for a preservation network and eventually add more partners covering a variety of regions around the United States, including both libraries and state archives. Maintaining copies of geospatial data in multiple locations is important for its long-term preservation in the event of a man-made or natural disaster. In addition, I think it’s important to remember that many organizations may produce geospatial data but aren’t involved in its collection or preservation. However, the mandate for libraries and government archives is to preserve valuable documents for the future.
Baumann: What procedures has the NGDA recommended to facilitate the long-term storage of geospatial data?
Sweetkind-Singer: You have to assume that both the software and hardware components that originally created the data will change in the future. Given that, it’s important to have metadata for all geospatial data that is archived, including details about the software that was used to create it and related white papers. We developed a registry to track information about formats because they will certainly change over time. This information was the basis of the Library of Congress’ Geospatial Content section on its Sustainability of Digital Formats website. Regarding the preservation of remotely sensed imagery, you need to know which sensors were used, when they were updated, and what software was used to interpret the data format.
Legal documents are another important part of the long-term data storage process. We drafted agreements with the participating NGDA members about collection development policies specifying what each institution is going to collect and curate. There is another contract that brokers the relationship between copyrighted or licensed data and the university that wants to archive it. Data providers want their data preserved, but as a university, we have to have assurances that our faculty and students can use that data for research and educational purposes. So we have contracts that specify the acceptable use of the archived data. I think long-term data preservation is a matter of developing a plan that includes technical solutions from the IT department, as well as recommendations from librarians, archivists, and lawyers, to make sure geospatial data is properly and legally preserved for the future.
Baumann: Please describe some of the key datasets that you have collected for the Stanford University archive.
Sweetkind-Singer: One of the first datasets we archived was the David Rumsey digital map collection. David Rumsey is a map collector in San Francisco who has spent many years building a fine collection of maps, atlases, and books detailing the growth of cartography in the United States during the eighteenth and nineteenth centuries. About 10 years ago, he decided to enhance his collection by scanning it and making those images available to the general public. Today, he has more than 29,000 items in the digital collection. David uses the digital maps in a variety of ways that are impossible with the printed versions. However, he doesn’t have a robust and secure way to store those digital images for the future. Working together, we were able to provide secure, long-term preservation of the imagery as well as the accompanying metadata.
We also worked with the CSIL [California Spatial Information Library], a government agency tasked with maintaining geospatial data for the state of California. CSIL collects transportation data, Landsat imagery, SPOT imagery, and other content. CSIL is the primary source of California statewide data. In addition, we have downloaded data from the USGS [US Geological Survey] Seamless Data Warehouse. In conversations with John Faundeen, the archivist at the USGS EROS [Earth Resources Observation and Science] Data Center, he was happy to hear that we were downloading high-resolution orthoimagery of the San Francisco Bay Area from the site and archiving it as part of our collection process.
Baumann: As Stanford continues to build its spatial data archives, what do you hope to add to your collection in the near future?
Sweetkind-Singer: We have collected a fair amount of high-resolution orthoimagery for the Bay Area and recently added the elevation data that goes along with it so that researchers can do three-dimensional modeling using the imagery sitting on top of the elevation data. I would also like to collect more datasets for the national parks in California and the state’s coastline data. Important content for our collection is local data from places like the Hopkins Marine Station, which is Stanford’s marine biology station in Monterey [California]. [At Hopkins,] they’ve collected a large amount of heterogeneous data types: imagery, fish populations, transect information, and weather data. Our future data collection activities range from very specific content, such as the Hopkins Marine Station data, to very broad layers like the National Elevation Dataset for the United States.
Baumann: Are standard procedures for the preservation of geospatial data widely implemented in libraries and government archives today?
Sweetkind-Singer: I think that the long-term preservation of data is something that is just emerging as an issue for libraries. While many libraries and state archives are aware of the problem, they don’t really know how to tackle it yet. It may seem at first like an overwhelming task, but breaking the procedure down into its component parts will make the process achievable.
One important effort that has emerged over the past few years, also funded by NDIIPP, is the Geospatial Data Preservation Resource Center. This site has been designed specifically to bring together “freely available web-based resources about the preservation of geospatial information.” It also gives practitioners a place to start, discover best practices, and get their questions answered.
As we go forward, we will figure out sustainable methods to manage, archive, preserve, and create access to digital information, but relatively speaking, we’re in the early days. It’s a process that we’ll develop and refine as we continue to work with this type of content. Long-term data archiving is a very interesting and challenging area for libraries because we are building the digital collections of the future. Libraries have an important role to play in making sure that we provide proper stewardship and preservation of geospatial data.