How to Make a Digital Archive?

This entry was posted by on Sunday, 1 February, 2009 at

OK, in the great tradition of jwz, I must now ask das internets for their opinion on how to proceed on a project. I’ve been asking friends for opinions on this, and I’d like to know what others think.

Here’s the backstory. I’m now an orphan, which, at age 36, is a bit freaky. I’m left wondering whether the first 20 years of my life ever happened. Why? There’s no evidence of it left. No more mom or dad, and no more house I grew up in either (it was sold and gut-rehabbed a couple of years ago.) Was my childhood a hallucination? Am I a Cylon?

All that remains is three gigantic boxes of photos and documents extending back through most of the 20th century. They somewhat tell the story of me, my parents, and my grandparents (and a wee bit about my great-grandparents, all eight of whom emigrated to the U.S. around 1900.) Specifically, I’ve got:

  • A bazillion photos (some labeled, some not)
  • Personal letters, notes, drawings, poems, journals, articles
  • Bits of video

My goal is to organize this stuff into something coherent, so that there’s something I can pass down to my descendants. I’d like to create an archive in both physical and digital form that will last 100 years. I’m imagining that I’ll scan everything to disk, annotate as much as I can, and then re-print everything on acid-free paper. I’ll hand my kids the paper albums and a hard disk, with explicit instructions to re-copy the digital data to new media every 10 years or so. (With the understanding that media will change constantly — holographic storage, quantum storage, whatever…)

But my big question is: what formats do I store data in? Which file formats will still be comprehensible in 100 years?

“Paper”, you say. Duh, well sure, that’s why I’m also printing it all out. But the digital form will be far more convenient over the next N decades. I want my descendants to be able to throw this extra terabyte of genealogical data onto their wristwatch… keep it on their iPod Femto for convenience.

  • Should I scan photos to jpg or tiff? The massive size of a tiff file feels ugly to me, but that size difference will be meaningless in 30 years, and maybe the lossless-ness should be what’s important. (Imagine people in 1983 arguing about whether to include a 5kb or 50kb file in a time capsule!)
  • Should documents be scanned to PDF? Tiff? PDF seems really convenient in terms of being able to encapsulate a multiple-page document, as opposed to awkwardly dumping each page of the document as a separate tiff file. But will PDF be readable in 100 years?
  • Video: mpeg4? Which format is the most openly documented, and will still be decodable in the future?
  • Metadata: how do I describe every document? My naive idea is to create an ASCII text file (probably still readable in 100 years!) which lists each file included in the archive by name, and explains the context and significance of each. The same document would probably include a basic biography of each family member.
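
To make that last idea a bit more concrete, here is a minimal sketch of how such a listing could be generated. Everything in it is an assumption for illustration, not a settled plan: the function names (`sha256_of`, `build_manifest`), the `notes` dictionary of hand-written descriptions, the tab-separated layout, and the SHA-256 checksum column, which is only there so that a later copy to new media could be checked against the listing.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(archive_root: Path, notes: dict[str, str]) -> str:
    """Build one line per file: relative name, checksum, description.

    `notes` maps a file's relative path (with forward slashes) to a
    hand-written description; files without one get a placeholder.
    """
    lines = []
    for path in sorted(archive_root.rglob("*")):
        if not path.is_file():
            continue
        name = path.relative_to(archive_root).as_posix()
        note = notes.get(name, "(no description yet)")
        lines.append(f"{name}\t{sha256_of(path)}\t{note}")
    return "\n".join(lines) + "\n"
```

The returned text could then be saved next to the archive (not inside `archive_root`, or it would list itself on the next run), for example with `Path("MANIFEST.txt").write_text(text, encoding="ascii")`, which raises an error if a description contains non-ASCII characters.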

Thoughts and comments are most welcome.

16 Comments to How to Make a Digital Archive?

  1. Adrian Holovaty says:

    February 1st, 2009 at 10:23 pm

    This doesn’t answer your question of file formats, but you might want to check out the Chicago-based startup called LifeSnapz (http://lifesnapz.com/) for a place to store all of the metadata. It’s kind of a life events/timeline system that lets you get pretty specific with storing photos, etc. Seems to be worth exploring at the very least.

  2. Ben Collins-Sussman says:

    February 1st, 2009 at 10:35 pm

    Looks like a cool website. But it seems to be focused on the problem of “how do I share events and timelines with people *now*”, rather than “how do I share them with future generations”. It seems to me that putting the data into a company-owned private silo is exactly the wrong strategy if I want people to see the data in 100 years!

  3. Martin Javorek says:

    February 2nd, 2009 at 4:21 am

    One question is format. If you look back at the IT world of 10-15 years ago, there were several formats that we are still able to read now. Choosing the mainstream is always better: for photos, JPG can be enough (it corresponds to the old photo you found in a box under the bed), but TIFF is better if you want to play with details. Audio/video is hard to choose in these wild times (image codecs are, from my point of view, much more stable these days than audio/video codecs). I’d choose mp3/ogg for audio, mpeg4 for video. Well documented, widespread.

    Just an idea – you can add a format description or some software to the archive, side by side with your data. Yes, that software will not work after 20 years, but after 10 years it might, and if that format becomes hard to read or use, you can find some emulator for your software or program your own reader using the format description 🙂 I used one old piece of software this way to convert data of mine more than 10 years old to something newer.

    But the second question/issue is the data storage medium. DVD? Blu-ray? Hard disk? Flash? Some tape? All these media are very limited in terms of lifetime and data loss. CDs losing their data after 10 years? (I have some more than 10 years old that read without any problem, but they are good, expensive discs stored in good conditions.) What about disks? Will they store data without problems for more than 50 years? You will probably reach the end of the disk’s data interface (IDE, SATA,…) earlier than the end of the magnetic storage itself. But that’s the problem. You should maybe recopy the whole digital archive every 10 years from one medium type to another. I call it the “digital archive loop” 🙁

    I’m storing my current photos (from now and for now) on two separate disks, completely mirrored (one in a notebook, one a network disk). Maybe it’s better to choose different types of media (DVD, disk) and store them separately (one in the house, the second buried in the garden 🙂 ).

  4. Justin Mason says:

    February 2nd, 2009 at 5:24 am

    Bruce Sterling’s “Dead Media Project” would have been relevant, but ironically its various archives have died: http://en.wikipedia.org/wiki/Dead_Media_Project

    The notes may still have some useful bits: http://www.deadmedia.org/notes/index-cat.html

    I’d argue against any format that is in any way proprietary, Quicktime especially. Stick with open formats. At least LZW-compressed TIFF is pretty standardised and open… but be warned, there _are_ proprietary extensions to TIFF, iirc.

    MPEG4 has patent problems, iirc, so is best considered borderline-proprietary. 🙁 MPEG2 would be better.

    As for text — ASCII all the way 😉

  5. Scott Plumlee says:

    February 2nd, 2009 at 6:23 am

    Perhaps saying “what’s going to be the right plan so I can read this 3 years from now, and write scripts that let me update the formats automatically at that point” would be better? I don’t think there’s any digital format that could be said to have a 100 year lifespan right now. Go with a lossless solution for everything, regardless of whether it’s a doc or photo, and convert every few years.
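
    A rough sketch of what the first step of such a script might look like: before converting anything, take stock of which formats the archive actually contains. The function name `format_inventory` and the idea of grouping by file extension are assumptions made for illustration; the conversion itself would depend on whatever tools exist at that point.

```python
from collections import Counter
from pathlib import Path


def format_inventory(archive_root: Path) -> Counter:
    """Count the files under archive_root by lowercased extension."""
    counts = Counter()
    for path in archive_root.rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

    Running this every few years and comparing the result against the list of formats that are still easy to open would show which files are due for conversion.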

    Mark Pilgrim has been down the “storage” road a couple of times – http://diveintomark.org/archives/2006/05/08/backup#comment-6406

  6. Karl Fogel says:

    February 2nd, 2009 at 12:57 pm

    Regarding storage sizes: I think it makes sense to never compromise on format or losslessness just for the sake of saving storage space. I have a terabyte hard drive sitting on my desk right now, for something like $200 (maybe it was $250). I expect my next computer to have about a terabyte internally, and my external drives to be multi-terabyte. Media is also not really an issue. It’s just bits, after all; just keep transferring them from drive to drive and backup medium to backup medium. Even when on a quantum holographic storage cube the size of a sugar crystal and holding an exabyte, the data itself is still just 1s and 0s.

    Regarding format changes: it will never be the case that the data becomes impossible to convert. Instead, it’s just a question of how expensive it is to do so. (I just last night got a DVD from my mom, which she burned from a videotape, which was converted from original 8mm movie camera tapes, of her and her family romping around in black-and-white in the mid-1940s. The only step in that conversion chain that cost any money was the 8mm->VHS step, and even that was relatively cheap — I think a commercial service did it.)

    So pick the formats that preserve the most information and that lend themselves to lossless transformation to other formats (for example, WAV for audio?).

    I don’t know much about video formats, so can’t say whether mpeg4 is a good choice technically. But (re Justin’s comment) as far as patent concerns go, why worry? You’re looking at a 100 year time frame, and patents last at the most 20 or 21 years. By the time it matters for your descendants, mpeg4 will be truly free if it’s not already.

    Also: cloud hosting. Not as your only storage option, obviously (like you said, you don’t want your only copy in a privately-owned data silo), but as an offsite backup that also provides easy access to potentially distributed descendants.

    One final thought: yes, add as much metadata as you can right now. Make subtitle files for any videos (that doesn’t affect the video, it’s an out-of-band addition that’s indexed by timing), add captions for pictures, etc. You will already find that there are people in some of those photos whom you don’t recognize, and of course you can’t ask your parents. You may have to get other relatives from that generation to identify who is who in some cases. If you don’t add that data now, it will be irrecoverable.

    It sounds like you’re already on board with all that. I’m just harping on it because I was watching that DVD from my mom and realizing that I can’t tell who most of the people in it are… but she can. Next time I’m home we’re going to go over it from start to finish, with the timing display turned on and a pad of paper at hand, and write down every person we can identify in every scene.

    Hmmm. I guess, if one were planning ahead, that this would also mean adding metadata to any photos, recordings, and videos you make now. You’re doing that, right? 🙂

    -Karl

  7. Jack Repenning says:

    February 2nd, 2009 at 4:45 pm

    I think you can safely assume that no digital format available today will still be usable in 100 years. You have only to look at the format wars over DVD formats to see that the driving forces are about exclusive commercial advantage in the next six months, not serving the end user or preserving the archives. I find efforts like the Hill Museum (http://speakingoffaith.publicradio.org/programs/2009/preserving-words/), to record manuscripts into digital form, a bit frightening. Socrates, as I’m sure you know, pointed out that paper was actually a device to free us to forget things; in the digital age, even the effort of forgetting has been taken over by the tools.

    To combat this automated Alzheimer’s, you’ll need either to repose your record in some cloud vendor, who will undertake to convert the formats as time goes on (say, a genealogy site), and/or to take advantage of some peer-to-peer participation, so the many copies have a better chance of having at least one be up-converted.

    Maybe both. Your data longevity plan, just like your data collection plan, should include lots of bugging relatives and friends for contributions, as well as seeking out new relatives and joining up genealogical fragments.

  8. John Aldridge says:

    February 2nd, 2009 at 8:21 pm

    There’s a subset of PDF which is an ISO standard for long term preservation. If you do create PDFs for any of this, it might be worth telling your authoring software to create PDF/A compliant files.

    http://en.wikipedia.org/wiki/PDF/A

  9. Joe Block says:

    February 2nd, 2009 at 8:38 pm

    Stick with lossless open formats. In addition to including an html version of the spec, you should also consider including source for a library that reads it. In 100 years, your descendants may need every hint they can get.

    In your instructions to your heirs, you should also tell them to never delete an old version, just file it under old when they update to a new version on a new drive – with the way disk sizes keep growing, no matter how much data you generate, it’ll be trivial to keep a copy in the old format in a subdirectory when you do the decade update.

    I’d go with PDF + TIFF of individual pages for documents.

  10. bittersweet.sage says:

    February 5th, 2009 at 10:26 pm

    I just wanted to emphasize the importance of the physical archive (without denigrating the importance of the digital). From a historic and personal perspective, the original documents are far more valuable. For example, I have one of the original posters from when my great-grandfather sold his farm to move to town and become a blacksmith. That peeling, dog-eared piece of paper could never be replaced.

  11. Ben Collins-Sussman says:

    February 5th, 2009 at 11:37 pm

    My dear sage: I hate to sound blasphemous, but I find basically zero sentimental value in ancient photographs and documents. The only objects that hold meaning to me are the few which “represent” the person, or were incredibly important to the person: for example, my dad’s pen. But as for the stacks of cracked photos and peeling papers, it’s only the *information* that’s valuable to me. My intent is to scan that stuff so as to preserve the information — and then likely toss the originals (if they’re in bad enough shape.) It’s simply not sustainable for each generation to packrat the stacks of the prior generation… unless we’re talking about hard disks. 🙂

  12. Eric A. Duesing says:

    February 6th, 2009 at 12:19 pm

    On a tangent for sure: Take a look at this for quick scanning the existing photos. http://www.nytimes.com/2008/08/14/technology/personaltech/14pogue.html

  13. bittersweet.sage says:

    February 9th, 2009 at 1:15 pm

    BC-S wrote: “and then likely toss the originals (if they’re in bad enough shape.)”

    That is precisely why archeologists consider garbage dumps to be gold mines of data. Future civilizations will probably learn more about us from the remnants of our landfills than from the ruins of our museums.

  14. Ben Collins-Sussman says:

    February 9th, 2009 at 2:12 pm

    I guess I’ll have to retract my “toss the originals” idea. My genealogist in-laws just threatened to kill me. 🙂

  15. steve8pi says:

    February 28th, 2009 at 12:02 pm

    Archiving application programs that read/translate/display the files may not be a bad idea. Note that there are emulators that will still run ancient programs from the early days of (personal) computing. Maybe in 100 years there will be an emulator that can simulate a popular computer that will exist in 75 years from now. And that emulated computer could in turn run another emulator that can emulate a computer that will exist about 50 yrs from now, etc… In this way, the string of bootstrapped emulators could probably execute programs to read your saved data.

  16. Rachel R. says:

    September 20th, 2009 at 3:27 pm

    I work with people who are experts on this kind of thing. Want me to ask them?