Is the Cloud the best thing for Personal Digital Preservation?

Analogies and metaphors are powerful tools that help us explain and visualize issues, but sometimes, if the wrong metaphor is chosen, it constrains and limits our ways of thinking.

Long-term preservation and protection of digital data is a problem usually associated with scientific content repositories, but as our lives become more and more digitized, it is expanding well into our personal data too.

On the infrastructure side, a number of alternatives have evolved, which can best be classified as self-hosted, community-based or Cloud-based. Each of these approaches can use parts of the others, e.g. a community-based approach can use Public or Private Cloud IaaS elements, and sometimes it is essential to do so. However, in recent years a good deal of hype has developed, which regarded the Public Cloud as a silver bullet for every infrastructure need, including preservation. Moreover, no questions or second thoughts were “permitted”, since it involved the new magic of “the Cloud”. Furthermore, archive-oriented variants such as Amazon Glacier, which promised to be usable for the preservation of files and digital objects, started to appear.

In this post I will examine the following question: is a Cloud-only approach to digital preservation the best way to do things? Are the characteristics of the Cloud really appropriate for this kind of service, at this point in time at least?

In my opinion, the Cloud is really like a power plug on the wall, but for digital preservation you really need, in the long run, a vault.

The Cloud, like its predecessor in computing hype, the “Grid”, was actually conceived by analogy with the electrical power generation and distribution grid. The goal was simple: get computing and storage capacity on demand from anywhere, no strings attached. The analogy is quite accurate, since most of the characteristics of current Cloud technologies are similar to those of the electrical grid:

  • Energy, like computing and storage nodes in the Cloud, can be generated and delivered on demand.
  • Plug in a device and you get the same service (more or less) everywhere on the planet. Create a virtual machine and, once the proper tools are installed, the service you get is the same on Amazon AWS or on Microsoft Azure (even if vendors are trying to differentiate themselves and essentially lock in customers).
  • Also, an electrical plug provides the same service universally: change the plug and you still get 220 V, 50 Hz electrical current. The same holds for a virtual Linux machine.

Well, the analogy is correct, but it also shows us the limitations that a Public Cloud infrastructure has with respect to preservation.

Unlike with the electrical grid, changing providers or plugs is not as easy as you might initially think. Moving data between clouds has networking costs, and vendors, as mentioned above, try to lock users in. Luckily, there are standards and tools (VM images, Puppet, etc.) that can be used to facilitate the process.

On the other hand, the electrical power grid has a rather unsettling property that we usually forget, and the same is true for the computing Cloud:

Electrical energy cannot be stored in the grid. It is available at the moment of its creation, but not for later consumption. Likewise, no one but you, as a person or an organization, is held responsible for the persistence of your data in case a cloud vendor goes out of business. Cloud computing is dynamic; it is not persistent in the same way a hard disk or a tape on a shelf is.

This means that, like the electrical energy service, which has no persistent character, Cloud computing has little inherent long-term aspect. While this is not a problem for running VMs, for long-term archiving and preservation it means risky business. The lack of transparency about the underlying infrastructure and the subscription-based model do not help preservation either.

Preservation is not something you need, or can plan, for short-term demand. Preservation is like a dynamic vault, which is nothing like a power plug in the wall.

So are externally run cloud resources a good alternative for bit-wise preservation? My feeling, and I believe that of the Open Repositories community, is that we can use the Cloud as we use the electrical grid, but for preservation we need to build different systems and infrastructures, with low running costs, simplicity and built-in rigidity or even antifragility. Also, for organizational and financial rigidity, these systems had better include a community perspective. In the digital repositories case, most of these aspects already hold for communities like LOCKSS, or even collective organizations like Portico.

On the infrastructure side, we should remember that the Public Cloud, like the power grid, is not a vault, and for preservation we really need a vault: a rigid structure with little on-demand interaction, no need for subscriptions and no single points of organizational or financial failure.

The current mode of thinking about Digital Preservation infrastructures, as I was able to discuss it at Open Repositories 2014, is similar to the above. We see an increased use of Public Cloud infrastructure resources for day-to-day operations, but for preservation to take place we still need self-hosted resources (an excellent 24X7 rant is available here) or, essentially, community-backed resources, even if some of them use a public cloud infrastructure on the backend, e.g. DuraCloud or MetaArchive.

Initially appeared here.