A common approach in IT is to take a backup or clone of a server and put it somewhere ‘safe’ for later use in the event of disaster. I’ve seen this waved around as a viable option many a time, the idea being that if something happens to the live system, the clone can be wheeled out to replace it quickly. It’s a bit-for-bit copy of the original – what’s not to like about that idea?
Back in the day
Way back, when things moved slowly, programs were monolithic and systems were far less dependent on other parts, this was a viable approach. You could rebuild from a master copy, your OS probably hadn’t had a release in a year, your license didn’t cover upgrades anyway, and the applications installed were on a similar lifecycle.
But that all changed.
Now, this approach is flawed. Everything is made of more discrete parts than before – code is re-used, and systems depend on shared common components. All these separate projects update and upgrade as time goes on: security holes are discovered and fixed, bugs are repaired, functionality changes. So if you’re not keeping up, you’ll accumulate bugs and security holes and, more to the point, your system could stop working!
“Software rots if not used”
A cloned or backed-up system that is untested, untried and unmodified will rot. From the moment you take the copy, it starts to drift from where you are. This can be acceptable if you are making a specific change and need a snapshot to roll back to – restoring it undoes just that change.
But keep it any longer and you run the risk that the system will not operate as required – and will probably fail in unexpected and unknown ways.
It may fail to connect to some remote service because that service has upgraded its authentication mechanism, or the ciphers in its TLS configuration have moved on to something more secure. You suddenly have no idea how many problems you have or how long they’ll take to fix. How many patch cycles is it behind? You could end up re-doing multiple patch cycles and configuration changes just to get back into operation.
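As a concrete illustration, here’s a minimal Python sketch (the hostname is a placeholder) that reports which TLS version and cipher a remote service will actually negotiate – one cheap way to spot this kind of drift before a restored clone trips over it:

```python
import socket
import ssl

# Placeholder endpoint - substitute the remote service you depend on.
HOST, PORT = "example.com", 443

context = ssl.create_default_context()
with socket.create_connection((HOST, PORT), timeout=5) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        # Report the protocol version and cipher actually negotiated.
        print(f"{HOST} negotiated {tls.version()} with cipher {tls.cipher()[0]}")
```

If your years-old clone only speaks ciphers the remote end has since disabled, a check like this tells you before the 3am restore does.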
[Aside: Of course, you can mitigate this by taking more frequent backups and incrementals, but that’s a discussion for another day]
What if the restore fails? Does anyone know how to build that system from scratch? Is the documentation up to date (and how do you update docs that were used once years ago)?
And we haven’t even talked about the data backups. You are keeping separate data and configuration/system backups at least, aren’t you? Aren’t you? [I digress, this is a whole other discussion]
Bootstrapping
The way around this is to pull yourself up by the bootstraps – building systems that build themselves.
The world of IT has moved into the infrastructure-as-code realm over the last few years, where system configurations are specified in some domain-specific language that is run by a machine to produce more machines. It’s a logical step on from using a high-level language to write machine code that runs on a processor; like most things, at some point it became viable to extend the idea to compiling infrastructure. It’s given rise to a set of infrastructure management tools like Ansible, Terraform and Puppet, and a whole new skillset that combines the wisdom of an infrastructure engineer with the coding skills of a developer.
What these tools do is manage configuration: they’re a production line for building systems. Their raw material inputs are things like software packages, operating system binaries, bare-metal machines or cloud tenancies. They combine these with configuration and data, and produce working systems at the other end.
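To make that concrete, here’s a deliberately toy sketch in Python of the idea these tools share – it is not any real tool’s syntax or API, and the package names and paths are assumptions, but it shows the core pattern: declare the desired state, then let an engine converge the machine towards it, idempotently.

```python
import subprocess
from pathlib import Path

# Toy desired-state model - a simplified sketch of the idea behind tools
# like Ansible or Puppet, not any real tool's API. Names are illustrative.
desired_packages = ["nginx", "postgresql"]
desired_files = {
    Path("/etc/motd"): "Managed by the build pipeline - do not edit by hand.\n",
}

def installed(package: str) -> bool:
    """Check whether a package is present (a Debian-style system is assumed)."""
    result = subprocess.run(["dpkg", "-s", package], capture_output=True)
    return result.returncode == 0

def converge() -> None:
    # Idempotent: each step inspects current state and only acts on drift,
    # so re-running the build is always safe.
    for package in desired_packages:
        if not installed(package):
            subprocess.run(["apt-get", "install", "-y", package], check=True)
    for path, content in desired_files.items():
        if not path.exists() or path.read_text() != content:
            path.write_text(content)

if __name__ == "__main__":
    converge()
```

Real tools add inventories, ordering, templating, secrets handling and much more, but the core loop – compare desired state with actual state and fix the difference – is the same.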
The important bit here is that they are live: they build the system as it needs to be today – patched, up to date, tested and ready to operate. No need to restore a system from an old backup; use your production mechanism aggressively and rebuild every time.
By doing this you achieve several goals:
- You know how to build your system from scratch every time – it’s documented in the code
- Your documentation-for-humans can concentrate on the purpose, intention and how to run the build system rather than specifying the myriad steps that the machine can handle better
- You reduce your backup requirement to your application data, which is now separated from your application code, system binaries, disk layouts and so on (see the sketch after this list)
- You have reduced your backup footprint! [There is as yet no escape from requiring backup management]
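For instance, once everything except the data is reproducible from the code, the backup job can shrink to something as simple as this Python sketch – the paths are placeholders for your own layout:

```python
import tarfile
from datetime import date
from pathlib import Path

# Placeholder paths - adjust for your own layout. The point is that only
# the application data needs backing up; code, binaries and system
# configuration are all reproducible from the infrastructure-as-code repo.
DATA_DIR = Path("/srv/app/data")
BACKUP_DIR = Path("/backups")

def backup_data() -> Path:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    archive = BACKUP_DIR / f"app-data-{date.today().isoformat()}.tar.gz"
    with tarfile.open(archive, "w:gz") as tar:
        # Archive the data directory only - everything else gets rebuilt.
        tar.add(DATA_DIR, arcname=DATA_DIR.name)
    return archive

if __name__ == "__main__":
    print(f"Wrote {backup_data()}")
```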
So, once you start to industrialize your infrastructure-as-code, don’t just tiptoe around it, use it aggressively. Maximize the benefits!