A recurring theme in our projects is a desire to “fix things so they stay fixed”. I plan to write about that idea in more detail later, but for now I’ll start with an example of how to do it.
A common and useful thing to do with disk storage space is to keep old copies of important data around. For example, we might keep the last 15 days of nightly backups of a database. This is easy to set up and helpful to have around. Unfortunately, sooner or later we discover that the process of copying a new backup to a disk managed this way fails because the disk is full: the backup files have grown to the point where 15 old ones plus a new one no longer fit.
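To make the failure mode concrete, here is a minimal sketch of the usual fixed-N cleanup in Python, assuming a hypothetical layout where nightly dumps land in /backups as files matching db-*.dump; the directory, filename pattern, and count are illustrative, not taken from any particular system.

```python
import pathlib

# Hypothetical layout: nightly dumps land in /backups as db-YYYY-MM-DD.dump.
BACKUP_DIR = pathlib.Path("/backups")
KEEP = 15  # the fixed N that will eventually stop fitting on the disk

def prune_fixed_n():
    # Newest first by modification time; everything past the first KEEP is deleted.
    backups = sorted(BACKUP_DIR.glob("db-*.dump"),
                     key=lambda p: p.stat().st_mtime,
                     reverse=True)
    for old in backups[KEEP:]:
        old.unlink()

if __name__ == "__main__":
    prune_fixed_n()
```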
How will we fix this?
Idea #1: Reduce 15 days to 10 days. Great, now it doesn’t fail for a while… but eventually it fills up with 10 of the now-larger files. It didn’t stay fixed.
Idea #2: Buy a bigger disk (maybe a huge disk, if money is abundant). A while later, it fills up. It didn’t stay fixed.
Idea #3: Set up an automated monitoring system, so that someone is informed when the disk is getting close to full. This is a big improvement, because hopefully someone will notice the monitor message and adjust it before it fails. But to me, it is not “fixed to stay fixed” because I will have to pay someone to adjust it repeatedly over time.
Idea #4: Sign up for Amazon S3, so we can store an unlimited number of files, of unlimited size. This will probably stay fixed from a technical point of view, but it is highly broken in the sense that you get a larger and larger S3 invoice, growing without limit. To me, this means it didn’t stay fixed.
Idea #5: Dynamically decide how many old backups to keep.
The core problem with the common design I described above is the fixed N of old files to keep. The solution is to make that number dynamic; here is one way to do that (a code sketch follows the list):
- Make the old-file-deletion process look at the sizes of the most recent few files, and treat the largest of those plus some percentage of headroom as the likely maximum size of a new file.
- Compare that to the free space.
- If there is not enough free space, delete the oldest backup.
- Loop back and try again.
- Be careful with error checking, and put in some lower limit of how many files to preserve (perhaps 2 or 3).
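Here is a minimal sketch of that loop, under the same hypothetical layout as above; the recent-file window, the 20% headroom, and the lower limit of 3 are illustrative numbers, not prescriptions.

```python
import shutil
import pathlib

# Same hypothetical layout as before; the numbers here are illustrative.
BACKUP_DIR = pathlib.Path("/backups")
MIN_KEEP = 3       # never delete down past this many old backups
HEADROOM = 0.20    # assume the next backup may be ~20% larger than recent ones

def prune_dynamic():
    # Newest first by modification time.
    backups = sorted(BACKUP_DIR.glob("db-*.dump"),
                     key=lambda p: p.stat().st_mtime,
                     reverse=True)
    if not backups:
        return

    # Estimate the next file's size: the largest of the most recent few, plus headroom.
    recent = backups[:5]
    estimated_next = max(p.stat().st_size for p in recent) * (1 + HEADROOM)

    # Delete oldest-first until the estimate fits in free space,
    # but never drop below the minimum number of copies.
    while len(backups) > MIN_KEEP:
        if shutil.disk_usage(BACKUP_DIR).free >= estimated_next:
            break
        oldest = backups.pop()  # list is newest-first, so the last entry is the oldest
        oldest.unlink()

if __name__ == "__main__":
    prune_dynamic()
```

The point of the design is that nothing in it depends on a fixed count; retention adapts to whatever the files and the disk are actually doing.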
Like all mechanisms, this one has limits. Eventually the daily file size may grow so large that it’s no longer possible to keep even a few copies on the disk, so in that sense it does not stay fixed. But it does stay fixed all the way up to the limit of the hardware, with no human intervention.
Agreed, this is an important meta-idea. In fact, I think the level to which a team (or individual) grasps this and goes after it is probably a good measure of how successful they will be in maintaining the system over time.
When I was at NFJS this weekend, I saw a talk from Michael Nygard on “designing for ops”. One thing that resonated with me is that when support problems come up, they want the novelty of those problems to be high; in other words, the same problem should not keep recurring.
Seems to me there are multiple choices when you face a recurring operations problem:
1) develop manual response procedure
2) develop automated response procedure
3) develop an environmental fix (tweaking)
4) develop a temporary patch (something to tide you over till next release)
5) develop a permanent fix
These are ordered so that each is closer to “fix it forever” than the one before. The correct choice is usually a tradeoff among urgency, expediency, and difficulty, of course.