Since there is still a lot of work to be done (as well as some theoretical issues to overcome) before redundant servers are ready, the backups obviously need to be significantly improved right away. Here's what I've come up with so far:
1. I've already set the off-site backups to take place every 4 hours. Unfortunately, the server that went down failed about 23 hours into the 24-hour backup cycle (i.e. at the worst possible time). The new 4-hour schedule immediately and drastically cuts down on the risk period. (This applies to all Powerblogs servers.)
2. The new server is more powerful than the server which went down, and in particular has two identically sized hard drives (larger than on the previous server, too). I'm going to set up a program to mirror the entire server onto its second hard drive, so if something goes wrong with the primary hard drive, we can immediately reboot to the secondary hard drive, minimizing downtime. I'm thinking that to start we'll sync it every hour. Tom, the Powerblogs Idea Rat (his semi-official title), is doing research into whether we can use realtime filesystem change information to bring the syncing down to something like every minute. (This will only apply to the new server.)
3. I had forgotten that I purposely enabled the mailing list archives with the idea that they could serve as a last-ditch backup. Unfortunately, we ended up having to use that last-ditch safety net (not something that should ever happen), so I want to improve this concept. I'm going to set up an off-site mail server and modify the Powerblogs software to email the full post information in machine-readable form (suitable for use in automatic restoring) to the off-site backup before it does anything else. This will make the off-site backups for the posts genuinely real-time, or at least very, very, very close to it. (This will apply to all servers.)
4. When we get the old server back (and test it thoroughly), I'll keep it around with a full install and configuration of the Powerblogs software, ready to take on the role of another server if anything goes wrong, until it becomes a redundant sibling of the newest server.
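For illustration only, here is a minimal Python sketch of the kind of one-way sync that the hourly mirror to the second hard drive implies: copy whatever is new or changed, and remove whatever has disappeared from the primary. It is not the actual Powerblogs tooling. The function names and paths are hypothetical, and the sketch ignores symlinks, ownership, open files, and everything needed to make the copy bootable.

```python
import os
import shutil
from pathlib import Path


def needs_copy(src_file: Path, dst_file: Path) -> bool:
    """Return True if dst_file is missing or differs in size or mtime."""
    if not dst_file.exists():
        return True
    s, d = src_file.stat(), dst_file.stat()
    return s.st_size != d.st_size or int(s.st_mtime) != int(d.st_mtime)


def mirror(src_root: str, dst_root: str) -> dict:
    """One-way mirror of src_root onto dst_root; returns counts of actions."""
    src, dst = Path(src_root), Path(dst_root)
    copied = removed = 0

    # Pass 1: copy anything new or changed from the primary to the mirror.
    for dirpath, _dirnames, filenames in os.walk(src):
        rel = Path(dirpath).relative_to(src)
        (dst / rel).mkdir(parents=True, exist_ok=True)
        for name in filenames:
            s, d = src / rel / name, dst / rel / name
            if needs_copy(s, d):
                shutil.copy2(s, d)  # copy2 also preserves the modification time
                copied += 1

    # Pass 2: remove files and directories that no longer exist on the primary.
    for dirpath, dirnames, filenames in os.walk(dst, topdown=False):
        rel = Path(dirpath).relative_to(dst)
        for name in filenames:
            if not (src / rel / name).exists():
                (dst / rel / name).unlink()
                removed += 1
        for name in dirnames:
            if not (src / rel / name).exists():
                shutil.rmtree(dst / rel / name, ignore_errors=True)

    return {"copied": copied, "removed": removed}
```

Because unchanged files are skipped, a second run right after the first copies nothing, which is what keeps a frequent schedule affordable. Note that a mirror like this also faithfully replicates mistakes (such as a runaway file), so it complements the off-site backups rather than replacing them.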
#1 is done already. #2 will not take long to set up, though the downside is that it will require scheduled downtime in order to test. #3 will take a little longer to implement, but it won't take very long. I'm guessing that it will take about a day or two to get the first version of the full-system syncing to the second hard drive working. I should have the email code and email receptacle up in about a week. (Please note, since I screwed up with this before, my estimates are not guarantees, and please assume that anything that I talk about is not implemented until I explicitly say that it's finished and live.)
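To make the email idea concrete, here is a rough Python sketch of emailing a post in machine-readable form before doing anything else, and of restoring it from the raw message later. Everything specific in it is an assumption made for illustration: the field names, the JSON encoding, the subject format, and the addresses are hypothetical, and the real Powerblogs code may look nothing like this.

```python
import json
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_backup_message(post: dict, sender: str, recipient: str) -> EmailMessage:
    """Wrap one post as a plain-text email whose body is a JSON document."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    # Hypothetical subject format, just to make messages easy to find.
    msg["Subject"] = f"post-backup blog={post['blog']} id={post['id']}"
    # ensure_ascii keeps the body 7-bit safe; non-ASCII text becomes \uXXXX.
    msg.set_content(json.dumps(post, sort_keys=True, ensure_ascii=True))
    return msg


def restore_post(raw: bytes) -> dict:
    """Recover the post dictionary from a raw backup email."""
    msg = message_from_bytes(raw, policy=policy.default)
    return json.loads(msg.get_content())


def save_post(post: dict, send, store, sender: str, recipient: str) -> None:
    """Email the backup first, then write the post to primary storage."""
    send(build_backup_message(post, sender, recipient))
    store(post)


if __name__ == "__main__":
    sent = []
    saved = []
    post = {"blog": "example-blog", "id": 1, "title": "Hello", "body": "First!"}
    # In production, `send` might wrap smtplib.SMTP(host).send_message(msg),
    # with `host` being the off-site mail server (hypothetical here).
    save_post(post, sent.append, saved.append,
              "backup@example.com", "archive@example.com")
    print(restore_post(sent[0].as_bytes()) == post)
```

Passing `send` and `store` in as callables keeps the ordering (backup first, storage second) visible in one place and makes it easy to test without a real mail server. What should happen when the off-site mail server is unreachable is a design decision this sketch leaves open.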
The long-term plan is for redundant peer-to-peer servers that will truly have no single points of failure and can operate both together, with load balancing and realtime syncing, and independently, with resyncing afterwards. There are still a few practical problems with this that need addressing, and a few theoretical ones, but I think that it's doable and is only a few months away.
Comments and suggestions about these backup plans would be appreciated. One of the problems that this has exposed is that Powerblogs operates on pretty thin margins (given the cost of bandwidth, the bandwidth that accounts come with, the cost of disk space, the size of the chunks in which we have to buy these things in order to get good prices, and the infrastructure to do development), which makes reliability enhancements like spare servers and RAID for primary storage somewhere between difficult and not doable. Now, Powerblogs has been especially unlucky (not counting the very brief outage a few weeks ago when an errant program filled up the hard drive; ironically, that would have taken out even redundant servers, since the 100GB file would have been replicated to them both), but bad luck can be overcome with money. I'd be especially interested to know how users would feel about increased prices in order to pay for higher-end hardware (with RAID to guard against disk failure, more RAM and CPU for better performance, etc.), faster connections to the off-site backup, etc. For example, if Powerblogs doubled prices, we should be able to afford to rent a dual 3.2GHz Xeon with 4GB of RAM and four 250GB SATA drives in RAID 0+1 for 500GB of usable storage. (RAID 0+1 means fast reads and the data is always on two drives at any time, so that a single drive failure won't impact the system's uptime at all.) It would be blazingly fast, handle high loads very well, and be quite reliable (there's also the effect in computers that the more the computer costs, typically the higher the quality of all of the parts in it).
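As a sanity check on the RAID numbers above, here is a small Python sketch of the arithmetic. It assumes the usual RAID 0+1 layout (two striped sets of equal-sized drives that mirror each other) and ignores controller behavior and rebuild time. The function names are made up for this example.

```python
from itertools import combinations


def raid01_usable_gb(drive_count: int, drive_gb: int) -> int:
    """Usable space: half the drives hold data, the other half mirror it."""
    if drive_count < 4 or drive_count % 2:
        raise ValueError("RAID 0+1 needs an even number of drives, at least 4")
    return (drive_count // 2) * drive_gb


def raid01_survives(failed: set, drive_count: int) -> bool:
    """The array stays up while at least one striped set is fully intact."""
    half = drive_count // 2
    set_a = set(range(half))
    set_b = set(range(half, drive_count))
    return not (failed & set_a) or not (failed & set_b)


def surviving_combinations(drive_count: int, failures: int) -> tuple:
    """Count how many failure combinations the array survives, and the total."""
    combos = list(combinations(range(drive_count), failures))
    ok = sum(raid01_survives(set(c), drive_count) for c in combos)
    return ok, len(combos)


if __name__ == "__main__":
    print(raid01_usable_gb(4, 250))      # usable GB from four 250GB drives
    print(surviving_combinations(4, 1))  # single-drive failures survived
    print(surviving_combinations(4, 2))  # double-drive failures survived
```

Under these assumptions, four 250GB drives give 500GB of usable space and every single-drive failure is survivable, but only two of the six possible two-drive failures are. RAID therefore guards against disk failure without replacing the off-site backups.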
I would greatly appreciate it if subscribers could leave comments on whether you'd want to pay more for a better system and higher reliability, or whether you'd prefer going the less expensive route and doing the best that we can with what we have. What do you guys want? Where do you think that we should go?
Update: I've made the initial copy of the data onto the second hard drive in the server. I'll be working on setting up the scheduled syncing to the second hard drive tomorrow. Within a few days, I hope to have the post-email-backups going, and within two weeks, we might have two off-site locations that will be getting the posts emailed to them.
Update: I've been working on the syncing, and unfortunately it's not as fast as I want it yet. I'm periodically syncing it manually, and before too long I should have it scheduled to do the syncing. I'm also working on some changes to the Powerblogs code that will let the syncs go faster. (The reason for the concern over speed is that the syncing places some stress on the server, and while reader page loads should still be pretty quick, the Powerblogs interface itself will be slowed down a bit. I don't want improved reliability to come at the expense of increased frustration.)