PowerBlogs.Com Development

Improved backups (News)

Since there is still a lot of work to be done (as well as some theoretical issues to overcome) with redundant servers, we obviously need significantly improved backups immediately. Here's what I've come up with so far:

  1. I've already set the off-site backups to take place every 4 hours. Unfortunately, the problem with the server that went down happened about 23 hours into the 24 hour backup cycle (i.e. at the worst possible time). Backing up every 4 hours immediately and drastically cuts down the risk period. (This applies to all Powerblogs servers.)

  2. The new server is more powerful than the server which went down, and in particular has two identically sized hard drives (larger than on the previous server, too). I'm going to set up a program to mirror the entire server onto its second hard drive, so if something goes wrong with the primary hard drive, we can immediately reboot to the secondary hard drive, minimizing downtime. I'm thinking that to start we'll sync it every hour. Tom, the Powerblogs Idea Rat (his semi-official title), is doing research into whether we can use realtime filesystem change information to bring the syncing to something like every minute. (This will only apply to the new server.)

  3. I had forgotten that I purposely enabled the mailing list archives with being a last-ditch backup in mind. Unfortunately, we ended up having to use the last-ditch safety net — not something that should ever happen — so I want to improve this concept. I'm going to set up an off-site mail server and modify the powerblogs software to email the full post information in machine-readable form (suitable for use in automatic restoring), before it does anything else, to the off-site backup. This will make the off-site backups for the posts genuinely real-time, or at least very, very, very close to it. (This will apply to all servers.)

  4. When we get the old server back (and test it thoroughly), until it's a redundant server sibling with the newest server, I'll keep it around with a full install and configuration of the powerblogs software, ready to take on the role of another server if anything goes wrong.
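To put item 1 in numbers, here is a rough sketch. It assumes a backup completes instantly at every multiple of the interval, counted from the same starting point; the function name is made up for illustration.

```python
def hours_at_risk(hours_since_cycle_start, interval_hours):
    """Hours of posts with no off-site copy if the server dies right now.

    Assumes a backup finishes instantly at every multiple of
    interval_hours. Real backups take time, so treat this as a lower bound.
    """
    return hours_since_cycle_start % interval_hours


# The failure described above happened about 23 hours into the cycle.
print(hours_at_risk(23, 24))  # with the old daily backups
print(hours_at_risk(23, 4))   # with backups every 4 hours
```

The worst case is always just under one full interval, so the 4 hour schedule caps the exposure at a sixth of what the daily schedule allowed.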

#1 is done already. #2 will not take long to set up, though the downside is that it will require scheduled downtime in order to test. #3 will take a little longer to implement, but it won't take very long. I'm guessing that it will take about a day or two to get the first version of the full-system syncing to the second hard drive. I should have the email code and email receptacle up in about a week. (Please note, since I screwed up with this before, my estimates are not guarantees, and please assume that anything that I talk about is not implemented until I explicitly say that it's finished and live.)
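For #3, a minimal sketch of what "email the full post information in machine-readable form" could look like. Everything specific here is an assumption for illustration, not the actual Powerblogs code: the post is a plain dict, the format is JSON, and the addresses and mail host are placeholders.

```python
import json
import smtplib
from email import message_from_bytes, policy
from email.message import EmailMessage


def build_backup_email(post, sender, recipient):
    """Pack one post into an email whose body is the post as JSON."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = "post-backup %s %s" % (post["blog"], post["id"])
    # sort_keys makes the body deterministic; json.dumps escapes non-ASCII
    # by default, so the body survives any mail relay along the way.
    msg.set_content(json.dumps(post, sort_keys=True))
    return msg


def send_backup(msg, smtp_host):
    """Hand the message to the off-site mail server (host is a placeholder)."""
    with smtplib.SMTP(smtp_host) as smtp:
        smtp.send_message(msg)


def restore_post(raw_message):
    """Recover the post dict from the raw bytes of a stored backup email."""
    msg = message_from_bytes(raw_message, policy=policy.default)
    return json.loads(msg.get_content())


# Round trip without touching the network.
post = {"blog": "dev", "id": 1234, "title": "Improved backups", "body": "Hello"}
raw = build_backup_email(post, "blog@example.com", "backup@example.com").as_bytes()
print(restore_post(raw) == post)
```

The point of calling something like send_backup before anything else is that the off-site copy exists even if the server dies halfway through saving the post locally.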

The long-term plan is for redundant peer-to-peer servers which will truly have no single points of failure and can operate both together with load balancing and realtime syncing, and independently with resyncing. There are still a few practical problems with this that need addressing, and a few theoretical ones, but I think that it's doable and will only be a few months away.

Comments and suggestions about these backup plans would be appreciated. One of the problems that this has exposed is that Powerblogs operates on pretty thin margins (given the cost of bandwidth, the bandwidth that accounts come with, the cost of disk space, the size of the chunks that we have to buy these in to get good prices, and the infrastructure to do development), which makes reliability enhancements like spare servers and RAID for primary storage somewhere between difficult and not doable. Now, Powerblogs has been especially unlucky (not counting the very brief outage a few weeks ago when an errant program filled up the hard drive; ironically, that would have taken out even redundant servers, since the 100GB file would have been replicated to them both), but bad luck can be overcome with money. I'd be especially interested to know how users would feel about increased prices in order to pay for higher-end hardware (with RAID to guard against disk failure, more RAM and CPU for better performance, etc.), faster connections to the off-site backup, etc. For example, if Powerblogs doubled prices, we should be able to afford to rent a dual 3.2GHz Xeon with 4 GB of RAM and four 250GB SATA drives in RAID 0+1 for 500GB of usable storage. (RAID 0+1 means fast reads, and the data is always on 2 drives at any time, so that a single drive failure won't impact the system's uptime at all.) It would be blazingly fast, handle high loads very well, and be quite reliable (there's also the effect in computers that the more the computer costs, the higher the quality of all of the parts in it typically is).
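The storage figure in that example works out as follows. This is just the arithmetic for the usual RAID 0+1 layout (two striped sets that mirror each other, so half the raw capacity is usable); the function name is made up.

```python
def raid01_usable_gb(drive_count, drive_size_gb):
    """Usable capacity of a RAID 0+1 array built from identical drives."""
    if drive_count < 4 or drive_count % 2:
        raise ValueError("this sketch expects an even number of drives, at least 4")
    # Half the drives hold the striped data, the other half hold the mirror.
    return (drive_count // 2) * drive_size_gb


print(raid01_usable_gb(4, 250))  # 500
```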

I would greatly appreciate it if subscribers could leave comments on whether you'd want to pay more for a better system and higher reliability, or whether you'd prefer going the less expensive route and doing the best that we can with what we have. What do you guys want? Where do you think that we should go?

Update: I've made the initial copy of the data onto the second hard drive in the server. I'll be working on setting up the scheduled syncing to the second hard drive tomorrow. Within a few days, I hope to have the post-email-backups going, and within two weeks, we might have two off-site locations that will be getting the posts emailed to them.

Update: I've been working on the syncing, and unfortunately it's not as fast as I want it yet. I'm periodically syncing it manually, and before too long I should have it scheduled to do the syncing. I'm also working on some changes to the Powerblogs code that will let the syncs go faster. (The reason for the concern over speed is that the syncing places some stress on the server, and while reader page loads should still be pretty quick, the Powerblogs interface itself will be slowed down a bit. I don't want improved reliability to come at the expense of increased frustration.)

Posted by Chris on 12.29.2005. (7 Comments)
Server Problems (News)

Something is wrong with one of the powerblogs servers, and we're currently working on figuring out what and resolving it. I'll keep you up to date as I get more information.

Update: The problem appears to be with the server's hardware. While the people at EV1's datacenter are working on fixing that, I've rented another server and will be restoring it from backups (we have a backup current as of early this morning EST). If I'm done before the other server is fixed, it will become the new primary server and the other server will become its backup when it's fixed. If the current server is fixed first, this new server will become a backup for the old server. I've gotten a lot of progress done on having the software support the live backup system that I described earlier, so when I've got two servers, I'll push to get the rest done ASAP (the log reports were a major bottleneck, but the new off-server generation method will now only take a little bit of modification to adapt to that problem).

Update: So it appears to be a hard drive problem. I've put in to have an EV1 specialist look at the machine and replace the hard drive if necessary. In the meantime, we're working to get the new server up and loaded with the backups. My guess now is that I'll have that up before the older server is fixed. For people who set their DNS statically (rather than using a CNAME), once the new server is up we'll have to modify your DNS. (Once the new server is up, youraccount.powerblogs.com will work immediately, as will anyone whose DNS is a CNAME.) I'll keep you up to date on the progress with the new server we've ordered and are currently configuring.

Update: We've got the new server, and have done the base config. I'm working on turning it into a Powerblogs server at the moment, which is coming along well. Soon, I'll be testing it out. Once I've tested it, I'll start restoring from this morning's backups.

Update: I believe that I've finished configuring the new server as a Powerblogs server. I've been testing it, and the tests are looking good now. I'm just about ready to start restoring from this morning's backups.

Update: I've done tests, and the server config definitely appears correct. I'm uploading the backup data to the server. As soon as it's up, I'll begin restoring from it.

Update: Reloading the backup data is going well. I've got the main database up and have done some initial integrity checks and it looks good. I'm still working on getting auxiliary data (stylesheets, uploaded files, etc) up and restored. Once I'm finished with that, I have to republish the blog pages, do a final check to make sure everything is ready, and modify the DNS settings, but once that's finished we'll be good to go.

Update: Oh, FYI, I still haven't heard back from the EV1 "systems support specialist" about the old server. Six hours ago, they said that they were going to begin investigating shortly.

Update: I've got all of the data uploaded and am currently regenerating the blogs. After this, I need to verify that everything's correct and change the DNS, and we'll be all set.

Update: The regeneration is moving along. I've been doing some testing and things are looking good so far. It shouldn't be too long after the regeneration that we go live. I've got a few more tests to do, and then I might start changing the DNS for blogs as they finish their regeneration individually.

Update: The regeneration is going pretty well. Only volokh and whiteperil are left. If your DNS isn't working, please send an email to support@powerblogs.com and I'll investigate. I think that most people on this server are set up to automatically pick up the DNS change.

Also, I've heard from EV1 and it looks like we might be able to recover the data from the hard drive. I'll let you know more on that front as soon as they tell me.

Update: The regeneration is very nearly complete, and nearly everything is back up again. Scheduled posting is not yet enabled, however, and probably won't be until tomorrow. For various technical reasons, the restore re-created posts which were saved for later. I need to write a quick program to cull the posts which are saved for later but were already published, and I'm too tired at the moment to trust myself to do that correctly. I should have that up some time tomorrow. I'm going to bed now, but the DNS for the remaining blogs has been set and I've done a partial regeneration on them, so their front page works and they can be used. That means that as of now, all Powerblogs blogs can be used, except for scheduling posts (you can schedule them, they just won't be published until some time tomorrow).

Update: I'm working on fixing the problem where saved posts that were published have been "resurrected". I hope to have the redundant saved posts removed within an hour or so.
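A culling pass of this kind might look roughly like the sketch below. The data model is an assumption made for illustration (posts as dicts, with a saved post and its published version sharing an id); the real Powerblogs storage is not described here.

```python
def cull_resurrected(saved_posts, published_ids):
    """Split saved-for-later posts into (keep, cull).

    A saved post goes to the cull list when its id already appears among
    the published posts, meaning the restore brought a stale copy back.
    """
    keep, cull = [], []
    for post in saved_posts:
        if post["id"] in published_ids:
            cull.append(post)
        else:
            keep.append(post)
    return keep, cull


saved = [
    {"id": 101, "title": "already went out"},
    {"id": 102, "title": "still a draft"},
]
keep, cull = cull_resurrected(saved, published_ids={101, 55})
print([p["id"] for p in keep], [p["id"] for p in cull])
```

Returning the cull list instead of deleting right away leaves room to look it over first, which matters when a wrong delete would lose someone's draft.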

EV1 has indicated that prospects for file recovery look hopeful, but that it will take some time. They said that they should have further results for me in the late afternoon (I believe that that's Texas time). I should mention, though, that even if all of the data from the drive can be recovered, it is possible that a post could have been caught at exactly the wrong time so that it was never written to disk in the first place.

I also just realized another layer of backup which we have — the mailing list archives. You can get to them at http://powerblogs.com/pipermail/yourhostname/ — any lost posts might be there, since that's hosted on a different machine from the one that went down.

Update: The mailing list archives are looking to be a real saver from this disaster. If you may have lost any posts that were made after the last daily backup was taken, check your mailing list archive. Its URL is http://powerblogs.com/pipermail/{yourhostname}/2005-December/date.html; for example, the archive for dev.powerblogs.com is http://powerblogs.com/pipermail/dev/2005-December/date.html
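If you want to build that URL for another month, the pattern can be sketched like this. The rule for turning a hostname into the archive name (dropping ".powerblogs.com") is inferred from the dev.powerblogs.com example and may not hold for blogs on their own domains.

```python
MONTHS = (
    "January", "February", "March", "April", "May", "June", "July",
    "August", "September", "October", "November", "December",
)


def archive_url(hostname, year, month):
    """Build the by-date mailing list archive URL for one month."""
    suffix = ".powerblogs.com"
    # Assumption: the archive is named after the part before the suffix.
    if hostname.endswith(suffix):
        name = hostname[: -len(suffix)]
    else:
        name = hostname
    return "http://powerblogs.com/pipermail/%s/%d-%s/date.html" % (
        name, year, MONTHS[month - 1],
    )


print(archive_url("dev.powerblogs.com", 2005, 12))
```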

Update: I'm nearly done culling the saved posts, and will very shortly re-enable the scheduled poster.

Update: I'm done culling the saved posts, and have re-enabled the scheduled publisher. Scheduled posts should be good from here on out.

Update: For some blogs, the trackbacks to their posts didn't restore correctly. The trackback data is not lost, and I'm working on restoring that.

I'd also like to take this opportunity, now that things are calming down, to give my heartfelt apologies that it was not handled better and more swiftly, and that it caused so much grief, frustration, and worry. I'm very sorry for that, and we're working on ways to improve our reliability.

Update: For some people, not all of the comment pages were written properly. I've triggered a republishing of everyone's comment pages. This is slowing things down a little bit, but that should be over with in an hour or two. Also, I've been tuning the system to improve performance.

Posted by Chris on 12.28.2005. (64 Comments)
Downtime (News)

Some Powerblogs users experienced a few minutes of downtime recently. The short version is that it's over and won't happen again.

The longer version is that there was a very subtle bug in the new remote report software which somehow made one user's log report consume all available disk space (the bug was subtle; the effects certainly weren't). Servers require some disk space for temporary files and such, or they can't do things like serve pages or let people log in. I'm modifying the report software to ensure that it will never write large html files, so this problem will never happen again.
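One way to guarantee a report can never grow without bound is to cap it while writing. The sketch below is an illustration under assumed names (write_report_capped, ReportTooLarge), not the actual report software: it writes to a temporary file, gives up once a byte limit is passed, and cleans up after itself.

```python
import os


class ReportTooLarge(Exception):
    """Raised when a report would exceed its size limit."""


def write_report_capped(path, chunks, max_bytes):
    """Write text chunks to path, refusing to exceed max_bytes in total."""
    tmp_path = path + ".partial"
    written = 0
    try:
        with open(tmp_path, "wb") as out:
            for chunk in chunks:
                data = chunk.encode("utf-8")
                written += len(data)
                if written > max_bytes:
                    raise ReportTooLarge("%s would exceed %d bytes" % (path, max_bytes))
                out.write(data)
        # Only a complete, in-limit report ever replaces the old one.
        os.replace(tmp_path, path)
    except BaseException:
        # Never leave a half-written file behind to eat disk space.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

Checking the limit before each write means the file on disk never goes past max_bytes, even for a moment.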

Posted by Chris on 12.20.2005. (0 Comments)
rel="nofollow" (Improvements)

Google has created a way to fight comment spam. It involves adding rel="nofollow" to all of the links in comments, trackbacks, readership reports, etc. I've updated the Powerblogs software to make use of this attribute to help fight trackback spam.
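For the curious, here is a small sketch of the idea using Python's standard html.parser module. It is not the Powerblogs implementation; it simply re-emits an HTML fragment with rel="nofollow" on every link. Note that it replaces any existing rel value and drops HTML comments, which is acceptable for a sketch but worth knowing.

```python
from html import escape
from html.parser import HTMLParser


class NofollowRewriter(HTMLParser):
    """Re-emit an HTML fragment, forcing rel="nofollow" on every <a> tag."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []

    def _render(self, tag, attrs, closing=""):
        if tag == "a":
            # Drop any existing rel attribute, then add ours.
            attrs = [(k, v) for k, v in attrs if k != "rel"]
            attrs.append(("rel", "nofollow"))
        parts = [tag]
        for name, value in attrs:
            if value is None:
                parts.append(name)
            else:
                parts.append('%s="%s"' % (name, escape(value, quote=True)))
        return "<" + " ".join(parts) + closing + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.out.append("</%s>" % tag)

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append("&%s;" % name)

    def handle_charref(self, name):
        self.out.append("&#%s;" % name)


def add_nofollow(fragment):
    rewriter = NofollowRewriter()
    rewriter.feed(fragment)
    rewriter.close()
    return "".join(rewriter.out)


print(add_nofollow('<p>See <a href="http://example.com/">this</a></p>'))
```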

Posted by Chris on 12.17.2005. (0 Comments)
Reports coming back on line (News)

I've finally got the offsite log generation system to the point where it's downloading the logs, generating the reports, and uploading them. It will still be a few days until the logs are generated regularly — for the next few days I'm going to manually run the report generation so that I can watch it and make sure that everything is going well. I will run the report script at least once a day for now. Once I'm confident in the new system, I'll have it back to generating the reports every four hours.

Posted by Chris on 12.14.2005. (0 Comments)
reports status (News)

Just a heads up on the reports: I'm almost ready to go live now. I expect to have the reports live any day now, probably starting Saturday night. (Because log data is valuable, and the off-site solution involves moving it and deleting it on the webserver, extensive testing needs to be done to ensure that no readership data is ever lost. Testing and debugging always take more time than programming does, unfortunately.)
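The "move it, then delete it" step is where data could get lost, so the usual precaution is to delete only after the copy has been verified. Below is a minimal sketch of that pattern. Local paths stand in for the web server and the off-site machine, and the function names are made up; a real version would transfer over the network.

```python
import hashlib
import os
import shutil


def sha256_of(path):
    """Return the SHA-256 hex digest of a file, read in blocks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(65536), b""):
            digest.update(block)
    return digest.hexdigest()


def move_verified(src, dst):
    """Copy src to dst and delete src only if the copy matches exactly."""
    shutil.copy2(src, dst)
    if sha256_of(src) != sha256_of(dst):
        # Keep the original; throw away the bad copy.
        os.remove(dst)
        raise IOError("copy of %s does not match, original kept" % src)
    os.remove(src)
```

If anything goes wrong before the final os.remove, the log is still sitting on the source side, so the worst outcome is a retry rather than lost readership data.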

I am very grateful for the patience everyone has shown on this issue.

Posted by Chris on 12.08.2005. (0 Comments)
Reader statistics (News)

The new setup for readership statistics generation is going to be very robust, but it's not quite ready. It's the top priority project, though, and should be done very soon now. I'm hopeful that I can get it working before Wednesday starts.

Posted by Chris on 12.06.2005. (0 Comments)
Scheduled posts display order (Improvements)

The order that saved-for-later posts display in on the page for choosing them now groups the posts scheduled for publishing first, and orders them by date. This was suggested by the EclectEcon.
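As a sketch of that ordering: scheduled posts first, sorted by publish date, then everything else in its existing order. The field name publish_at and the dict layout are assumptions for illustration.

```python
from datetime import datetime


def display_order(saved_posts):
    """Scheduled posts first (earliest date first), then unscheduled ones.

    sorted() is stable, so unscheduled posts keep their original order.
    """
    return sorted(
        saved_posts,
        key=lambda p: (p["publish_at"] is None, p["publish_at"] or datetime.max),
    )


posts = [
    {"title": "draft", "publish_at": None},
    {"title": "late", "publish_at": datetime(2005, 12, 10)},
    {"title": "early", "publish_at": datetime(2005, 12, 5)},
    {"title": "notes", "publish_at": None},
]
print([p["title"] for p in display_order(posts)])
```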

(The off-site report generation is coming along, and I hope to have it up and running by this Monday.)

Related Posts (on one page):

  1. Scheduled posts display order
  2. Scheduled posts
Posted by Chris on 12.03.2005. (0 Comments)
Several small fixes (Bug Fixes)

A few small things have been cleaned up, mostly edge cases. The exception is saved posts — editing a post which is saved-for-later now properly saves and loads any post chain or category information for the post. (This didn't affect scheduled-publish posts, which were technically saved-for-later posts for various technical reasons.)

Posted by Chris on 12.02.2005. (0 Comments)
Scheduled posts (Improvements)

There is now a convenient interface (in the "Edit a Saved Post" page) for rescheduling posts which have been scheduled to be published. It uses AJAX, so it requires Firefox 1.0.7, Internet Explorer 6, Safari 2.0.2, Opera 8.5 or other ECMA-262 compatible browsers.

There are still a few shortcomings in the interface, mostly as a result of some highly technical (and subtle) JavaScript execution-context details, and I hope to work them out soon. It shouldn't be a significant issue, though. Mostly it just means that the rescheduling won't get saved on the server if you close the rescheduling window too quickly. I'll eventually figure out a way around this, but it shouldn't be a problem for many people anyway.

Related Posts (on one page):

  1. Scheduled posts display order
  2. Scheduled posts
Posted by Chris on 12.01.2005. (0 Comments)