PowerBlogs.Com Development

Live Backups now active (New Features)

It's taken longer than I had hoped, but the live backing up of posts in their source form has now been pushed live. Now, every time a post gets written, a copy of the text gets sent to an offsite-backup immediately before the post is written locally. As long as the server can reach the internet, then, published posts will be backed up. The current Live Backups offsite-backup server is different than the offsite backup server which was in place since the beginning of Powerblogs. Soon we'll get that server configured for Live Backups as well, and then we'll have two offsite backups of posts separated by a few hundred miles.

This has been tested and should not have a noticeable impact on speed, but it hasn't been tested live (obviously), so while it is first running live there may be some delays in posting (about 30 seconds at the very most). If there are, please bear with us and we'll have those delays resolved quickly (but do let us know).

Update: Of course, there were timing issues. Unfortunately, they ran into the minutes, not seconds, so I've temporarily disabled the Live Backup system. I'm going to modify it to run the backup in the background so it has no chance of slowing down posting, and push it live again. With luck, that will be tonight.

Update: This is attempt two. Now the backup process runs in parallel to the posting process, so there should be no way that the Live Backups can possibly interfere with posting. It works great in testing, so I've pushed it live again. If you have any issues posting, please let us know immediately. Thanks!

Update: Ok, attempt two didn't work either. It was fast enough, but apparently sometimes interfered with the front page being rewritten after posts are published. I disabled Live Backups again. The next version uses a completely different approach where the user interface process writes to a local process which then turns around and makes the foreign connection. The local connection will always be instant, since the same computer that you're running on can't disappear out from under you (or get lagged because some construction worker took out an internet line with a backhoe and traffic is being routed around it) unlike foreign computers, so this one should be downright foolproof. I hope to have it up tonight.

Update: Ok, here goes attempt three. As I mentioned, this one has the user interface process dumping the data off to another local process which can wait on the remote backup server without stalling the user interface or the writing of the blog pages. I've tested it very thoroughly, which means that it might work. ;)
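
For the technically curious, here is a minimal sketch of the hand-off idea in Python. It is not the actual Powerblogs code: it uses a background thread and an in-memory queue as a stand-in for the separate local process, and send_to_remote is a hypothetical callable that does the slow offsite copy.

```python
import queue
import threading


def start_backup_worker(send_to_remote):
    """Start a background worker and return (enqueue, shutdown) functions.

    send_to_remote is a hypothetical callable (an assumption of this
    sketch) that performs the slow offsite copy. It may block or raise
    without affecting whoever called enqueue().
    """
    pending = queue.Queue()

    def worker():
        while True:
            post_text = pending.get()
            if post_text is None:  # sentinel value: stop the worker
                break
            try:
                send_to_remote(post_text)
            except Exception:
                # A failed backup must never break posting. A real
                # system would log the failure and retry later.
                pass

    thread = threading.Thread(target=worker, daemon=True)
    thread.start()

    def enqueue(post_text):
        # Returns immediately: this is only a local, in-memory hand-off.
        pending.put(post_text)

    def shutdown():
        # Ask the worker to stop once everything queued has been sent.
        pending.put(None)
        thread.join()

    return enqueue, shutdown
```

The posting code only ever calls enqueue(), so the time it spends on backups does not depend on how slow or unreachable the remote server is.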

Posted by Chris on 03.24.2006. (0 Comments)
Tagline support (New Features)

I've added a TagLine variable to the header template. (The tagline can be edited in the Theme tab). Taglines underneath headers are increasingly common in blogs, and this allows one to set up the tagline without having to change the header template (which can get replaced by themes).

I still have to add support for the tagline to the existing themes and the custom theme creation script, but even as it is this should make things a bit easier for people who want to use taglines.

Posted by Chris on 11.16.2005. (0 Comments)
Trackback spam (Improvements, New Features)

One of our servers has lately been the victim of a massive amount of trackback spamming. This has been causing the system to be very slow and even brought down the server twice.

I've therefore gone in and streamlined the trackback rejection process significantly. The trackback drum now rejects trackbacks identified as spam much faster. As a result, the load on the server seems to be much lower and pretty stable. Thus this major problem is solved.

While I was rewriting the trackback drum, I added some word-based trackback filtering to help catch trackback spam by common words used by spammers. Currently, words only get added to this list on a manual basis, to avoid stupid things like "a", "an", "the", "no", etc. being added to the list.


(note: the rest of this is technical, so you can safely skip it if you're not interested in the technical aspects of the problem of getting rid of trackback spam.)

The Problem

The main problem with trackbacks is that by their definition they have to be sendable by arbitrary computers without human intervention. This is an open invitation to trackback spam, which predictably has come to pass.

Current Email-like Solutions

So of course there are the same solutions as with email: whitelist-only, blacklists, Bayesian filtering, etc. Whitelist-only generally proves way too cumbersome to actually use, so unless someone comes up with some really good insight on that score, we can pretty safely ignore it.

Word-based blacklisting is difficult to maintain, but otherwise a workable partial solution. (note: by difficult I don't mean "insurmountable", I mean "requires ongoing human effort which generally scales with the amount of effort going into spam".)

Host-based blacklisting looks, based on the patterns of spam, to be marginally effective since trackback spam comes from a whole lot of changing IP addresses. I haven't gathered enough data, though, so there might be some merit here. With the profusion of spam-related viruses, trojans, etc., though, I doubt that host-based blacklisting will be an effective long-term solution.

Target-based blacklisting might have more potential, as trackback spam (by its brevity) must take you to a URL, and it's not so easy to move around the hosts that URLs are at. This still requires ongoing maintenance, but it might be a workable partial solution as well.

Bayesian filtering is, conceptually speaking, a sort of automatic blacklisting. (For those not familiar with it, the 5-second explanation is that you train a Bayesian filter on text which you've categorized, and then ask it what category new text is in; i.e. you classify things as spam or not-spam, train it, then have it classify new trackbacks.) The main problem with Bayesian filtering is that it requires you to do ongoing work manually classifying trackbacks as spam or non-spam, and moreover is defeatable if spammers include a lot of text unrelated to their spam. One sees this fairly often with emails that include stories, etc. I believe that a popular method with email is to grab chunks of text from legitimate sources and include that in a MIME text alternative or in HTML which won't be displayed. The brevity of trackback excerpts might militate against this approach to defeating trackback spam filters, and presumably one can use a spell-checker to guard against intentionally misspelled words.
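
To make the 5-second explanation a bit more concrete, here is a minimal sketch of a two-category naive Bayes filter in Python. The class name, the tokenizer and the "spam"/"ham" labels are my own choices for illustration, and the smoothing constants are the simplest ones that avoid dividing by zero, not tuned values.

```python
import math
import re
from collections import Counter


class NaiveBayesFilter:
    """Tiny naive Bayes text classifier for the categories 'spam' and 'ham'."""

    def __init__(self):
        self.word_counts = {"spam": Counter(), "ham": Counter()}
        self.doc_counts = {"spam": 0, "ham": 0}

    @staticmethod
    def tokenize(text):
        # Lowercase words, digits and apostrophes; everything else splits.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, text, category):
        # This is the manual classification step: a human supplies category.
        self.word_counts[category].update(self.tokenize(text))
        self.doc_counts[category] += 1

    def log_score(self, text, category):
        counts = self.word_counts[category]
        vocabulary = set(self.word_counts["spam"]) | set(self.word_counts["ham"])
        total_words = sum(counts.values())
        total_docs = sum(self.doc_counts.values())
        # Smoothed prior: share of training examples in this category.
        score = math.log((self.doc_counts[category] + 1) / (total_docs + 2))
        for word in self.tokenize(text):
            # Add-one smoothing, with one extra slot for never-seen words,
            # so an unknown word cannot zero out the whole score.
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary) + 1))
        return score

    def classify(self, text):
        # Pick whichever category gives the text the higher log score.
        return max(("spam", "ham"), key=lambda c: self.log_score(text, c))
```

Usage is train() on each hand-classified trackback, then classify() on new ones. The sketch also shows the weakness mentioned above: padding a spam with lots of ordinary words pulls its score toward the non-spam category.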

Current Trackback-Specific Solutions

The only one I'm familiar with is link-verification. That is, when someone tracks back to you, go fetch the page that they're claiming is a post about your post and see if it indeed links to your post. It's an interesting idea, and from what I gather works now, but there are a few problems with this as a long-term solution:

  • It's costly to fetch a page and wait for it to come back. This means lots of processes sitting around on your machine tying up resources. Granted, you can impose a timeout delay, but to be realistic (i.e. not reject too many legitimate trackbacks) it would have to be something like 5-10 seconds. In that time you can accumulate an awful lot of processes handling trackbacks if the spammers are hitting you hard, and you really don't want to have that many resources on hand just to serve spammers.

  • The spammers can pretty easily add URLs to all of their victims on their page (depending on their implementation, but they can probably adapt to be able to do this), thus generating 100% false negatives. This will bloat the size of their pages and cost them bandwidth, but with some CGI they can probably guess who's on the other end and return a custom-tailored page to save on bandwidth.

  • This requires everyone to use the canonical URL when linking to posts. This isn't the biggest hurdle in the world, but some people have multiple URLs, and besides I've seen MovableType installations which give out non-canonical URLs as their permalinks. This is far from an insurmountable objection, but it does mean that sites which use it will break more frequently.

It's definitely an idea with promise, but I'm very leery of the costs imposed by fetching a page every time someone you don't trust causes you to. It's just got the makings of a denial-of-service attack.
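
The checking half of link-verification can be sketched as follows, assuming the page has already been fetched. The function and class names are hypothetical, and the expensive part (the fetch itself, with whatever timeout you settle on) is deliberately left out. The comparison ignores fragments and trailing slashes but does nothing else about the non-canonical URL problem.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag


class LinkCollector(HTMLParser):
    """Collects the href value of every A tag in a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def normalize(url):
    # Drop any #fragment and a trailing slash before comparing.
    url, _fragment = urldefrag(url.strip())
    return url.rstrip("/")


def page_links_to(page_html, permalink):
    """Return True if page_html contains a link to permalink."""
    collector = LinkCollector()
    collector.feed(page_html)
    target = normalize(permalink)
    return any(normalize(href) == target for href in collector.hrefs)
```

Note that a spammer who lists every victim's URL on his page passes this check trivially, which is the second objection above.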

Speculative Solutions

No A tags or links in excerpts

The current speculative solution which seems very promising is to refuse all trackbacks with A tags or URLs in the excerpt. I don't think that trackback spammers will get around this very easily, as forgoing links or URLs in the excerpts of their trackbacks will substantially reduce the usefulness of trackback spam to them. It's true that this restricts legitimate trackbacks as well, but I've looked around and most legitimate trackbacks don't have links in them anyway. Besides, if the norm becomes that links in trackback excerpts are not permissible, legitimate bloggers will have lost very little, since the purpose of the excerpt is to be an excerpt; the link is provided to be a link.

Of course, if this becomes common trackback spammers will have to do without links in their excerpts and hope that people click on the trackback link. Still, while it will then be useless in filtering out trackback spam (because no spam will match), it will reduce the usefulness of sending trackback spam. If we can reduce the benefit to a trackback spammer of his spam far enough, fewer of them will bother. (Of course, it can have the opposite effect, as well: if we cut the usefulness of trackback spam in half, they might just send out twice as many spams. Still, the end of that argument — give the trackback spammer all of our money and he won't spam us — is just to avoid something bad by something worse.)
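
A minimal sketch of the excerpt check in Python, assuming that "A tags or URLs" means an opening A tag, an explicit http or https URL, or a bare www-style address. The exact pattern is my assumption and would need adjusting against real spam.

```python
import re

# Anything in an excerpt that looks like a link (assumed definition).
LINK_PATTERN = re.compile(
    r"<\s*a\b"                # an opening A tag
    r"|https?://"             # an explicit http or https URL
    r"|\bwww\.[a-z0-9-]+\.",  # a bare www.host.tld style address
    re.IGNORECASE,
)


def excerpt_has_link(excerpt):
    """Return True if the excerpt should be refused under this rule."""
    return LINK_PATTERN.search(excerpt) is not None
```

This is a single regular-expression search with no network access, so it is about as cheap as a rejection test can be.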

Assume most trackbacks are spam

Because legitimate trackbacks are relatively uncommon (since they require a human being to write a post about another blogger's post), and because of the sheer volume of trackback spam, it's probably safe to assume that as a long-term average, 90% or more of the trackbacks received are spam. The numbers might be a lot worse, too — they might be more like 95% spam or 99% spam.

If that is the case, we might be able to come up with decent automatic word-based blacklisting for titles and excerpts by a heuristic something like: if more than 10% of trackbacks and less than 95% of trackbacks in a 24-hour period contain the word, add it to the blacklist. Trackback spams, by their nature, must all be very similar. I believe that some spammers are automatically generating their spams to avoid them being identical, but they still have to have a fair amount in common, because the spammer is sending his spams for a reason. If the spammer wants to sell some black-market pharmaceutical, he has to try to communicate this to you somehow. There are many ways in human languages to communicate ideas, but they tend to have a lot of commonalities. For example, it would be very difficult to inform someone that you're selling viagra without mentioning viagra, vi agra, v1agra, "a better sex life", etc. Moreover, this being the contentful part which is supposed to be understood by the reader, some human had to put it together in a way that would be understandable. That creates a human-effort bottleneck which will tend to encourage similarity.

Obviously, as with any heuristic, the devil is in the details. Maybe the right numbers are 5%-80% == spam, (< 5% || > 80%) == non-spam. Maybe the right range is 20%-80% or 30%-98%. Regardless, there's probably some range which does guarantee that the words are spam-specific. (The reason for the upper limit is to filter out things which are common to every trackback, e.g. "a", "an", "the".)
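
The frequency heuristic described above can be sketched in a few lines of Python. The function name and the tokenizer are assumptions of mine, the 10% and 95% defaults are just the starting numbers from the text, and the input is assumed to be the title and excerpt text of every trackback received in the period.

```python
import re
from collections import Counter


def build_blacklist(trackback_texts, low=0.10, high=0.95):
    """Return the words found in more than `low` and less than `high`
    of the given trackbacks (both as fractions of the total)."""
    total = len(trackback_texts)
    if total == 0:
        return set()
    doc_frequency = Counter()
    for text in trackback_texts:
        # Count each word once per trackback, however often it repeats.
        doc_frequency.update(set(re.findall(r"[a-z0-9']+", text.lower())))
    return {word for word, n in doc_frequency.items() if low < n / total < high}
```

With 18 copies of a pill spam and 2 legitimate trackbacks, the spam words land inside the range, "the" is dropped by the upper limit, and the words of the legitimate trackbacks fall below the lower limit.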

If simple blacklisting from this sort of categorization catches too many legitimate messages, it might work to make it a scoring system instead: each word in the appropriate range is worth a point, and trackbacks which score 2 or more points are rejected. There are a lot of possibilities.
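
A minimal sketch of the scoring variant, assuming the blacklist is simply a set of lowercase words and that a word counts once no matter how often it is repeated. The names and the default threshold of 2 are illustrative.

```python
import re


def spam_score(text, blacklist):
    """One point for each distinct blacklisted word in the text."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return len(words & blacklist)


def should_reject(text, blacklist, threshold=2):
    """Reject once the score reaches the threshold."""
    return spam_score(text, blacklist) >= threshold
```

A legitimate trackback that happens to use one blacklisted word ("a cheap shot at my argument") gets through, while one that hits two does not.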

The beauty of this technique is that it requires no ongoing human effort in order to get its classifications, and it gets its classification by attacking our enemy in his strength. That is, the reason that trackback spam is attractive to the spammer is that he sends so much of the stuff. If he were to cut back, the spam wouldn't do him much good (and wouldn't annoy us decent folk much, either). So the spammer can't avoid sending the vast majority of spams without hurting his own cause substantially. (I learned the idea of "attack an enemy's strength, not his weakness" from a marketing book, Marketing Warfare, which in turn got it from military strategists, possibly Clausewitz. The basic idea is that if you attack an enemy's weakness, he'll fix that weakness and you'll have wasted your effort if you didn't get him immediately. If you attack an enemy's strength, he can't respond without reducing his strength, so no matter how he responds or doesn't, you weaken him. One fairly obvious example is that if your strategy is low cost through economies of scale, you have to homogenize your products because that's the only way to get economies of scale, so a competitor can attack you by offering customization. If you respond by offering customization, you'll have to increase the cost of your products and so compete without your cost advantage. If you don't respond, you cede that part of the market. (The proper response is to try to make your products flexible so that people won't want customization, but there are a lot of markets where that doesn't work.))

Conclusions

As always, we need more information. In particular, a good collection of spam and non-spam trackbacks together with some analysis will prove very useful. I'll try to post it if I come up with anything interesting.

Posted by Chris on 11.15.2005. (0 Comments)
Time Zones (New Features)

I must apologize for the long absence from posting. There's been a lot going on and I've just been swamped.

Powerblogs now supports configurable time zones, so your blog can now show its dates in any time zone you choose.

Posted by Chris on 11.11.2005. (0 Comments)
Update Service Pings (New Features)

It is now possible to disable sending pings to the update services. If you're not interested in the update services, you now don't need to waste time or bandwidth on them. (The default is still to send them.)

Posted by Chris on 09.03.2005. (0 Comments)
XML RPC interface (New Features)

Powerblogs now supports posting with an XML RPC interface. Specifically, users can use programs which support the MetaWeblog API with the URL http://blogname.powerblogs.com/admin/xmlrpc.pl — this might be especially handy for users who want to post from PDAs.

It also supports the Blogger API, but the Blogger API is pretty terrible: it doesn't even support titles. At some point, I'll probably implement as much compatibility with the MovableType API as possible, and likely also implement a Powerblogs-specific XML RPC API (I am toying with the notion of writing a Java post composition program to go with Powerblogs to make disconnected post composition easier, but this is a really long-term goal, not anything which is going to happen soon).
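
For anyone writing their own client, here is a minimal sketch in Python of what a MetaWeblog post request looks like on the wire. It only builds the XML and sends nothing. The blog id, username and password are placeholders, and what Powerblogs expects for the blog id is an assumption on my part.

```python
import xmlrpc.client


def build_new_post_request(blog_id, username, password, title, body, publish=True):
    """Return the XML-RPC request body for a metaWeblog.newPost call."""
    # In the MetaWeblog API the post is a struct; the body of the
    # post goes in the "description" member.
    post = {"title": title, "description": body}
    params = (blog_id, username, password, post, publish)
    return xmlrpc.client.dumps(params, methodname="metaWeblog.newPost")
```

To actually post, the same call can be made through xmlrpc.client.ServerProxy pointed at the xmlrpc.pl URL above, as proxy.metaWeblog.newPost(blog_id, username, password, post, publish).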

Posted by Chris on 08.07.2005. (2 Comments)
Category Trees (New Features)

A new variable has been added to the sidebar templates — CategoryTree. It gives a list of the categories, but does so as a nested series of lists, representing the parent/child relationships in the category tree.

The same category tree is also displayed on the manage categories page now, so it's easier to remember what the various parent/child relationships in the tree are.
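
As an illustration of what "a nested series of lists" means, here is a minimal sketch in Python that turns parent/child relationships into nested UL lists. The input format (a mapping from category name to parent name) and the alphabetical ordering are assumptions for the example, not a description of how CategoryTree is implemented, and the sketch assumes there are no cycles.

```python
from html import escape


def render_category_tree(parent_of):
    """parent_of maps category name -> parent name (None for top level)."""
    children = {}
    for name, parent in parent_of.items():
        children.setdefault(parent, []).append(name)

    def render(parent, depth):
        names = sorted(children.get(parent, []))
        if not names:
            return ""
        indent = "  " * depth
        lines = [f"{indent}<ul>"]
        for name in names:
            lines.append(f"{indent}  <li>{escape(name)}")
            # Child categories become a list nested inside this item.
            nested = render(name, depth + 2)
            if nested:
                lines.append(nested)
            lines.append(f"{indent}  </li>")
        lines.append(f"{indent}</ul>")
        return "\n".join(lines)

    return render(None, 0)
```

Calling it with {"Politics": None, "Elections": "Politics", "Tech": None} gives a top-level list of Politics and Tech, with Elections in a list nested under Politics.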

Posted by Dev Team on 04.06.2005. (0 Comments)
Comment Account Approval Notification (New Features)

Applicants for comment accounts will now automatically be notified via email when you approve their account (if you require account approval).

Posted by Chris on 03.30.2005. (0 Comments)
Trackbacks page template (New Features)

The trackbacks page is now controlled via an editable template. You can set the text above the trackbacks to anything that you want, or add any arbitrary HTML that you want to.

Posted by Chris on 03.30.2005. (0 Comments)
Comment Approval (New Features)

It is now possible to require comments to get approval (individually) before they appear on posts. For those having problems with uncivil commenters, this is the ultimate weapon that guarantees you will win. The drawback, of course, is that you have to be active in frequently moderating new comments or it will slow the discussions down. Still, it could well be a good tool for encouraging only extremely thoughtful comments.

Thanks, Keith!

Posted by Chris on 03.26.2005. (0 Comments)
New Post Variables (comment-related) (New Features)

There are two new variables for the post templates: [$CommentsURL$] and [$CommentsCount$]. With these variables, you can create your own comments link that does pretty nearly anything that you want. Thanks for the idea, Andrew!
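
To show how the two variables might be combined into a custom link, here is a small Python sketch that fills [$Name$] placeholders in a template string. The render_template function is purely illustrative. Powerblogs does its own substitution, and how it does so is not something this sketch claims to reproduce.

```python
import re

# Matches placeholders written as [$Name$].
PLACEHOLDER = re.compile(r"\[\$(\w+)\$\]")


def render_template(template, values):
    """Replace each [$Name$] with values['Name'], if present."""
    def substitute(match):
        name = match.group(1)
        # Leave unknown placeholders untouched rather than guessing.
        return str(values.get(name, match.group(0)))
    return PLACEHOLDER.sub(substitute, template)


# A hypothetical custom comments link for a post template.
EXAMPLE_TEMPLATE = '<a href="[$CommentsURL$]">Comments ([$CommentsCount$])</a>'
```

With a CommentsURL of /comments/42 and a CommentsCount of 3, the example template renders as a link to /comments/42 with the text "Comments (3)".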

Posted by Chris on 03.26.2005. (0 Comments)
Category renaming and reparenting (New Features)

The Manage Categories page now has two more options: rename a category and reparent a category.

Related Posts (on one page):

  1. Reparenting to a parent
  2. Category renaming and reparenting
Posted by Chris on 03.23.2005. (0 Comments)
Trackbacks (New Features)

It's now possible to have the trackbacks appear in a popup, like comments.

Posted by Chris on 03.02.2005. (0 Comments)
Recent Post List (New Features)

There's now a variable available in the Header, Footer, Right Sidebar and Left Sidebar that prints a list of the most recent posts: RecentPostTitles — it's documented on those pages.

Posted by Chris on 02.27.2005. (0 Comments)
Initial configuration wizard (New Features)

I've uploaded the first version of the Powerblogs initial configuration wizard. It should make things easier by prompting people to make a few important choices before they start blogging. At some point, it will probably mention tips and tricks, too.

Posted by Chris on 02.24.2005. (0 Comments)
Image browsing and inserting (New Features)

When Powerblogs started, it was tedious to insert images because you had to upload them, then copy over the name of the image. If you were ambitious, you'd also look up the width and height of the image and add them to the image tag.

The "Upload & Insert a Picture" tool improved the situation a lot because it combined the upload and the insertion into one, plus it harnessed the power of computers to do boring work to add the width and height attributes.

It was still tedious to insert the same image into multiple posts, though. While life was good for people who wanted to post pictures of some particular event, there was no good way to use images to signify some category, like a picture to categorize what you're talking about.

All that is finally at an end. There's now a tool to browse through the images which you've already uploaded and insert them! At long last, most anything you'll want to do with images is now easy. Enjoy!

Posted by Chris on 02.10.2005. (0 Comments)
Breaking post chains (New Features)

There's finally a script to break post chains. In particular, the "Break Post Chain" page allows you to break one or more posts out of their chains ("Break Post Chain" is found in the "Posts" tab).

Update: There had been a bug with this script where not all of the archives in the post chain were getting rebuilt. That's now fixed.

Posted by Chris on 02.09.2005. (0 Comments)
Spell Checker (New Features)

There is now a spell checking tool available via the tools menu. It's still in BETA, but it should be pretty usable.

Posted by Chris on 01.27.2005. (0 Comments)
A new report (New Features)

At the request of Eugene Volokh, I've added a new report (available in the reports section) which generates statistics on the activity of all of the bloggers in a blog.

It assembles a fair number of statistics which makes for a very large table. As a result, I've also added an ability to get the data as a .csv (comma separated values) file for easy importing into a spreadsheet.

Posted by Chris on 01.20.2005. (0 Comments)
Post Chain pages (New Features)

Post chains now get a page devoted to that post chain. Thus instead of clicking through to view each post in a post chain, a reader can just click the "on one page" link and read all of the posts in chronological order.

Posted by Chris on 01.16.2005. (0 Comments)