The logfile reports will resume generation within the next few days. When they resume they'll be generated off-site, so they will never again impact the performance of the powerblogs servers. Thanks for your patience.
And happy Thanksgiving!
I've added a TagLine variable to the header template. (The tagline can be edited in the Theme tab.) Taglines underneath headers are increasingly common in blogs, and this lets you set up a tagline without having to change the header template (which can get replaced by themes).
I still have to add support for the tagline to the existing themes and the custom theme creation script, but even as it is this should make things a bit easier for people who want to use taglines.
I just pushed some code cleanups live. In a few places that meant streamlining the code, but mostly it means that more templates take advantage of caching. Both of these should result in minor speed increases during publishing.
The Theme tab has been cleaned up some as there is now one script to edit the four stylesheets. It's also been improved to be a bit easier during multiple edits.
Also, the custom sidebar content is now available via the Theme tab, which it always should have been. I'm also planning to expand it to have at least two and probably four custom sections, to accommodate the current trend towards having advertising on both sides (right now, to put advertising on both sides you have to stick it into the sidebar templates directly, which is not a big deal but is sub-optimal).
One of our servers has lately been the victim of a massive amount of trackback spamming. This has made the system very slow and has even brought down the server twice.
I've therefore gone in and streamlined the trackback rejection process significantly. The trackback drum now rejects trackbacks identified as spam much faster. As a result, the load on the server seems to be much lower and pretty stable. Thus this major problem is solved.
While I was rewriting the trackback drum, I added some word-based trackback filtering to help catch trackback spam by common words used by spammers. Currently, words only get added to this list on a manual basis, to avoid stupid things like "a", "an", "the", "no", etc. being added to the list.
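For the curious, a manually maintained word filter of this kind can be sketched in a few lines of Python. This is only an illustration, not the actual Powerblogs code; the function name and the entries in the list are made up for the example.

```python
import re

# Hypothetical, hand-maintained list; these entries are illustrative only.
BLACKLISTED_WORDS = {"viagra", "casino", "poker"}

def contains_blacklisted_word(title, excerpt, blacklist=BLACKLISTED_WORDS):
    """Return True if the title or excerpt contains any blacklisted word."""
    # Lowercase and split into simple word tokens so "VIAGRA!" matches "viagra".
    words = set(re.findall(r"[a-z0-9']+", f"{title} {excerpt}".lower()))
    return not words.isdisjoint(blacklist)
```

Because the list is only ever edited by hand, common words like "a" or "the" never end up on it by accident.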
(Note: the rest of this is technical, so you can safely skip it if you're not interested in the technical aspects of getting rid of trackback spam.)
The main problem with trackbacks is that by their definition they have to be sendable by arbitrary computers without human intervention. This is an open invitation to trackback spam, which predictably has come to pass.
So of course there are the same solutions as with email: whitelist-only, blacklists, Bayesian filtering, etc. Whitelist-only generally proves way too cumbersome to actually use, so unless someone comes up with some really good insight on that score, we can pretty safely ignore it.
Word-based blacklisting is difficult to maintain, but otherwise a workable partial solution. (note: by difficult I don't mean "insurmountable", I mean "requires ongoing human effort which generally scales with the amount of effort going into spam".)
Host-based blacklisting looks, based on the patterns of spam, to be marginally effective since trackback spam comes from a whole lot of changing IP addresses. I haven't gathered enough data, though, so there might be some merit here. With the profusion of spam-related viruses, trojans, etc., though, I doubt that host-based blacklisting will be an effective long-term solution.
Target-based blacklisting might have more potential, as trackback spam (by its brevity) must take you to a URL, and it's not so easy to move around the hosts that URLs are at. This still requires ongoing maintenance, but it might be a workable partial solution as well.
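A minimal sketch of target-based blacklisting in Python, assuming the check is done on the host name of the URL the trackback points at. The function name and the blacklist entries are placeholders, not anything Powerblogs actually uses.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist of target hosts; the entries are placeholders.
BLACKLISTED_HOSTS = {"spam.example", "pills.example"}

def target_is_blacklisted(url, blacklist=BLACKLISTED_HOSTS):
    """Return True if the URL's host, or any parent domain of it, is listed."""
    # urlsplit only finds a host when the URL has a scheme (e.g. "http://");
    # anything else yields an empty host here and is treated as not listed.
    host = urlsplit(url).hostname or ""
    parts = host.split(".")
    # Check "www.spam.example", then "spam.example", then "example".
    return any(".".join(parts[i:]) in blacklist for i in range(len(parts)))
```

Matching parent domains as well means a spammer can't dodge the list just by inventing a new subdomain.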
Bayesian filtering is, conceptually speaking, a sort of automatic blacklisting. (For those not familiar with it, the 5-second explanation is that you train a Bayesian filter on text which you've categorized, and then ask it what category new text is in; i.e. you classify things as spam or not-spam, train it, then have it classify new trackbacks.) The main problem with Bayesian filtering is that it requires you to do ongoing work manually classifying trackbacks as spam or non-spam, and moreover it is defeatable if spammers include a lot of text unrelated to their spam. One sees this fairly often with emails that include stories, etc. I believe that a popular method with email is to grab chunks of text from legitimate sources and include them in a MIME text alternative or in HTML which won't be displayed. The brevity of trackback excerpts might militate against this way of defeating the filter, and presumably one can use a spell-checker to guard against intentionally misspelled words.
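To make the train-then-classify idea concrete, here is a toy naive Bayes classifier in Python. It is a sketch under simplifying assumptions (word tokens only, add-one smoothing, two labels called "spam" and "ham"), and the class and method names are invented for the example; a real filter would need far more training data and tuning.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayesFilter:
    """Toy two-class naive Bayes text classifier (labels: 'ham', 'spam')."""

    def __init__(self):
        self.word_counts = {"ham": Counter(), "spam": Counter()}
        self.doc_counts = {"ham": 0, "spam": 0}

    def train(self, text, label):
        # The manual step: a human has already decided the label.
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def classify(self, text):
        vocab = set(self.word_counts["ham"]) | set(self.word_counts["spam"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label in ("ham", "spam"):
            # Log prior, smoothed so an untrained filter doesn't divide by zero.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(self.word_counts[label].values())
            for word in tokenize(text):
                count = self.word_counts[label][word]
                # Add-one smoothing, with one extra slot for unseen words.
                score += math.log((count + 1) / (total_words + len(vocab) + 1))
            scores[label] = score
        # On a tie, "ham" wins because it was scored first.
        return max(scores, key=scores.get)
```

Usage looks like this:

```python
f = NaiveBayesFilter()
f.train("buy cheap viagra online now", "spam")
f.train("interesting post about caching and templates", "ham")
label = f.classify("buy cheap pills")
```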
The only trackback-specific solution I'm familiar with is link verification. That is, when someone tracks back to you, go fetch the page that they're claiming is a post about your post and see if it indeed links to your post. It's an interesting idea, and from what I gather it works now, but there are a few problems with it as a long-term solution:
It's costly to fetch a page and wait for it to come back. This means lots of processes sitting around on your machine tying up resources. Granted, you can impose a timeout delay, but to be realistic (i.e. not reject too many legitimate trackbacks) it would have to be something like 5-10 seconds. In that time you can accumulate an awful lot of processes handling trackbacks if the spammers are hitting you hard, and you really don't want to have that many resources on hand just to serve spammers.
The spammers can pretty easily add links to all of their victims' posts on their own pages (depending on their implementation, but they can probably adapt to be able to do this), thus generating 100% false negatives. This will bloat the size of their pages and cost them bandwidth, but with some CGI they can probably guess who's on the other end and return a custom-tailored page to save on bandwidth.
This requires everyone to use the canonical URL when linking to posts. This isn't the biggest hurdle in the world, but some people have multiple URLs, and besides I've seen MovableType installations which give out non-canonical URLs as their permalinks. This is far from an insurmountable objection, but it does mean that sites which use it will break more frequently.
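For reference, link verification along the lines described above might look roughly like this in Python. It is a sketch, not a recommendation: the function names, the 5-second default timeout, and the size cap are assumptions made for the example, and the URL comparison is deliberately naive (exact match apart from a trailing slash), which is exactly where the canonical-URL problem bites.

```python
import http.client
import re
import urllib.request

def page_links_to(html, target_url):
    """Return True if some href in the HTML equals target_url.

    Only a trailing slash is ignored; non-canonical URLs for the same post
    will not match.
    """
    hrefs = re.findall(r'href\s*=\s*["\']([^"\']+)["\']', html, flags=re.IGNORECASE)
    wanted = target_url.rstrip("/")
    return any(href.rstrip("/") == wanted for href in hrefs)

def verify_trackback(source_url, target_url, timeout=5.0, max_bytes=200_000):
    """Fetch the claimed source page and check that it links to our post."""
    try:
        # The timeout bounds each blocking socket operation, not the whole
        # fetch, so a slow-dripping server can still tie up this process.
        with urllib.request.urlopen(source_url, timeout=timeout) as response:
            # Cap the download so a huge page can't eat memory or bandwidth.
            html = response.read(max_bytes).decode("utf-8", errors="replace")
    except (OSError, ValueError, http.client.HTTPException):
        # Unreachable, timed out, malformed URL, or broken response: reject.
        return False
    return page_links_to(html, target_url)
```

Every call to `verify_trackback` is a fetch made on behalf of an untrusted sender, which is the cost concern raised in the first point.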
It's definitely an idea with promise, but I'm very leery of the costs imposed by fetching a page every time someone you don't trust causes you to. It's just got the makings of a denial-of-service attack.
The current speculative solution which seems very promising is to refuse all trackbacks with A tags or URLs in the excerpt. I don't think that trackback spammers will get around this very easily, as forgoing links or URLs in the excerpts of their trackbacks will substantially reduce the usefulness of trackback spam to them. It's true that this restricts legitimate trackbacks as well, but I've looked around and most legitimate trackbacks don't have links in them anyway. Besides, if the norm becomes that links in trackback excerpts are not permissible, legitimate bloggers will have lost very little, since the purpose of the excerpt is to be an excerpt; the link is provided to be a link.
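A rough Python sketch of that check follows. The two patterns are assumptions about what counts as "an A tag or URL" (an opening `<a` tag, or text starting with `http://`, `https://`, or `www.`); a real filter would probably want to be stricter or looser depending on what the spam actually looks like.

```python
import re

# An opening anchor tag such as <a href=...> (but not <abbr> or <address>).
A_TAG = re.compile(r"<\s*a\b", re.IGNORECASE)
# A bare URL: a scheme or a leading "www." followed by non-space characters.
URL = re.compile(r"(?:https?://|www\.)\S+", re.IGNORECASE)

def excerpt_has_link(excerpt):
    """Return True if the excerpt contains an A tag or something URL-like."""
    return bool(A_TAG.search(excerpt) or URL.search(excerpt))
```

A trackback would then be refused whenever `excerpt_has_link(excerpt)` is true, before any more expensive checks run.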
Of course, if this becomes common trackback spammers will have to do without links in their excerpts and hope that people click on the trackback link. Still, while it will then be useless in filtering out trackback spam (because no spam will match), it will reduce the usefulness of sending trackback spam. If we can reduce the benefit to a trackback spammer of his spam far enough, fewer of them will bother. (Of course, it can have the opposite effect, as well: if we cut the usefulness of trackback spam in half, they might just send out twice as many spams. Still, the end of that argument — give the trackback spammer all of our money and he won't spam us — is just to avoid something bad by something worse.)
Because legitimate trackbacks are relatively uncommon (since they require a human being to write a post about another blogger's post), and because of the sheer volume of trackback spam, it's probably safe to assume that as a long-term average, 90% or more of the trackbacks received are spam. The numbers might be a lot worse, too — they might be more like 95% spam or 99% spam.
If that is the case, we might be able to come up with decent automatic word-based blacklisting for titles and excerpts by a heuristic something like: if more than 10% of trackbacks and less than 95% of trackbacks in a 24-hour period contain the word, add it to the blacklist. Trackback spams, by their nature, must all be very similar. I believe that some spammers are automatically generating their spams to avoid them being identical, but they still have to have a fair amount in common, because the spammer is sending his spams for a reason. If the spammer wants to sell some black-market pharmaceutical, he has to try to communicate this to you somehow. There are many ways in human languages to communicate ideas, but they tend to have a lot of commonalities. For example, it would be very difficult to inform someone that you're selling viagra without mentioning viagra, vi agra, v1agra, "a better sex life", etc. Moreover, this being the contentful part which is supposed to be understood by the reader, some human had to put it together in a way that would be understandable. That creates a human-effort bottleneck which will tend to encourage similarity.
Obviously, as with any heuristic, the devil is in the details. Maybe the right numbers are 5%-80% == spam, (< 5% || > 80%) == non-spam. Maybe the right range is 20%-80% or 30%-98%. Regardless, there's probably some range which does guarantee that the words are spam-specific. (The reason for the upper limit is to filter out things which are common to every trackback, e.g. "a", "an", "the".)
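Here is one way the frequency heuristic could be sketched in Python, using the 10% and 95% bounds as defaults. The function name is made up, and it assumes you have already collected the text (title plus excerpt) of every trackback received in the window, such as the last 24 hours.

```python
import re
from collections import Counter

def build_blacklist(trackbacks, low=0.10, high=0.95):
    """Return the words found in more than `low` and less than `high`
    of the given trackback texts (fractions between 0 and 1)."""
    total = len(trackbacks)
    if total == 0:
        return set()
    doc_freq = Counter()
    for text in trackbacks:
        # Count each word once per trackback, however often it repeats.
        doc_freq.update(set(re.findall(r"[a-z0-9']+", text.lower())))
    return {word for word, n in doc_freq.items() if low < n / total < high}
```

With ten trackbacks of which nine mention "viagra" and all ten contain "the", "viagra" (90%) lands on the list while "the" (100%) is excluded by the upper bound. The whole approach leans on the assumption that the window is dominated by spam; on a small or mostly legitimate sample it will happily blacklist ordinary words.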
If simple blacklisting from this sort of categorization catches too many legitimate messages, it might work to make it a scoring system instead: each word in the appropriate range is worth a point, and trackbacks which score 2 or more points are rejected. There are a lot of possibilities.
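The scoring variant is a small change, sketched here in Python. The names are hypothetical, and `flagged_words` stands for whatever set of words the frequency heuristic produced; the default threshold of 2 points is the figure suggested above.

```python
import re

def spam_score(text, flagged_words):
    """One point for each distinct flagged word that appears in the text."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return len(words & flagged_words)

def should_reject(text, flagged_words, threshold=2):
    """Reject once the score reaches the threshold."""
    return spam_score(text, flagged_words) >= threshold
```

With `flagged_words = {"buy", "cheap", "viagra"}`, a trackback reading "A cheap shot at my argument" scores 1 and gets through, while "Buy cheap viagra" scores 3 and is rejected.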
The beauty of this technique is that it requires no ongoing human effort in order to get its classifications, and it gets its classification by attacking our enemy in his strength. That is, the reason that trackback spam is attractive to the spammer is that he sends so much of the stuff. If he were to cut back, the spam wouldn't do him much good (and wouldn't annoy us decent folk much, either). So the spammer can't avoid sending the vast majority of spams without hurting his own cause substantially. (I learned the idea of "attack an enemy's strength, not his weakness" from a marketing book, Marketing Warfare, which in turn got it from military strategists, possibly Clausewitz. The basic idea is that if you attack an enemy's weakness, he'll fix that weakness and you'll have wasted your effort if you didn't get him immediately. If you attack an enemy's strength, he can't respond without reducing his strength, so no matter how he responds or doesn't, you weaken him. One fairly obvious example is that if your strategy is low cost through economies of scale, you have to homogenize your products because that's the only way to get economies of scale, so a competitor can attack you by offering customization. If you respond by offering customization, you'll have to increase the cost of your products and so compete without your cost advantage. If you don't respond, you cede that part of the market. (The proper response is to try to make your products flexible so that people won't want customization, but there are a lot of markets where that doesn't work.))
As always, we need more information. In particular, a good collection of spam and non-spam trackbacks together with some analysis will prove very useful. I'll try to post it if I come up with anything interesting.
Those of you who have comments emailed to you may have noticed that the name, email, and URL were being omitted for commenters with comment accounts. This has now been fixed.
So, for whatever reason, WordPress doesn't support the industry standard RDF for trackback auto-discovery. Instead, it uses a special link tag (which it can get away with because WordPress permalinks must be on individual pages -- WordPress can't use weekly archives as post permalinks). Anyhow, all grumbling aside, Powerblogs now supports this method of trackback auto-discovery.
I must apologize for the long absence from posting. There's been a lot going on and I've just been swamped.
Powerblogs now supports configurable time zones, so your blog can now show its dates in any time zone you choose.