IHY: blocking referrer spam with htaccess

Apr 15, 2006 • 10:45 am | comments (2)

Referrer Spam - Why you shouldn't publish your Web logs, an old thread at Best Practices Search Engine Forums originally started by Alan Perkins, was recently revived by IHY member Dave B, who posted an .htaccess file that looks like it could make a big dent in referrer spam for web sites that implement it.

Referrer spam is the practice of sending fake visitors with fake referrers to web sites, so that the spammer's URL appears in their log files. This is done in the hope that search engines will find the links and boost the spamming site's rankings.

It's not just a search engine spam problem, though: referrer spam can also interfere with traffic analysis. I checked my logs against Dave's htaccess list, and it looks like about 95% of the fake traffic would be blocked. Nice.
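The technique boils down to matching known spam domains in the Referer header and refusing those requests. A minimal sketch in 2006-era Apache syntax, using made-up placeholder domains rather than Dave B's actual list:

```apache
# Flag requests whose referrer matches a known spam domain.
# These patterns are illustrative placeholders, not a real blocklist.
SetEnvIfNoCase Referer "casino-example\.com" spam_ref
SetEnvIfNoCase Referer "texas-holdem-example\.net" spam_ref

# Refuse flagged requests with a 403.
Order Allow,Deny
Allow from all
Deny from env=spam_ref
```

Note that denied requests still receive a 403 and may still show up in the raw access log; to keep them out of the log entirely you can also condition the CustomLog directive on the same variable (`env=!spam_ref`).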




Search Engines Web

04/15/2006 10:57 pm

That .htaccess "filtering" will demand a considerable amount of processing from your server. And just like email subject lines, spammers could get around the "keyword" filtering by using modified spellings or misspellings. Even if you do publish your web stats and decide NOT to make them a password-protected directory, you could just as easily disallow them from being spidered in robots.txt. And with some highly developed client-side trackers, you can ban IP addresses from being counted as visits, so as to focus on the REAL visits.
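The robots.txt approach mentioned above can be sketched as follows, assuming a hypothetical /stats/ directory holding the published logs:

```
User-agent: *
Disallow: /stats/
```

This only asks well-behaved crawlers to stay out; as the reply below notes, it does nothing to keep humans or misbehaving bots away from the stats pages.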

Dan Thies

04/18/2006 03:54 pm

Search Engines Web, I see you posting on blogs a lot. Hope that's working for you. There's more than one objective here: preventing someone from getting ranked for some porn search term I don't care about isn't high on my list. Having clean log files is, as is preventing competitors from reading my stats.

Misspellings would defeat the htaccess filtering, but they would also largely defeat the purpose of spamming referrer logs with keyword-stuffed domains. :D There's no reason for a spammer to bother, since there are always more open stats pages to exploit.

Using robots.txt would keep the regular search engines from spidering log files, but you have no idea how much information you're giving away to competitors if your stats can be found by just anyone. Password protection is a minimum to avoid doing that.

I'm sure there are plenty of ways to do this, including post-processing of log files with the same pattern matching, to eliminate "visits" from logfile spammers. As for server load: servers are cheap. An hour of my time vs. adding another dedicated server? I'll take the hit and get a server.
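The post-processing idea mentioned above can be sketched in a few lines of Python. The patterns and log format here are assumptions: a hypothetical blocklist and Apache's combined log format, where the referrer is the second-to-last quoted field.

```python
import re

# Hypothetical spam-referrer patterns (illustrative placeholders,
# not a real blocklist).
SPAM_PATTERNS = [re.compile(p, re.IGNORECASE) for p in (
    r"casino",
    r"texas-holdem",
    r"cheap-pills",
)]

def referrer_is_spam(referrer):
    """Return True if the referrer matches any spam pattern."""
    return any(p.search(referrer) for p in SPAM_PATTERNS)

def filter_log_lines(lines):
    """Yield only log lines whose referrer field looks legitimate.

    Assumes Apache combined log format: the referrer is the
    second-to-last double-quoted field on each line.
    """
    for line in lines:
        quoted = re.findall(r'"([^"]*)"', line)
        referrer = quoted[-2] if len(quoted) >= 2 else ""
        if not referrer_is_spam(referrer):
            yield line
```

Run over a raw access log before feeding it to the stats package, this drops the spam "visits" without touching the server configuration at all.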
