
IHY: blocking referrer spam with htaccess

"Referrer Spam - Why you shouldn't publish your Web logs," an old thread at the Best Practices Search Engine Forums originally posted by Alan Perkins, was recently revived by IHY member Dave B, who posted an htaccess file that looks like it could make a big dent in referrer spam for web sites that implement it.

Referrer spam is the practice of sending fake visits with fake referrers to web sites so that the spammer's URL appears in their log files. This is done in the hope that search engines will find those links in published logs or stats pages and boost the spamming site's rankings.

It's not just a search engine spam problem, though: referrer spam can also interfere with traffic analysis. I checked my logs against Dave's htaccess list, and it looks like about 95% of the fake traffic would be blocked. Nice.
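A check like this can be approximated with a short script. The following is only a minimal sketch: it assumes logs in Apache's Combined Log Format, and the keyword pattern is a hypothetical stand-in, since Dave's actual list is not reproduced here. It reports the share of requests carrying a referrer whose referrer matches the pattern.

```python
import re

# Hypothetical keywords for illustration; NOT the list from the forum thread.
SPAM_PATTERN = re.compile(r"(poker|casino|pills)", re.IGNORECASE)

# Assumes Combined Log Format: ... "request" status bytes "referrer" "user-agent"
LOG_LINE = re.compile(r'"[^"]*" \d{3} \S+ "(?P<referrer>[^"]*)" "[^"]*"$')

def referrer_of(line):
    """Return the referrer field of a log line, or None if it does not parse."""
    match = LOG_LINE.search(line.strip())
    return match.group("referrer") if match else None

def blocked_share(lines):
    """Fraction of referrer-bearing requests whose referrer matches the pattern."""
    referrers = [r for r in map(referrer_of, lines) if r and r != "-"]
    if not referrers:
        return 0.0
    hits = sum(1 for r in referrers if SPAM_PATTERN.search(r))
    return hits / len(referrers)

# Made-up sample entries using reserved example addresses and domains.
SAMPLE = [
    '192.0.2.1 - - [15/Apr/2006:10:45:00 -0500] "GET / HTTP/1.1" 200 512 '
    '"http://cheap-poker.example/" "Mozilla/4.0"',
    '192.0.2.2 - - [15/Apr/2006:10:46:00 -0500] "GET /post HTTP/1.1" 200 1024 '
    '"http://www.example.org/links" "Mozilla/5.0"',
]

if __name__ == "__main__":
    print(f"{blocked_share(SAMPLE):.0%} of referred requests match the pattern")
```

Note that this measures matches among all referred requests, so judging how much of the fake traffic is caught still means eyeballing what is left over.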




Posted by DanThies at April 15, 2006 10:45 AM | Comments (2)

Comments

That .htaccess "filtering" will demand a considerable amount of processing from your server. And just like email subjects, they could get around the "keyword" filtering by using modified spellings or misspellings.

Even if you do publish your Web Stats and decide NOT to make it a password protected directory, you could just as easily disallow it from being spidered in robots.txt.

And with some highly developed Client Side trackers - you can ban IP addresses from being calculated as a visit, so as to focus on the REAL visits.

 

Search Engines Web, I see you posting on blogs a lot. Hope that's working for you.

There's more than one objective here. Preventing someone from getting ranked for some porn search term I don't care about isn't high on my list. Having clean log files is, as is preventing competitors from reading my stats.

Misspellings would defeat the htaccess filtering, but would also largely defeat the purpose of spamming referrer logs with keyword-stuffed domains. :D There's no reason for a spammer to do that, since there are always more open stats pages to exploit.

Using robots.txt would keep the regular search engines from spidering log files, but you have no idea how much information you're giving away to competitors if your stats can be found by just anyone. Password protection is the minimum needed to avoid that.
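To make the robots.txt point concrete, here is a small sketch using Python's standard `urllib.robotparser`. The `/stats/` path, the `example.com` URLs, and the `ExampleBot` agent name are assumptions made up for illustration. It shows what a well-behaved crawler would conclude from such a rule; nothing in it stops a person with a browser from opening the same URL.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking all crawlers to stay out of /stats/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /stats/
"""

def crawler_may_fetch(url, agent="ExampleBot"):
    """Return True if a crawler that honors ROBOTS_TXT would fetch the URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)

if __name__ == "__main__":
    for url in ("http://example.com/stats/index.html",
                "http://example.com/index.html"):
        print(url, "->", "allowed" if crawler_may_fetch(url) else "disallowed")
```

The rule is only advisory, which is why it helps with search engines but not with competitors.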

I'm sure there are plenty of ways to do this, including post-processing of log files with the same pattern matching, to eliminate "visits" from logfile spammers.
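Post-processing along those lines might look roughly like this. It is a sketch under assumptions, not a drop-in tool: it assumes Combined Log Format, where the referrer and user agent are the last two quoted fields, and the keyword pattern is a hypothetical placeholder rather than the list from the thread.

```python
import re

# Hypothetical keywords for illustration; substitute your own pattern list.
SPAM_PATTERN = re.compile(r"(poker|casino|pills)", re.IGNORECASE)

# Assumes the line ends with: "referrer" "user-agent"
REFERRER_FIELD = re.compile(r'"([^"]*)" "[^"]*"\s*$')

def is_spam(line):
    """True if the line's referrer field matches the spam pattern."""
    match = REFERRER_FIELD.search(line)
    return bool(match and SPAM_PATTERN.search(match.group(1)))

def clean_log(lines):
    """Yield only the log lines whose referrer does not look like spam."""
    for line in lines:
        if not is_spam(line):
            yield line

# Made-up sample entries using reserved example addresses and domains.
SAMPLE = [
    '192.0.2.1 - - [15/Apr/2006:10:45:00 -0500] "GET / HTTP/1.1" 200 512 '
    '"http://cheap-poker.example/" "Mozilla/4.0"',
    '192.0.2.2 - - [15/Apr/2006:10:46:00 -0500] "GET /post HTTP/1.1" 200 1024 '
    '"http://www.example.org/links" "Mozilla/5.0"',
]

if __name__ == "__main__":
    for line in clean_log(SAMPLE):
        print(line)
```

Because `clean_log` takes any iterable of lines, it can be fed an open log file and its output written to a new file before the stats package runs.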

When you're talking about server load, though, geez, servers are cheap - an hour of my time vs. adding another dedicated server, I'll take the hit and get a server.

 
