Beating the Scrapers To Google

Nov 27, 2012 • 8:51 am | comments (11) | Filed Under Google Search Engine Optimization
 

A typical issue for a new or not-yet-popular content site is having its content scraped by a site that looks more authoritative. When that happens, Google may rank your content on a site that is not yours: yes, the site that stole it.

Google has a scraper algorithm, but it can end up hurting original content owners in the situation above. So what can you do?

A WebmasterWorld thread is having a conversation on just that topic. Members are discussing ways to help Google spot the content on your site before it spots it on a scraper site. Some of those techniques include:

  • Pinging Google through services like PubSubHubbub
  • Posting the content on social networking sites like Google+, Twitter and Facebook
  • Making the page live without the content and then, once the search engine has crawled it, launching the content on the page in the hope that the spiders come back sooner
  • Using the Fetch as Googlebot feature in Webmaster Tools
  • Trying Google Blog Search's pinging service
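
As a rough illustration of the first technique, here is a minimal sketch of a PubSubHubbub publish ping in Python. Under the 0.3 spec, a publisher notifies a hub by POSTing the form fields hub.mode=publish and hub.url to it. The hub address and the feed URL below are assumptions for illustration only; your feed has to advertise the same hub for the ping to mean anything, and the helper name build_publish_ping is hypothetical.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: Google's public reference hub. Use whichever hub your feed declares.
HUB_URL = "https://pubsubhubbub.appspot.com/"


def build_publish_ping(feed_url, hub_url=HUB_URL):
    """Build the POST request that tells a hub the feed has new content."""
    body = urlencode({"hub.mode": "publish", "hub.url": feed_url}).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


# Hypothetical feed URL, for illustration only.
ping = build_publish_ping("https://example.com/feed.xml")

# The request is only built here, not sent. To send it, pass it to
# urllib.request.urlopen(ping) right after publishing a new post.
```

Sending the ping right after a post goes live gives the hub, and anything subscribed to it, the earliest possible notice. Whether that is enough to beat a scraper is not guaranteed.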

Share your ideas on this common issue.

Forum discussion at WebmasterWorld.

 

Comments:

Guest

11/27/2012 02:30 pm

Also, rel="author" could help protect your content.

Lyndon NA

11/27/2012 02:31 pm

Erm - that list appears in the Google Webmaster Forums several times (in various forms (remember, I haven't been there in over a year)), and also on G+ several times too. I know - I'm the one that posted it :D

Henley Wing

11/27/2012 03:04 pm

Since most scrapers are just bots, you can write a simple script that blocks user agents that resemble bots. How? Look at their IP, and see if it's from a common web hosting service like AWS, or Rackspace. Pretty naive, but it should do the trick and shouldn't block the reputable bots like Yahoo, Googlebot, and Bing. If you want to get fancy, go scrape all the common web hosts from ARIN, put them in a hashmap, and compare the incoming IP to these.
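
A minimal sketch of the check described in this comment, in Python. It is not the commenter's script: it matches against CIDR ranges rather than a hashmap of single IPs, because hosting providers are allocated address ranges. The ranges below are reserved documentation blocks used as placeholders, not real AWS or Rackspace allocations, and the function name is_hosting_ip is hypothetical.

```python
import ipaddress

# Hypothetical ranges for illustration (RFC 5737 documentation blocks).
# A real list would be built from the ranges hosting providers publish.
HOSTING_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_hosting_ip(remote_addr, networks=HOSTING_NETWORKS):
    """Return True if remote_addr falls inside any listed hosting range."""
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Malformed address: do not flag it here, leave it to other checks.
        return False
    return any(addr in net for net in networks)


# Example: decide whether a request should be refused.
if is_hosting_ip("192.0.2.15"):
    print("request comes from a listed hosting range")
```

Requests from legitimate search engine crawlers would still need to be exempted separately before blocking anything.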

Lyndon NA

11/27/2012 03:41 pm

Plenty of IP/UA block lists out there, and even premade scripts to help deal with them. Adding in your own customised ones is advisable ... but to the common layperson (many of whom don't even realise server access logs exist), it's a bit of a culture shock ... and when they see the logs, panic usually occurs. BadBehaviour and ZBBlock are worth looking at, as are the 4G and 5G block lists from Perishable Press.

joeyoungblood

11/28/2012 01:43 am

rel=author is a joke. How many authors are actually on Google+? If Google issued a set of code for websites to implement, rel=author would be viable, but since they are using it to force G+, it will only get as far as the SEOs and a handful of early-adopter techies want to take it.

Tiggerito

11/28/2012 05:38 am

Slight correction, I believe it's called PubSubHubbub http://pubsubhubbub.googlecode.com/svn/trunk/pubsubhubbub-core-0.3.html

MonkeyMarcel

11/28/2012 01:34 pm

Yeah, in classifieds we have the aggregators taking our listings and then ranking really well while not creating one piece of original content. :-(

Kim

11/28/2012 04:55 pm

This happened to my site too, and the site that stole my content outranks me since they have a PageRank of 9. I cancelled my WordPress feed, which I guess is how they were stealing my content. They now show a 404 error for my content. I don't know if it did any good, but I am hoping that Google will eventually start showing my content after the other site is reindexed.

Mozalami

11/28/2012 05:04 pm

Great advice, thanks @barry. Will try to use Ping more often in future.

Satinder Singh

11/29/2012 09:25 am

For fast indexing within 10 hours, submit your URL here: https://www.google.com/webmasters/tools/submit-url

anggriawan

08/22/2013 02:15 am

I thank you in advance for visiting our website. there are some questions I can if I go up on the website in the top position? ask about science - its science
