Are XML Sitemap Files a Welcoming Door for Scrapers?

May 7, 2007 - 7:45 am
Filed Under Spam

An excellent WebmasterWorld thread asks whether the new Sitemaps Auto-Discovery, now supported by all four major search engines, is more than just an easy way for search engines to find and index your content. Does it also enable scrapers to easily find and scrape your most important content?
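For context, auto-discovery works by listing the Sitemaps file location on a "Sitemap:" line in robots.txt, which anyone can fetch. Here is a minimal Python sketch of how little effort it takes to read that line. The function name find_sitemap_urls and the example.com robots.txt content are made up for illustration.

```python
# Hypothetical sample robots.txt; the domain and paths are placeholders.
SAMPLE_ROBOTS = """User-agent: *
Disallow: /private/
Sitemap: https://www.example.com/sitemap.xml
"""


def find_sitemap_urls(robots_txt: str) -> list[str]:
    """Return every URL declared on a 'Sitemap:' line of a robots.txt body."""
    urls = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the colon in "https://" is kept.
        field, sep, value = line.strip().partition(":")
        if sep and field.strip().lower() == "sitemap" and value.strip():
            urls.append(value.strip())
    return urls


if __name__ == "__main__":
    print(find_sitemap_urls(SAMPLE_ROBOTS))
```

The same few lines work for a search engine and for a scraper, which is the concern raised in the thread.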

There is no doubt in my mind that having an XML feed helps scrapers do their work. That is part of the debate over whether to offer a full feed or a short feed: full-text feeds let scrapers take all of your content, and take it much more quickly.

Sitemaps.xml files are not full-text feeds. They are just directional data that helps search engines easily find your most important content; a crawler then does the rest of the work. But they help scrapers do the same thing.
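To make "directional data" concrete, here is a rough Python sketch that pulls the URL list out of a Sitemaps file using only the standard library. The sample XML, the example.com URLs, and the extract_urls name are placeholders for illustration; the namespace is the one defined by the sitemaps.org protocol.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org 0.9 protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Hypothetical sample file; the URLs are placeholders.
SAMPLE_SITEMAP = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://www.example.com/</loc></url>
  <url><loc>https://www.example.com/articles/first-post</loc></url>
</urlset>
"""


def extract_urls(sitemap_xml: str) -> list[str]:
    """Return the <loc> value of every <url> entry in a Sitemaps file."""
    # Parse from bytes so the encoding declaration in the file is honored.
    root = ET.fromstring(sitemap_xml.encode("utf-8"))
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]


if __name__ == "__main__":
    for url in extract_urls(SAMPLE_SITEMAP):
        print(url)
```

The file holds URLs only, not page content, so whoever reads it still has to fetch each page afterward.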

The WebmasterWorld thread has some pretty good feedback.

Tedster said:

After all, the sitemap.xml file hands over a list of urls directly to any scraper that wants to make use of it. And excessively scraped sites can struggle in the SERPs. Sounds like a very good reason for cloaking to me.

incrediBILL explains:

Sitemaps.xml is a serious scraping vulnerability which is one reason I don't use it as the sitemap.xml file is a clear path to crawl without hitting any spider traps so it should be cloaked, no doubt about it. Any time you give scrapers a clear path to avoid honey pots and spider traps they'll use it. With that said, the scrapers can simply scrape a search engine first using "site:mydomain.com" to get the equivalent of a sitemap and avoid your spider traps anyway.

That's why even robots.txt should be cloaked because you give the scrapers a list of user agents that you allow to crawl. Assuming you don't also restrict user agents by IP range or reverse DNS, the scrapers just adopt the allowed UA's and slide right through your .htaccess files or other user agent blocking fire walls.
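The reverse DNS check mentioned in that quote can be sketched roughly as follows in Python: look up the hostname for the visiting IP, confirm it ends in a domain you trust, then resolve that hostname forward and make sure it maps back to the same IP. This is only an illustration, not the method anyone in the thread uses. The function name is_verified_crawler and the suffix list are assumptions, and the lookups are passed in as parameters so the logic can be exercised without touching the network.

```python
import socket

# Assumption for illustration: hostname suffixes you choose to trust.
ALLOWED_SUFFIXES = (".googlebot.com", ".google.com")


def is_verified_crawler(
    ip: str,
    allowed_suffixes: tuple[str, ...] = ALLOWED_SUFFIXES,
    reverse_lookup=socket.gethostbyaddr,
    forward_lookup=socket.gethostbyname_ex,
) -> bool:
    """Forward-confirmed reverse DNS check for a visitor claiming to be a crawler."""
    try:
        # Step 1: reverse lookup, IP address -> hostname.
        hostname = reverse_lookup(ip)[0]
    except OSError:
        return False

    # Step 2: the hostname must belong to a trusted crawler domain.
    if not hostname.lower().rstrip(".").endswith(allowed_suffixes):
        return False

    try:
        # Step 3: forward lookup, hostname -> list of IP addresses.
        forward_ips = forward_lookup(hostname)[2]
    except OSError:
        return False

    # Step 4: the original IP must be among the forward results.
    return ip in forward_ips
```

A scraper can copy an allowed user agent string, but it cannot easily make its IP address pass both lookups, which is why the quote treats this as the stronger test.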

The thread continues, but keep in mind that not having a Sitemaps file does not prevent scraping of your content.

Forum discussion at WebmasterWorld.

 
