DuckDuckGo Not Respecting Robots.txt Directives?

Nov 27, 2012 • 9:07 am | comments (15) | Filed Under Other Search Engines
 

There is an interesting thread at Hacker News about DuckDuckGo being toyed with by Google. That part honestly doesn't interest me as much as the core SEO topic in the thread.

Google's Matt Cutts is very active on Hacker News and he questioned Gabriel Weinberg, the founder of DuckDuckGo, about the DuckDuckGo spiders and crawlers. Some folks are asking whether DuckDuckGo's spider, aka DuckDuckBot, respects the robots.txt directives.

Some noticed DuckDuckGo crawling from an IP range it owns without declaring the DuckDuckBot user agent, and thus not respecting the robots.txt directives set by the webmaster.
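To see why an undeclared user agent matters, recall that robots.txt rules are grouped by user agent, so a request that does not identify itself as DuckDuckBot is not matched by rules written for DuckDuckBot. Here is a minimal sketch using Python's standard urllib.robotparser. The robots.txt contents, the example.com URL, and the is_allowed helper are hypothetical, chosen only for illustration, and this is not DuckDuckGo's code.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: DuckDuckBot is kept out of /private/,
# while every other user agent is left unrestricted.
ROBOTS_TXT = """\
User-agent: DuckDuckBot
Disallow: /private/

User-agent: *
Disallow:
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return True if the hypothetical robots.txt lets user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


url = "https://example.com/private/page.html"

# A request that declares itself as DuckDuckBot matches the DuckDuckBot group.
print(is_allowed("DuckDuckBot", url))

# A request with a generic browser-style user agent falls under the "*" group.
print(is_allowed("Mozilla/5.0", url))
```

In this sketch the same URL is blocked for DuckDuckBot but allowed for the generic user agent, which is why a webmaster's bot-specific directives have no effect on a request that does not declare that bot's name.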

Matt Cutts asked Gabriel:

Gabriel, does DuckDuckGo's crawler have a distinct user agent? Can you talk more about how DuckDuckGo observes/respects robots.txt?

I emailed Gabriel and he explained that in this case, they are only checking for parked domains. He wrote, "what they're seeing there is not a crawler but a parked domain checker." He added, "it doesn't crawl through a site. In fact, it only checks the front page." When I asked why they can't do this using the DuckDuckBot user agent, he said, "some parked domain networks show different things based on the user agent, and we want to find out what is really shown to the user."

He also added:

We don't believe it needs to be identified as anything else as it only makes one request very infrequently and doesn't index any information.

Forum discussion at Hacker News.

 

Comments:

shillshillshill

11/27/2012 02:35 pm

Barry Schwarz, Matt Cutts' right hand. Matt says a word that bashes a Google competitor and Barry along with Danny Sullivan and other but kissers gang up.

Barry Schwartz

11/27/2012 02:37 pm

Love you.

shillshillshill

11/27/2012 03:22 pm

So Barry why is it news? Gabriel denied and explained it before you published it. Why publish it? Did Matt Cutts "suggest" this as a story?

Barry Schwartz

11/27/2012 03:26 pm

It is news because it explains how duckduckgo's bots work.

Jim Christian

11/27/2012 04:42 pm

Jeez Barry, maybe you should just stop blogging all together and we can search for the answers ourselves over countless blogs and forums. Nope... no value in getting information culminated in a single location.

G

11/27/2012 04:56 pm

The glorious irony of Google having a pop at a company about privacy...

Dave Collins

11/27/2012 06:09 pm

Oh lordy. I don't usually weigh in on these things, but if you don't like what Barry says, don't read what he writes. Simple really.

Barry Schwartz

11/27/2012 06:22 pm

I agree, stop reading.

aybecker

11/27/2012 07:02 pm

Matt Cutts asked him that question, but the real question Matt Cutts should be answering is how Google handles indexed/cached versions of pages restricted in robots?

Ask it Barry

11/27/2012 07:29 pm

Future headline: Google manipulates Search after Panda and Penguin to increase ad clicks by 120%? Don't hold your breath, Matt Cutts, the conman gives marching orders to Barry.

Alan

11/28/2012 12:40 am

I agree stop reading. While I don't always agree with Barry I don't think he is a shill (until I started reading this blog I didn't know what that word meant). He is just reporting stuff that he thinks we may find interesting. Quite frankly Barry on this blog is a lot more willing to take a dig at Google than anyone on searchengineland and he is willing to leave his comments open no matter how negative they get. Try making these sorts of comments at SEL. They will be shut down faster than a drug deal in a police station.

joeyoungblood

11/28/2012 01:38 am

You mean about respecting Robots.txt. Google will index urls blocked by Robots.txt though they claim not to crawl the content.

Josh

11/28/2012 07:46 pm

As much as I hate trolls and stupid comments, if Barry (or any other webmaster) did not want comments, be it positive or negative, then the whole log system along with comments are voided. Let the haters hate and that's that. Negative attention is better than no attention - and that's one more Advert click for Barry.

Howard Drive

03/01/2013 10:57 am

Duck Duck go i think its very old crawler or something not heard.

Jugar Jugar

10/18/2013 10:53 am

I really do not understand why there are hackers against such activities. My site that collapsed because they do not know what the cause. It's frustrating.
