Meet The Crawlers

Mar 2, 2006 • 12:09 pm | comments (6) by twitter | Filed Under Search Engine Strategies 2006 New York
 


Session description:

"Representatives from major crawler-based search engines cover how to submit and feed them content, with plenty of Q&A time to cover issues related to ranking well and being indexed."

Moderated by Danny Sullivan. Speakers: Matt Cutts from Google, Kaushal Kurapati from Ask Jeeves, a representative from Yahoo! (Tim Mayer was not present), and Ramez Naam from MSN Search.

Audience Question: This is for Google and Yahoo: My site has over 500,000 products. What is the difference in the number of pages crawled and the number of mentions? For Yahoo we only have 500 results. Why is there a difference?

Yahoo: Use the Site Explorer tool. If Site Explorer only shows 500 pages, then there is an issue. Google: Every search engine crawls in a different way. There is a difference between mentions and indexed pages: in some instances we know about a URL but have not crawled it. Your site may not have enough PageRank for us to do a deep crawl. Yahoo: Site Explorer offers an option to provide an RSS feed of your site's URLs.
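To illustrate Yahoo's suggestion of providing an RSS feed of a site's URLs, here is a minimal sketch in Python. It assumes a plain RSS 2.0 feed with one item per page is an acceptable format; the function name build_url_feed, the site title, and the example.com URLs are hypothetical and not taken from the session.

```python
import xml.etree.ElementTree as ET


def build_url_feed(site_title, site_url, page_urls):
    """Build a minimal RSS 2.0 document listing one <item> per page URL."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    ET.SubElement(channel, "title").text = site_title
    ET.SubElement(channel, "link").text = site_url
    ET.SubElement(channel, "description").text = f"URL list for {site_title}"
    for url in page_urls:
        item = ET.SubElement(channel, "item")
        # The URL doubles as the title here; a real feed would use page titles.
        ET.SubElement(item, "title").text = url
        ET.SubElement(item, "link").text = url
    return ET.tostring(rss, encoding="unicode")


# Hypothetical product pages, for illustration only.
feed = build_url_feed(
    "Example Store",
    "http://www.example.com/",
    [
        "http://www.example.com/products/1",
        "http://www.example.com/products/2",
    ],
)
print(feed)
```

For a catalog with hundreds of thousands of products, the same approach would probably need to be split across several feeds.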

Audience Question: Disney search marketing manager. What are the search engines' capabilities for crawling Flash? What are the pitfalls?

MSN: Flash is difficult, so it will have an effect on the findability of your site. Yahoo: Flash is in the pipeline; you should see some innovations coming soon. Google: We used to parse SWF files, but Flash and Ajax can be problematic. They break functions of your browser. My recommendation is to provide a text version of your site along with the Flash version. Yahoo: Cloaking was mentioned in a previous panel as a method of getting around all-Flash sites. Don't. Danny: A Flash page is like handing out a blank business card. Shows an example of a jazz singer's site (which he saw last night) that uses Flash with text.

Audience Question: Do you only crawl links found on pages or does your algorithm use queries from the toolbar? MSN: It may. Google: If you're trying to use the toolbar to get indexed, you should spend your time doing something else. Other ways to get into Google without inbound links: site submission and Google Sitemaps. Audience member: I don't want certain pages indexed. Panel: Add a robots.txt file excluding those pages and also put a password on that area. Google: Gives an example of how the Alexa toolbar has been spoofed and used to spam Matt's "related sites" info on the Alexa listing for his blog. Yahoo/Google: How many people would be concerned if anonymous toolbar data was used by search engines? Most of the audience raise their hands.
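The panel's robots.txt advice can be checked locally before relying on it. The sketch below uses Python's standard urllib.robotparser to test whether a given URL would be excluded. The robots.txt rules, the /private/ and /checkout/ paths, the "ExampleBot" user agent, and the helper name is_allowed are all hypothetical. It only shows how a compliant crawler reads the file; it is not a substitute for the password protection the panel also recommended.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that excludes two areas of a site for all crawlers.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /checkout/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the rules in robots_txt let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/private/report.html"))
print(is_allowed(ROBOTS_TXT, "ExampleBot", "http://www.example.com/products/widget.html"))
```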

Danny: Brings up Flash issue and points to a thread on Search Engine Watch forum.

Audience Question: Is there any truth that search engines ignore robots.txt? MSN: No, we comply.

Audience Question: Asks about submitting to MSN. Is the RSS feed URL submission for MSN and Yahoo only for new content? MSN: You can submit multiple URLs to MSN and that is not seen as a spam activity. MSN also now supports URL submissions using an RSS feed. Yahoo: If URLs are repeated in different RSS feeds they will just be revisited.

Audience Question: We use dynamic URLs to control page behaviors and run into problems where the same product is indexed under different URLs with different parameters. Do you have any tips on what we can do to avoid this? Yahoo: Search engines are getting better at indexing dynamic content but have to be careful of spider traps. Our suggestion would be to use the URL submission tools available, such as Google Sitemaps or the Yahoo and MSN URL submissions using an RSS feed. MSN: You can use robots.txt to block everything but the canonical version of your page URLs.
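One way to decide which version of a dynamic URL to submit is to reduce every variant to a single canonical form. The sketch below is an assumption about how a site owner might do that, not something the panelists described: it keeps only the query parameters that change the content and drops the rest. The parameter name product_id, the set CONTENT_PARAMS, and the function canonical_url are hypothetical.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical: the only query parameter that changes which product is shown.
CONTENT_PARAMS = {"product_id"}


def canonical_url(url):
    """Drop behavior-only parameters and sort the rest into a stable order."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key in CONTENT_PARAMS
    )
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), "")
    )


variants = [
    "http://www.example.com/item?product_id=42&sort=price",
    "http://www.Example.com/item?session=abc&product_id=42",
]
# Both variants collapse to one URL, which is the one to submit.
print({canonical_url(u) for u in variants})
```

The non-canonical variants could then be the ones excluded through robots.txt, along the lines MSN suggested.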

Danny asks the search engines to get on the same page with robots.txt. Google: The only thing we don't support is crawl delay. Many webmasters used that parameter incorrectly. Yahoo: We try hard to adhere to the standard. Google: Google Sitemaps offers a robots.txt feedback tool. Danny: That's a great tool and I wish all the engines would do the same.
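For readers unfamiliar with the crawl delay parameter mentioned here, the sketch below shows how a crawler that chooses to honor it might read the value, using Python's standard urllib.robotparser (its crawl_delay method is available from Python 3.6). Crawl-delay is a non-standard extension, so support varies by engine, as the panel noted. The robots.txt content, the "ExampleBot" user agent, the helper name polite_delay, and the one-second fallback are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt asking crawlers to wait 10 seconds between requests.
ROBOTS_WITH_DELAY = """\
User-agent: *
Crawl-delay: 10
Disallow: /search/
"""


def polite_delay(robots_txt, user_agent, default_seconds=1.0):
    """Return the requested crawl delay in seconds, or a fallback if none is set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


print(polite_delay(ROBOTS_WITH_DELAY, "ExampleBot"))
```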

Audience Question: How does the rate at which pages that link to you get updated affect your site getting crawled? MSN: The refresh rate of pages pointing to you doesn't factor in. What matters is the freshness of your own site. Yahoo: Inbound links are more important for discovery. Google: The rate of change of source pages is very much a secondary consideration. MSN: Regarding links: links that look natural and provide value are the ones we use. Also, instead of buying links, think about creating unique content that provides value, and people will link to it naturally.

Audience Question: We have a competitor that builds duplicate copies of his ecommerce site and the crawlers don't seem to be able to see this. We're thinking of doing the same thing if the crawlers aren't going to do anything about it. Yahoo: The algorithms are continuously being improved and in some cases we need to look at situations individually. Google: Agrees, feel free to provide a specific example. Fill out a spam report and give an example. We do the best we can, but we need feedback. Ask: We try to take care of these situations when we discover them.

Audience Question: How does server response time affect crawling? MSN: A slow response time can be perceived as a down web site. It may cause us to crawl you more slowly. Yahoo: We'll typically revisit the site after a few days.

Danny: Is MSN going to do anything like Site Explorer? MSN: We're very interested in improving what we can offer webmasters and will be developing tools of that nature. Yahoo: We are adding new features to Site Explorer. Ask: As we ramp up with processes and resources to deal with the queue that builds up.

Audience Question: Some major news publications will list a URL but not create a hyperlink to a site. Do you take that information into account? Ask: We do assign credit for newly found sites. If there is already a link to the site, additional links to the same site are not considered. The URL as text is not treated as a link. Yahoo: At this point we do not treat a text URL as a link. Google: That delves into the secret sauce. Think of coverage in major publications as a traffic source but not as a way of getting link popularity. Yahoo: Y!Q creates links automatically to popular resources.

Google: Matt shows his Google Sitemaps data using the new version of Sitemaps. For some reason Matt's blog is #2 for a phrase like "free porn" on Google Local. Shows a variety of information on his blog.

Yahoo: Points out answers.yahoo.com and is looking for feedback.

 

Comments:

Nadir

03/02/2006 06:20 pm

I would have loved an answer from Google to this question about robots.txt: "Is there any truth that search engines ignore robots.txt?" Many folks still believe that Googlebot sometimes goes to folders he's not allowed to.

SEO Power

03/03/2006 09:00 am

I asked Matt a question, but did not get any answer, and now there is no information in the above-mentioned day 4 discussion. Q: I think "Meet The Crawlers" will have something to filter blog/comment/guestbook/subdomain spamming. I have lost hope for optimization for my industry (you would not like the name). It hurts people like us, who are working for clients in a genuine way. Should we assume that spammers will always beat people like Matt & Tim?

Martin

03/03/2006 11:34 am

I would have asked Google how regularly they check their spam reports (I personally think it's just fake, because nothing has ever happened ;-)

Brandon Hopkins

03/03/2006 08:07 pm

Thanks for that recap. Matt linked you from his blog and I found it that way. Brandon Hopkins

chris

03/05/2006 05:16 pm

Were these really all the questions? Or were there more? Where can I find them? Thanks for your report.

simla

01/06/2009 12:40 pm

I have uploaded a sitemap for my website in Google Webmaster Tools. My sitemap can be found at www.example.com/sitemap.xml. There are 1938 URLs in my sitemap. Now the sitemap status shows only 6 indexed URLs. Why does Google not index my sitemap fully? Does anyone know how to solve this problem?
