Bot Herding

Jun 3, 2008 • 5:27 pm | comments (2) by twitter | Filed Under Search Marketing Expo 2008 Seattle
 

Bot Herding - Search spiders and bots are pretty stupid when they come to your web site. If you don't guide them, they'll generate duplicate content issues, miss important pages in favor of junk, fail to realize where existing content has moved, and have other problems. This session looks at some advanced techniques for herding bots, when IP delivery can be white hat, and how search engines view cloaking issues today.

Moderator: Rand Fishkin, Co-Founder and CEO, SEOmoz

Q&A Moderator: Matt McGee, FOR HIRE. Email him. He rocks.

Speakers:

Adam Audette, Founder, AudetteMedia
Hamlet Batista, President, Nemedia S.A.
Nathan Buggia, Lead PM, Live Search Webmaster Center, Microsoft
Priyank Garg, Director Product Management, Yahoo! Search, Yahoo, Inc.
Michael Gray, President, Atlas Web Service
Evan Roseman, Software Engineer, Google
Stephan Spencer, Founder and President, Netconcepts

Michael Gray is up first. Why don't people air condition their mailboxes? Let's say you're buying a new house and you can't afford air conditioning except for a few rooms. When you make more money, you might add more AC units. Once you have even more money, you'll put in central air. But no matter how much money you make, are you ever going to air condition your mailbox? No, because it's not a good use of your resources.

Let's take that analogy to your website. You're building a site but have no money. You don't have a lot of links. You don't have a lot of PageRank. Are you going to send that PR to your contact page? Does it make sense? You're looking to send PR that makes most sense to you.

Think about how much PR and link equity you have. If you're a big company, it doesn't matter. But smaller sites need to worry more about where they send their PageRank. Send it only to pages that drive conversions and sales.

What can you sculpt out? Who wants to rank for their privacy policy, terms of use, or contact us?
- Location pages - unless you are a multi-location business, put your address in the footer and sculpt out the location pages.
- Company bios - sculpt them out, unless you are involved in reputation management.
- Sitewide footer links, advertising stats, and legal pages don't need PageRank either.

How to sculpt:
- Nofollow is quick and easy, but search engines may be scrutinizing it, and people may realize an SEO is involved.
- JavaScript - old school; relies on client-side technology. Bots currently don't crawl it, but that may change in the future.
- Form pages, jump pages, redirect pages - more complex to implement and maintain. Search engines currently don't follow them, but that may change.

Be consistent or you're shooting yourself in the foot. Don't let them in one way and not let them in the other way. Always use robots.txt and meta noindex in conjunction with any PR sculpting. Account for outside links and any spider or search engine quirks.
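As a concrete sketch of that "be consistent" advice, sculpting a page out means sending all three signals at once (the /privacy-policy/ path here is a hypothetical example, not from the session):

```
# robots.txt - keep spiders out of the page entirely
User-agent: *
Disallow: /privacy-policy/

<!-- on the page itself, in case a crawler reaches it via an outside link -->
<meta name="robots" content="noindex">

<!-- and nofollow every internal link pointing at it -->
<a href="/privacy-policy/" rel="nofollow">Privacy Policy</a>
```

The point is that any one signal alone leaves a gap: an outside link can still pull a robots.txt-blocked URL into the index, and a nofollowed link elsewhere on the site does nothing if another page links to it normally.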

Should you do this now or should you wait?
- If you have critical issues, this takes a backseat: put out your fires first, then do it.
- New sites should use this.

Adam Audette talks about 8 arguments against sculpting pagerank with Nofollow.

1. More control? Having a link-level mechanism for spidering is good. The rub is that we don't know enough: we don't know how much PageRank we have on a domain or on a single page, how much PR fluctuates, or how much a single link is worth in PR. We're attempting to control something we can't measure. It's very imprecise.

2. It's a distraction. There are a lot of things we can do to make our page really great.

3. Management headaches: when you have a large site, you may have numerous departments working on the same page. With turnover, unless you have procedure in place, it's confusing. Ask: why are 5 links nofollowed on this page and what department handles this?

4. It's a Band-Aid. People are using nofollow to alleviate symptoms on site and they're not addressing underlying causes like good structure and design.

5. Where's the user? The user experience is not being addressed. Lots of PR to float mediocre pages? Are we giving more power to high authority domains? The web is a level playing field and we're not focusing on what's online. On the web, big business doesn't always dominate.

6. It's open to abuse. People can do things in sneaky ways and we don't know - it can be abused. How are search engines going to react? We really don't know because there's no standard on nofollow. Matt Cutts has stated that "the mechanism is completely general."

The balance of advanced SEO: what's right for your users versus what's right for search engines.

7. Too focused on search engines. This is about creating sites for users, not catering to search engines. Ask: does this help my users? Would I do this if search engines didn't exist? Those questions come straight from the Google guidelines.

8. There is no standard. Every engine treats nofollow differently. It's also way too focused on Google - nofollow targets Google primarily. It's all about Google's PageRank.

You can read his follow-up at www.audettemedia.com/nofollow

Next up is Stephan Spencer. He talks about herding bots away from duplicate content.

Duplicate content is rampant on blogs. Herd bots to the permalink URL, and feed everything else (date archives, category pages, tag pages, the homepage, etc.) a paraphrased or excerpted version.
- Not just the first couple of paragraphs via the MORE tag
- This requires you to revise your main index template theme file

Include a sig line and headshot photo at the bottom of post/article. Link to the original article and post permalink URL.

On e-commerce sites, you have issues with multiple parameters, product descriptions, guided navigation, pagination within categories, and tracking parameters.

Selectively append tracking codes for humans with "white hat cloaking," or use JavaScript to append the codes.

Pagination not only creates many pages that share the same keyword theme; very large categories with thousands of products also result in hundreds of pages of product listings getting crawled, and a lot of those pages may not get indexed.
- Nofollow the "view all" links, or funnel all PageRank through keyword-rich subcategory links. Your mileage will vary. Test all of this.

PageRank leakage?
- If you're using a robots.txt disallow, you're probably leaking PR.
- Robots.txt disallow and meta robots noindex both accumulate and pass PR.
- A meta noindex tag on a master sitemap will deindex that page but still pass PR to the linked sub-sitemap pages.

Rewriting spider-unfriendly URLs: use a URL-rewriting server module/plugin such as mod_rewrite (IIS servers have ISAPI plugins), and recode your scripts to extract variables out of the PATH_INFO part of the URL. Regular expressions in rewrite rules are great. You can also implement 301 redirects using rewrite rules.
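A minimal .htaccess sketch of the mod_rewrite approach described above (the /widgets/ paths and product.php script are hypothetical, not from the slides):

```apache
RewriteEngine On

# Internally rewrite a spider-friendly URL like /widgets/blue-widget
# to the underlying dynamic script (the visitor never sees product.php)
RewriteRule ^widgets/([a-z-]+)$ /product.php?sku=$1 [L]

# 301-redirect requests for the old dynamic URL to the clean one,
# so both bots and links consolidate on a single canonical address
RewriteCond %{QUERY_STRING} ^sku=([a-z-]+)$ [NC]
RewriteRule ^product\.php$ /widgets/%1? [R=301,L]
```

The trailing `?` in the redirect target strips the old query string so the 301 lands on the clean URL.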

Note: [NC] on a RewriteRule means nocase - the match is case-insensitive.

He shows a conditional redirect example. I can't type it out without typoing, but he explains that if there's a PHP session ID, you can redirect for certain bots.
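He didn't share code we could copy, but a common pattern for this (sketched here from the description, not his actual slide) is to 301 known bots away from URLs carrying a PHP session ID:

```apache
RewriteEngine On

# Request claims to be from a major crawler...
RewriteCond %{HTTP_USER_AGENT} (Googlebot|Slurp|msnbot) [NC]
# ...and the URL carries a PHP session ID in the query string
RewriteCond %{QUERY_STRING} PHPSESSID= [NC]
# 301 to the same path with the query string dropped entirely
# (fine when PHPSESSID is the only parameter; otherwise you'd
# need a rule that preserves the remaining parameters)
RewriteRule ^(.*)$ /$1? [R=301,L]
```

This keeps session-ID permutations of the same page from being crawled as duplicates, while human visitors keep their sessions.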

You can drop error pages out of the index. For example, if people link to you with a space after the trailing slash, you can 301 redirect that instead of serving www.domain.com/%20. Prevent indexing of the error page, 301 redirect to something valuable such as your homepage, and dynamically display an error message.
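The trailing-space case he mentions can be handled with a rewrite rule along these lines (a sketch, not from the presentation):

```apache
RewriteEngine On

# A link pasted as "www.domain.com/some-page/ " arrives as /some-page/%20;
# mod_rewrite sees the decoded path, so match trailing whitespace and
# 301 back to the canonical URL instead of serving an error page
RewriteRule ^(.*)/\s+$ /$1/ [R=301,L]
```

Any link equity pointed at the malformed URL then flows to the real page rather than a 404.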

This presentation is at www.netconcepts.com/learn/bot-herding.ppt. You can read it for yourself! w00t.

Next up is Hamlet Batista: white hat cloaking - six practical applications. Go cloak, he says.
- There's good cloaking and bad cloaking. It's all about your intention.
- Always weigh the risks versus the rewards of cloaking.
- Ask permission - or don't call it cloaking.
- Mention "IP delivery," not cloaking.

When should we be cloaking? We're talking about white hat cloaking. We're going to talk about practical scenarios and alternatives. How do we cloak? How can cloaking be detected? What are the risks and next steps?

Practical cloaking:
- Content accessibility when you're stuck with search-unfriendly CMSes
- Rich media sites
- Content behind forms
- Membership sites - free and paid content
- Site structure improvements - an alternative to PR sculpting via nofollow
- Geolocation and IP delivery
- Multivariate testing

Practical Scenario #1: proprietary website management systems that are not search engine friendly. Users see dynamic URLs, session IDs, canonical issues, and missing titles and meta descriptions. The search engines see something else: SE-friendly URLs without session IDs, consistent naming conventions, and automatically generated titles and meta descriptions.

Practical Scenario #2: a Flash/video-intensive website. Users see Flash. Search engines see text representations of graphical images and elements, text representations of all motion/video elements, and text transcriptions of all audio in the rich media content.

Practical Scenario #3: membership sites like SEOmoz Pro (hello, I'm wearing an SEOmoz shirt today) - Search users see snippets of premium content on the SERPs, and when they land on the page they're faced with a registration form. Members see the same content that search engine robots see.

Practical Scenario #4: sites requiring massive site structure changes to improve index penetration. Regular users follow a link structure designed for ease of navigation. Search robots follow a link structure designed for ease of crawling and deeper index penetration of the most important content. He explains that it's like a train on a rail - sometimes the path will change. In this case, there's a path for the best penetration of the index.

Practical Scenario #5: geolocation. Regular users see content tailored to their geographical area. The robots should be okay with this.

Practical Scenario #6: split testing. Regular users see the content experiment alternatives; search robots see the same content consistently.

How do we cloak?
- Robot detection by HTTP cookie test
- Robot detection by JavaScript/CSS test
- Robot detection by visitor behavior: computers are predictable. They follow links in a certain way and spend a certain amount of time on links.
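One common, safer way to identify search robots (my sketch, not from his slides) combines a cheap user-agent check with forward-confirmed reverse DNS, since a user-agent string alone is trivial to spoof:

```python
import re
import socket

# Claimed-crawler user agents we care about (hypothetical shortlist)
BOT_PATTERNS = re.compile(r"googlebot|slurp|msnbot|bingbot", re.IGNORECASE)

def looks_like_bot(user_agent: str) -> bool:
    """Cheap first pass: does the user agent claim to be a crawler?"""
    return bool(BOT_PATTERNS.search(user_agent or ""))

def verified_bot(ip: str, suffixes=(".googlebot.com", ".google.com")) -> bool:
    """Forward-confirmed reverse DNS: the visitor's IP must reverse-resolve
    to a hostname inside the engine's domain, and that hostname must
    resolve forward to the same IP again."""
    try:
        host, _, _ = socket.gethostbyaddr(ip)
        if not host.endswith(suffixes):
            return False
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

A request would only get the bot-targeted version of a page when both checks pass, which is much harder to fake than a cookie or JavaScript test alone.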

He also has another 3000 slides that he can't show us because we're out of time. That is all. Thank you for reading.

Comments:

Jaan Kanellis

06/06/2008 12:45 am

"Let's take that analogy to your website. You're building a site but have no money. You don't have a lot of links. You don't have a lot of PageRank. Are you going to send that PR to your contact page? Does it make sense? You're looking to send PR that makes most sense to you." OK, now you're assuming that PR is tangible, whoops. "Next up is Stephan Spencer. He talks about herding bots away from duplicate content." How about not using a band-aid and just getting rid of the dup content or blocking it through robots.txt?

Andy Beard

06/07/2008 03:17 pm

Here are a few conflicts.

From Michael: "Don't let them in one way and not let them in the other way. Always use robots.txt and meta noindex in conjunction with any PR sculpting."

From Stephan: "Requires you to revise your main index template theme file" - most good themes have separate templates for these, and there are plugins that can change behavior. Also, why use the MORE tag rather than custom excerpts? Custom excerpts can be created automatically and then edited.

"Robots.txt disallow and meta robots noindex both accumulate and pass PR" - robots.txt disallow doesn't pass PageRank; the juice teleports away to a random page on the net.

Jaan, the reason you don't use robots.txt for dup content is because it is still a juice leak.
