Duplicate Content & Multiple Site Issues

Dec 5, 2006 • 3:03 pm | comments (1) | Filed Under Search Engine Strategies 2006 Chicago
 

Jon Glick from Become.com is up first, giving Beyond Ink's presentation. Duplicate content can be multiple homepages on different URLs, different links pointing to several different URLs, or dynamic URLs. Search engines want one copy of your content, not duplicate pages. Dynamic URLs can confuse the robots, so be careful with them. Mirror sites, meaning two different domains with the exact same content, are also dup content. Google and other engines will pick only the one domain they think is best.

Instead, choose one canonical domain yourself and link all internal pages on the site to it. Exclude landing pages used for tracking from search engines with robots.txt. Use 301 redirects to point all your domains to a single domain. Use server-side redirects. 302 redirects are temporary, so use them only for content that is going to change, such as an event schedule. He shows that you can contact Google through a form. Yahoo has a similar form, and they actually tell you if you are banned. (Um, so does Google.)
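To make the "point all your domains to a single domain" advice concrete, here is a minimal Python sketch of the decision a server-side redirect makes. The host names and the `canonical_redirect` function are made up for illustration; in practice this logic usually lives in the web server configuration rather than in application code.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical values: the one canonical host, and the other hosts that mirror it.
CANONICAL_HOST = "www.example.com"
ALIAS_HOSTS = {"example.com", "example.net", "www.example.net"}


def canonical_redirect(url):
    """Return (301, target_url) if url is on an alias host, otherwise None."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in ALIAS_HOSTS:
        # Keep the path and query string, swap only the host, drop any fragment.
        target = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, "")
        )
        return 301, target
    return None  # already on the canonical host, serve the page normally


print(canonical_redirect("http://example.com/widgets?id=3"))
print(canonical_redirect("http://www.example.com/widgets?id=3"))
```

The status code is 301 (permanent) and not 302 because the move to the canonical domain is not expected to change.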

Shari Thurow from Grantastic Designs is up next. What is duplicate content? Is it 65%? It is not a percentage, it is a resemblance, Shari said. Search engines do not want it in their index because it slows down the information retrieval process, and searchers do not want to get the same content and results over and over again. She explains clustering as a way Google and other engines group results together. Types of dup content filters:

- Content properties: they strip the boilerplate of the page (nav, footer, etc.)
- Linkage properties, both inbound and outbound
- Content evolution (65% of web content will not change on a weekly basis, 0.8% of web content will change completely on a weekly basis, average page mutation)
- Host name resolution
- Shingle comparison (web pages have a unique signature or fingerprint; break the content down into sets of word patterns, order doesn't matter)
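The shingle idea is easier to see in code. Below is a rough Python sketch, assuming word-level shingles of four words and a Jaccard-style overlap score. The session did not say what shingle size or scoring the engines actually use, so `k=4`, `shingles` and `resemblance` are illustrative choices only.

```python
import re


def shingles(text, k=4):
    """Return the set of k-word shingles (overlapping word runs) in text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Shared shingles divided by all distinct shingles (Jaccard overlap)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


page_a = "Search engines want one copy of your content, not duplicate pages."
page_b = "Search engines want one copy of your content, not several mirror pages."
print(resemblance(page_a, page_b))
```

Because the shingles are compared as sets, it does not matter where on the page a shingle appears, which is one reading of "order doesn't matter". The score is a resemblance between 0 and 1 and not a fixed cutoff like 65%.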

Use the robots.txt file to exclude duplicate pages. Some duplicate content is considered spam and some is not. She shows some examples. If people are stealing your content, hire an attorney to sue them. Copyscape is a good tool, as is archive.org; also copyright your material. Use DMCA reporting at Google, Yahoo, Ask and MSN.
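As a quick illustration of the robots.txt advice, the Python sketch below parses a small robots.txt with the standard library's `urllib.robotparser` and checks which paths a crawler may fetch. The `/print/` and `/landing/` paths are assumed names for printer-friendly duplicates and tracking landing pages, not something shown in the session.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: /print/ and /landing/ are assumed paths for
# printer-friendly duplicates and tracking landing pages.
ROBOTS_TXT = """\
User-agent: *
Disallow: /print/
Disallow: /landing/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def is_crawlable(path, agent="*"):
    """True if the rules above allow the given user agent to fetch path."""
    return parser.can_fetch(agent, path)


for path in ("/articles/duplicate-content", "/print/duplicate-content", "/landing/promo"):
    print(path, is_crawlable(path))
```

Checking the rules this way before publishing them is a cheap guard against accidentally blocking the one copy you do want crawled.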

Mikkel deMib Svendsen from deMib.com is next. He is going to talk mostly about "identical issues." Common issues include www vs. non-www, session IDs, URL rewriting, many-to-one problems in forums, sort order parameters and breadcrumb navigation, but the list of dup issues is almost infinite.

- WWW vs. non-WWW used to be an issue; now it is not. Still, if some are linking to your non-www and some to your www, that is not so good, so use a 301 redirect.
- Session IDs can be a nightmare. One site had 200,000 versions of the same exact page in Yahoo Search. The solution is to dump the session info into a cookie and not put it in the URL.
- Customize the permalink structure for your blog software. Sometimes the old URLs still work, so now you have two URLs that work exactly the same way. It is a huge issue with many open source sites, but (I personally think) Google handles this well. Make sure to block or 301 those URLs. WordPress has a plugin that does it for you, the WordPress Canonical URL plugin.
- Many-to-one problems, specifically with forums: you can get to the same page in a forum via different URLs.
- Sort order parameters are a common issue. For this, identify spiders and 301 them to the default URL.
- Breadcrumb navigation can be an issue also. Most of the time you replicate the breadcrumb in the URL structure, and you may have other ways to get to the same page, so you have several URLs but the same content. Make sure each product has one URL, and store the breadcrumb info in a cookie instead.
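Several of these fixes come down to the same step: work out the one default URL for a page and 301 everything else to it. Here is a small Python sketch of that step for session IDs and sort order parameters. The parameter names in `NON_CANONICAL_PARAMS` are guesses for illustration, and unlike the advice above this sketch does not try to identify spiders; it simply reports the redirect any requester would get.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names; substitute whatever your application uses.
NON_CANONICAL_PARAMS = {"sid", "sessionid", "phpsessid", "sort", "order"}


def canonical_url(url):
    """Drop session and sort-order parameters, keeping everything else in order."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in NON_CANONICAL_PARAMS
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))


def redirect_for(url):
    """Return (301, default_url) when url is not already the default, else None."""
    target = canonical_url(url)
    return (301, target) if target != url else None


print(redirect_for("http://www.example.com/products?cat=7&sid=abc123&sort=price"))
print(redirect_for("http://www.example.com/products?cat=7"))
```

The session and breadcrumb state that gets stripped from the URL would then have to live in a cookie, as suggested in the list above.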

Adam Lasnik from Google and Tim Converse from Yahoo are also on the panel for Q&A.

These posts may have spelling and grammar issues. These are session notes, written quickly and posted immediately after the session has been completed. Please excuse any grammar or spelling issues with session posts.

 

Comments:

cvos

12/06/2006 02:39 am

I agree that duplicate content is not an exact percentage, but associations between disparate web pages. For example, super competitive search phrases may have different duplicate content 'filters' than low frequency searches.
