Duplicate Content and Multiple Site Issues

Aug 8, 2006 • 3:04 pm | comments (7) by twitter | Filed Under Search Engine Strategies 2006 San Jose
 

Moderated by Anne Kennedy from Beyond Ink, who also gives a short presentation, “Double Trouble.” What is duplicate content? Multiple domains: an identical homepage under different URLs, or different links from one site to several URLs. The reason not to do this: the SEs say so, and that is enough. It confuses the robot (crawler). Mirror sites (one website on multiple domains) are one case. She uses the example of the site for the International Research Foundation for RSD/CRPS. Instead of staying on the .org top-level domain (TLD), they decided it would be better to be on a .edu TLD. When they switched, the site disappeared from view. The lesson is to watch out for IT types who think they know how to optimize for SEs. Another example: Lifeline Systems had two URLs. Previously, all of its business came from health care professionals, so many links came from various health care partners and hospitals. When the company decided to launch a consumer campaign, it changed URLs and caused much confusion. The site ended up getting banned, but that was fixed.

Dynamic URLs: only a short introduction, because Mikkel will be addressing them, but look for pages in the results without a real title or description, and for the “we have removed pages that were similar” note at the bottom of the results. 301 redirects are the “hero.” She highly recommends using them when removing duplicate content. If you have been banned, use Google’s and Yahoo’s reinclusion requests.

Shari Thurow from Grantastic Designs will talk about some elements of the duplicate content filter, and why SEs like to filter dupe content. Understand that duplicate content has no clear definition. It can be an exact copy, or “near-duplicate” content where only a date or some other small detail changes. The #1 reason is that too much dupe content interferes with the information retrieval process. SEs and consumers both want fast results. The SEs learned to cluster, limiting the results from any one website to a maximum of two. They omit duplicate results with the note that Anne mentioned above.

When SEs determine the uniqueness of each document, they use a “boilerplate” approach, also known as “block level analysis.” They leave out the navigation bars and focus on the main content areas. They look for every single webpage to have unique in-links (linkage properties). If two URLs have the same in-links, there is a possibility of duplicate content. Articles can appear on more than one site because each site has its own boilerplate and its own linkage properties. They also look for “content evolution.” In general, 65% of web content does not change on a weekly basis, and 0.8% will change completely on a weekly basis, such as a news site. They look at the “average page mutation” of a URL and a website. If it is too high, and the boilerplates are too similar, they will filter out the duplicate results.
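The boilerplate idea can be illustrated with a small Python sketch. It is only a toy model, not how any engine actually implements block level analysis: it assumes each page has already been split into text blocks, and it treats any block that appears on most pages of a site as template boilerplate. The function name `strip_boilerplate`, the `threshold` value, and the sample pages are all made up for illustration.

```python
from collections import Counter


def strip_boilerplate(pages, threshold=0.8):
    """Drop text blocks that appear on at least `threshold` of the pages.

    `pages` maps a URL to its list of text blocks (nav bar, body, footer).
    Blocks shared by most pages are assumed to be template boilerplate.
    """
    counts = Counter()
    for blocks in pages.values():
        counts.update(set(blocks))  # count each block once per page
    cutoff = threshold * len(pages)
    return {
        url: [block for block in blocks if counts[block] < cutoff]
        for url, blocks in pages.items()
    }


# Hypothetical site: same navigation and footer everywhere, /a and /c share a body.
pages = {
    "/a": ["Home | Products | Contact", "Blue widget, 5 cm.", "(c) Example Co"],
    "/b": ["Home | Products | Contact", "Red widget, 7 cm.", "(c) Example Co"],
    "/c": ["Home | Products | Contact", "Blue widget, 5 cm.", "(c) Example Co"],
}
main = strip_boilerplate(pages)
duplicates = main["/a"] == main["/c"]  # compare main content only
```

Once the shared navigation and footer are removed, `/a` and `/c` are left with identical main content, which is the kind of pair a duplicate filter would be looking for.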

Another type of filter is “host name resolution.” She uses a BMW example. If the host names resolve to the same company or organization, that organization is probably monopolizing control of the content. If content is moved to different servers too often, they will look for that too, since “search engine spammers” tend to do this. Last thing: she is an Andrei Broder “groupie.” All SEs use his concept of “shingle comparison.” You can search for a detailed explanation, but essentially, the more shingles (sets of consecutive words) that two URLs share, the higher the likelihood of dupe content. SEs only want one page. Shari doesn’t like it when SEs pick which one to display. She suggests deciding which URLs you want displayed if you have similar descriptions, and excluding the others with robots.txt.
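For readers who want to see the shingle idea in concrete terms, here is a minimal Python sketch. It breaks each text into overlapping runs of `k` consecutive words and measures the overlap between the two sets of shingles. The shingle size of 4, the function names, and the sample sentences are assumptions for illustration. Production systems compare compact sketches of the shingle sets rather than the full sets.

```python
def shingles(text, k=4):
    """Return the set of k-word shingles (consecutive word runs) in `text`."""
    words = text.lower().split()
    if len(words) < k:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def resemblance(a, b, k=4):
    """Shared shingles divided by all distinct shingles (0.0 to 1.0)."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        return 1.0  # two empty documents are trivially identical
    return len(sa & sb) / len(sa | sb)


page_a = "duplicate content can confuse a search engine crawler"
page_b = "duplicate content can confuse a search engine spider"
score = resemblance(page_a, page_b)  # 4 shared shingles out of 6 distinct
```

A score near 1.0 suggests near-duplicate pages. Where to draw the line is a judgment call that this sketch does not try to answer.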

Some duplicate content is considered spam. This is where clustering has come in handy. The ability of SEs to detect dupe content has improved greatly even since SES NYC. She shows an example of a university website that has a “hallway” page, which is a “sitemap” of its doorway pages. The example has a link to “marketing degree,” followed by a link to “marketing degreeS.” Some duplicate content is copyright infringement. She recommends using Copyscape.com to help find people “swiping your content.” She also suggests using the Wayback Machine at archive.org to find old versions of content that may have been copied. Keep your own records with versions of your content. You have to prove ownership to the SEs, because they cannot take content down just on your word.

Summary: use robots exclusion, have a copyright plan, keep track of your content, and register with copyright.gov. Last but not least: don’t exploit the SEs, and help everyone’s experience.
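As a rough illustration of the robots exclusion advice, the Python sketch below checks a set of robots.txt rules with the standard library's `urllib.robotparser`. The `/print/` directory, the example.com URLs, and the `ExampleBot` user agent are hypothetical. The assumption is that printer-friendly copies of articles live under `/print/` and are the duplicates you want crawlers to skip.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: keep crawlers out of the printer-friendly copies.
rules = """\
User-agent: *
Disallow: /print/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# The preferred URL stays crawlable; the duplicate copy is excluded.
article_ok = parser.can_fetch("ExampleBot", "http://www.example.com/articles/widgets.html")
print_ok = parser.can_fetch("ExampleBot", "http://www.example.com/print/widgets.html")
```

Checking rules this way before publishing them is a cheap guard against accidentally excluding the version you wanted displayed.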

Mikkel deMib Svendsen from RedZoneGlobal will cover some of the technical duplicate issues. The list is too long to go through in full, so he will focus on the more common technical issues: with and without “WWW” (canonical issues), session IDs, URL rewriting, many-to-one problems in forums, sort order parameters, and breadcrumb navigation.

WWW or not WWW? Indexing issues: most engines seem to be able to deal with this. Linking issues: yes, still an issue, because you want people to link to the URLs that the engines are prioritizing. So he recommends a 301 redirect to the more popular version. The WWW version seems to be the most common.
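The WWW redirect logic is usually configured in the web server, but it can be sketched in a few lines of Python to show the decision being made. This is an illustration under assumptions: `www.example.com` is taken to be the preferred host, `example.com` the alias, and `canonical_redirect` is a made-up helper name.

```python
from urllib.parse import urlsplit, urlunsplit

CANONICAL_HOST = "www.example.com"  # assumption: the WWW version is preferred
ALIAS_HOSTS = {"example.com"}       # assumption: hosts that should redirect


def canonical_redirect(url):
    """Return (301, target) if `url` is on an alias host, otherwise None."""
    parts = urlsplit(url)
    if parts.hostname in ALIAS_HOSTS:
        # Keep scheme, path, and query; swap only the host (any port is dropped).
        target = urlunsplit(
            (parts.scheme, CANONICAL_HOST, parts.path or "/", parts.query, parts.fragment)
        )
        return 301, target
    return None


redirect = canonical_redirect("http://example.com/products?id=7")
no_redirect = canonical_redirect("http://www.example.com/products?id=7")
```

Using a permanent (301) redirect rather than a temporary one is the point of the recommendation: it tells engines which version should collect the links.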

Session IDs can be a true nightmare. A website (smartpages.com) had 200,000 versions of its homepage indexed. The recommended solution is to put session info into a cookie for all users, or to identify spiders and feed them their own content, without the session ID. Google recommends this; it is not “cloaking.” You are stripping out info to help the engines.
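Here is a hedged Python sketch of the second option, identifying spiders and giving them URLs without session info. The parameter names in `SESSION_PARAMS` and the user-agent substrings in `CRAWLER_TOKENS` are illustrative assumptions, not a complete or authoritative list, and a real site would apply this wherever it generates links.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

SESSION_PARAMS = {"sid", "sessionid", "phpsessid"}  # hypothetical parameter names
CRAWLER_TOKENS = ("googlebot", "slurp", "teoma")    # illustrative user-agent substrings


def is_crawler(user_agent):
    """Very rough spider detection based on the User-Agent header."""
    agent = user_agent.lower()
    return any(token in agent for token in CRAWLER_TOKENS)


def strip_session(url):
    """Remove session parameters from the query string, keeping everything else."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in SESSION_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def link_for(url, user_agent):
    """Give spiders the clean URL; leave links for regular visitors unchanged."""
    return strip_session(url) if is_crawler(user_agent) else url


clean = link_for("http://www.example.com/?sid=abc123&page=2", "Googlebot/2.1")
```

The page content is the same for everyone. Only the session parameter is dropped, which is why this is described as helping the engines rather than cloaking.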

WordPress and other blog solutions: rewrite URLs without parameters by identifying what you want used (such as post name and/or post ID). The problem is that when you do this, you still have both the rewritten URL and the original URL. Even though there are no links to the original, the engines end up finding it. It only takes one link to the original version for it to be crawled and become a dupe issue. Once again, use 301 redirects to fix this. In WordPress there is a plugin you can use to automatically 301 redirect the original to the new one. He uses a tool (Schlueterica Wordpress Canonical URL plugin) for this (isaacschlueter.com/plugins/i-made/cannonical-url/).
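The redirect from the original parameter URL to the rewritten one can be sketched as follows. This is not the plugin's code. It is a minimal Python illustration that assumes the original URLs use a `p` query parameter for the post ID, and the `PERMALINKS` table and blog.example.com host are invented for the example.

```python
from urllib.parse import parse_qs, urlsplit

# Hypothetical lookup: post ID -> rewritten (parameter-free) path.
PERMALINKS = {"123": "/duplicate-content-issues/"}


def redirect_for(url):
    """Return (301, rewritten URL) for a known ?p=ID request, otherwise None."""
    parts = urlsplit(url)
    post_ids = parse_qs(parts.query).get("p", [])
    if post_ids and post_ids[0] in PERMALINKS:
        return 301, f"{parts.scheme}://{parts.netloc}{PERMALINKS[post_ids[0]]}"
    return None


result = redirect_for("http://blog.example.com/?p=123")
```

With a rule like this in place, a stray link to the original URL sends both visitors and crawlers to the single rewritten version.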

“Many-to-one” in forums, where a page can be requested in different ways, such as the plain URL or the URL appended with a post ID and/or a thread page ID. There are some workarounds; he feels the one SEW is using could be improved. There is no linking solution. His idea is once again that you have to 301 redirect. Sort order parameters in the URL will cause duplicate indexing as well, which is yet another waste of link popularity.
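One way to find out whether a site has this problem is to take a list of crawled URLs and group them by what they would look like with the irrelevant parameters removed. The Python sketch below does that. The parameter names in `IGNORED_PARAMS` and the forum URLs are assumptions for illustration, and each site would need its own list.

```python
from collections import defaultdict
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

IGNORED_PARAMS = {"sort", "order", "postid"}  # hypothetical parameters that do not change the page


def canonical_key(url):
    """Drop ignored parameters and sort the rest so equivalent URLs compare equal."""
    parts = urlsplit(url)
    kept = sorted(
        (key, value)
        for key, value in parse_qsl(parts.query)
        if key.lower() not in IGNORED_PARAMS
    )
    return urlunsplit((parts.scheme, parts.netloc.lower(), parts.path, urlencode(kept), ""))


def duplicate_groups(urls):
    """Map each canonical URL to the crawled URLs that collapse onto it (2 or more)."""
    groups = defaultdict(list)
    for url in urls:
        groups[canonical_key(url)].append(url)
    return {key: members for key, members in groups.items() if len(members) > 1}


crawled = [
    "http://forum.example.com/thread?t=42",
    "http://forum.example.com/thread?t=42&postid=9001",
    "http://forum.example.com/thread?sort=date&t=42",
    "http://forum.example.com/thread?t=43",
]
groups = duplicate_groups(crawled)
```

Each group is a set of URLs splitting link popularity across one page, and the canonical key is a natural target for the 301 redirect.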

Breadcrumb navigation: you may have a problem with the navigation, especially if you categorize a product in multiple ways. The first thing is that each unique page should have exactly one unique URL. If you want to reflect breadcrumbs, he suggests storing them in a cookie. Remember that there are infinite ways to create multiple URLs to the same page.
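The cookie suggestion can be sketched with Python's standard `http.cookies` module. The idea, under the assumptions here, is that a product keeps one URL no matter which category path the visitor took, while the path itself travels in a cookie. The cookie name `trail`, the `|` separator, and the category names are all hypothetical.

```python
from http.cookies import SimpleCookie


def set_trail_header(trail):
    """Build a Set-Cookie value that stores the visitor's category path."""
    cookie = SimpleCookie()
    cookie["trail"] = "|".join(trail)
    cookie["trail"]["path"] = "/"
    return cookie["trail"].OutputString()


def read_trail(cookie_header):
    """Recover the category path from the request's Cookie header."""
    cookie = SimpleCookie()
    cookie.load(cookie_header)
    return cookie["trail"].value.split("|") if "trail" in cookie else []


# The product page keeps a single URL, e.g. /products/trowel, whichever way it was reached.
header = set_trail_header(["garden", "tools"])
trail = read_trail("trail=garden|tools")
```

A crawler that ignores cookies simply sees the one product URL, with no category variants to index.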

Q&A: speakers include Tim Converse from Yahoo!, Matt Cutts from Google, and Rahul Lahiri from Ask.com.

Tim: all of the info that was presented was very accurate. He wants to distinguish inadvertent duplication from “spamming.” The perfect thing for a crawler would be one URL for every unique page, but he knows that people don’t think only about SEs when designing. Do not worry about getting banned for inadvertent issues. The easier you can make it for us, the better it will be. The ultimate goal for everyone is to present a diverse set of results to the user. If you stray into creating multiple domains with the same content, or “repurposing” someone else’s content, they find this to be abusive. Ask yourself, “Do I own the content?”

Matt: again, very good information. Of course the goal is to have good content. There is definitely abusive behavior. People often say, “I have a .DE and a .FR of a site.” This is not a problem. If you are taking the time to translate into multiple languages, no problem. If you have an article split across four pages and also rendered on one page, that is OK, but he suggests blocking the spider from one version. Then there are “near duplicates”: pages with different boilerplate but identical or similar main content. When Google crawls and presents search results, even to the last millisecond, they remove possible duplicates. Last point: link consistency. He would recommend that all links point to the WWW version. Just this past Friday, they renamed Sitemaps to Google Webmaster Tools. There is a tool there for this: if, for example, the link you “own” at DMOZ goes to the non-www version, you can report it and they will take it into consideration.

Rahul also agrees with the other speakers and discusses a couple of examples of duplicate content and how they are trying to deal with it. One is the Nestle water brands, which are branded differently in different geographical areas (Poland Spring, etc.): each has its own URL, with identical sites that only change the brand name.

 

Comments:

dude999

08/08/2006 08:19 pm

Very informative, thank you.

Kevin Heisler

08/08/2006 08:20 pm

Chris -- great summary of an excellent panel. However, to prove how difficult it is for Google to remove duplicate content, do a search on "Andre Broder" and "shingle comparison." BizResearch is the #1 natural search result with a competing blog summary of the panel. NetShaq is #2 with spam. At #3 is SEOData, with a republished copy of your article posted in real-time by reBlogger. http://www.seodata.com/PageRank-SERP-PR-Google/re-22336_PageRank-SERP-PR-Google---Duplicate-Content-Issues.aspx Although it's a terrific service, CopyScape can't locate the plagiarized web page. The result? SERoundtable -- the only authentic result -- is pushed down to #4 of 4 results. Kevin Heisler Proof Theory

Barry Schwartz

08/08/2006 09:54 pm

That is because Google doesnt trust this site at the time being. Working it out with them now. Don't worry, we will be number one soon.

chris boggs

08/09/2006 04:39 pm

Thanks Dude for the kudos, and Kevin for the astute observations. Barry maybe we need to have a little talk with Mr. C? :P

Barry Schwartz

08/10/2006 02:32 pm

All better now http://www.google.com/search?q=Duplicate%20Content%20Issues

chris boggs

08/11/2006 02:25 pm

nice! ;)

Drew

08/17/2006 01:55 pm

I was in SES too and the spelling you have is not correct. I almost went crazy looking for more info. I even looked on her site for any links to this guy. What I found was this: http://search.yahoo.com/search?p=andrei+broder+shingles&sp=1&fr2=sp-top&prssweb=Search&ei=UTF-8&fr=sfp&ei=UTF-8&SpellState=n-1287165022_q-u1ro7cx%2FJNs8MmjqpFjiWQABAA%40%40 So his name is Andrei Broder Here is one of his documents: http://www.cs.princeton.edu/courses/archive/spring05/cos598E/bib/CPM%202000.pdf
