Google News Can't Index Articles With Too Much HTML Formatting

Feb 23, 2011 • 8:59 am | Filed Under Google Search Engine Optimization
 

I spotted an interesting thread at the Google News Help forum, where a site owner complained that their articles weren't being included in Google News. Google replied that the reason was that some of the formatting tags weren't recognized.

What is interesting is that the specific tags Google called out as the issue were the standard paragraph and line break tags.

Harvey P. from Google said:

In reviewing your site, I found a couple of things that may be preventing our crawler from indexing your articles. In the HTML code of article pages, you use many formatting tags such as <p> and <br> that may cause problems for our crawler. Removing frequent use of these tags may help our system better identify and index your articles.

I looked at the site in question and picked a random article, and it didn't seem out of the ordinary. The code, including the <p> and <br> tags used throughout the body content, didn't seem atypical.


So I am not sure whether there was a specific article that had too much HTML formatting in it.

We do get errors on some of our own articles, particularly the daily recap posts. The error we get is "Article fragmented," which means:

The article body that we extracted from the HTML page appears to consist of isolated sentences not grouped together into paragraphs. We generated this error to avoid including what might be an incorrect piece of text.

Recommendations

* Try formatting your articles into text paragraphs of a few sentences each.
* Make sure your sentences are well punctuated.
* Make sure you don't use frequent <p> and <br> tags within your paragraphs, and try to avoid breaking up the article body in general.
* Consider removing some of the non-article text from the article page.
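If you want a rough way to check your own articles against these recommendations, here is a minimal sketch in Python. To be clear, this is not Google's algorithm, which isn't public. It is only my assumption of what "isolated sentences" might look like in markup: it splits the article HTML into blocks wherever a <p> or <br> tag occurs and reports how many blocks hold a single sentence. The names BlockSplitter, count_sentences and audit_fragmentation are made up for this example, and the sentence counter is a crude punctuation heuristic.

```python
import re
from html.parser import HTMLParser

# Tags treated as breaks in the article body (an assumption for this sketch).
BREAK_TAGS = {"p", "br"}


class BlockSplitter(HTMLParser):
    """Split HTML into text blocks wherever a <p> or <br> tag occurs."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished text blocks
        self._current = []  # text pieces of the block being built

    def _flush(self):
        # Join collected text, collapse whitespace, keep non-empty blocks only.
        text = " ".join("".join(self._current).split())
        if text:
            self.blocks.append(text)
        self._current = []

    def handle_starttag(self, tag, attrs):
        if tag in BREAK_TAGS:
            self._flush()

    def handle_endtag(self, tag):
        if tag in BREAK_TAGS:
            self._flush()

    def handle_data(self, data):
        self._current.append(data)

    def close(self):
        super().close()
        self._flush()


def count_sentences(text):
    # Crude heuristic: runs of . ! ? followed by whitespace or end of text.
    return max(1, len(re.findall(r"[.!?]+(?=\s|$)", text)))


def audit_fragmentation(html):
    """Report how many break-delimited blocks contain only one sentence."""
    parser = BlockSplitter()
    parser.feed(html)
    parser.close()
    counts = [count_sentences(block) for block in parser.blocks]
    single = sum(1 for c in counts if c == 1)
    return {
        "blocks": len(counts),
        "single_sentence_blocks": single,
        "single_ratio": single / len(counts) if counts else 0.0,
    }


if __name__ == "__main__":
    sample = "<p>First sentence.<br>Second sentence.<br />Third sentence.</p>"
    print(audit_fragmentation(sample))
```

A high single_ratio only suggests that the body is broken up sentence by sentence. What threshold, if any, triggers the "Article fragmented" error is not something Google has stated in the thread.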

So I suspect Harvey is responding to a specific type of article that is not properly structured.

Forum discussion at Google News Help.

Comments:

Chris Lake

02/23/2011 02:18 pm

Well this is a bit sucky isn't it? In addition, I've long suspected that articles with YouTube / video embed also cause problems for Google News (sometimes they seem to return the Article Too Long error message via Webmaster Tools, and I can only assume it is the embed code that is breaking the camel's back).

Barry Schwartz

02/23/2011 03:00 pm

Nah, those YouTube embeds are fine, as long as you have a lot of content around it.

Paul Mackenzie Ross

02/23/2011 05:54 pm

Sure, there's nothing too unusual about that markup but it is a bit sloppy and indicative of a copy-and-paste form of "publishing" i.e. copy from document, paste in WYSIWYG editor, hit the publish button et voila, you're a news publisher. Personally, as an editor and a designer, I expect genuinely newsworthy content to have been treated with the same care and attention in the markup department as the level of professionalism that the agency publishing the news has. Pedantic? Maybe, but a paragraph is a paragraph not a <p> tag punctuated with numerous <br /> tags and all that superfluous whitespace isn't helpful in markup either.
