Google's "Quick View" PDF Also Does OCR Conversion For Many Languages

Oct 9, 2009 • 8:22 am | comments (5) | Filed Under Google Search Engine
 

Google recently announced a search results feature named "Quick View," which it implemented just a couple of weeks ago. Quick View shows you a PDF in a web-based PDF viewer hosted by Google. It takes the PDF from the host, typically the owner of the PDF, and does all the conversion on Google's servers.

The neat part is that this feature gives you OCR for virtually all of the languages Google offers translation for. I'll get to that in a bit. First, let me show you a basic example of how Quick View works, and then I'll show you the OCR on a non-English document.

A search for [w4] returns the IRS's web site with the PDF of a W-4 form.

Google PDF Quick View & OCR

When you click on the Quick View link in the search results, you get this page:

Google PDF Quick View & OCR

Yes, you get a neat view of the PDF, plus the ability to download the file, print it or convert it to plain HTML. In a WebmasterWorld thread, webmasters are unhappy about this because it bypasses your site, so you get no traffic benefit. Tedster explains:

So it looks like one more way that Google Search can distribute a site's content without requiring a direct visit to the site itself - and in this case, it's an entire document, not just a snippet. And the intention is to roll this out for other file format types, too.

What makes things even worse, from a copyright standpoint, is the OCR technology. Someone can upload a book in almost any language, let Google index it as a PDF, then convert it to plain HTML and copy and paste from there.

For example, this Hebrew book in Quick View looks like this:

Google PDF Quick View & OCR

If you click the "Plain HTML" link, you are taken to a page where Google has OCRed the text into copy-and-paste-friendly Hebrew. Pretty neat! Well, neat to some, but not to those who might own the copyright on this text.

Forum discussion at WebmasterWorld.

 

Comments:

Todd Hebert

10/09/2009 12:50 pm

It seems good for the end user; however, for the site owner this is wrong.

Ionut

10/09/2009 02:32 pm

I don't see any example of OCR in your post. Both documents include text, not scanned images. The text is available if you open the files using Adobe Reader. Regarding the webmasters who are not happy about this, there's no real difference between the previous option to view files as HTML and this enhanced document viewer. When you click on the title of the search result, you'll still open the document in your browser.

Barry Schwartz

10/09/2009 02:38 pm

Ionut, the Hebrew document is a scan of a book. Google is OCRing that into plain text in the HTML version.

Lucky Balaraman

10/11/2009 10:37 am

This fact throws up a new golden rule: Always mention your site address on your online PDFs with the call to action, "For more important information on this subject, go to http://yoursite.com."

bd_

10/12/2009 02:00 am

In the specific Hebrew document linked, the original PDF already has the OCR'd text in it. If you open it up right there on your computer with any random PDF reader, you can copy-and-paste it without Google's help. Google's just doing what every other PDF reader does - make that second plain-text layer copy-and-pastable.
