Google On Removing Redacted Information From Search

Feb 1, 2021 • 7:11 am | Filed Under Google Search Engine

Google's John Mueller highlighted a document named "Keep redacted information out of Google Search," which he said is not new, but is useful. The document gives you tips on how to help ensure Google does not find such content in the first place, and ends with steps for removing the content if it is found.

John posted this on Twitter "Redacted information & Google Search -- not particularly new, but worth keeping in mind for when you need to be sure."

Most of you reading this via a Google Search probably already have information you want removed from Google. So let's jump to that advice:

What to do if unredacted or improperly redacted documents are indexed in Search

  • Remove the live document from the website or location where you published it.
  • Use the Removals tool for the verified site to remove the documents in question from Search. Use a URL prefix if you need to remove many documents. For verified sites, a URL removal generally takes less than a day. This prevents the document in question from appearing for any searches for redacted content.
  • Host the properly redacted document under a different URL. This makes sure that any newly indexed version is of the new document, and not an older version of the document (since recrawling of URLs and updating them in a search index can take a bit of time). Update any links to those documents.
  • Contact any other site that may also be hosting the improperly redacted documents and ask them to take them down as well. Ask them to use the Removals tool in their Search Console account, or you can use the Outdated Content tool to ask Google's systems to update the search results.
  • Allow the URL removal requests to expire (this happens after the URLs have been updated in Google's search index, or after about 90 days).

Before that, Google's document gives tips for making sure redacted information is never found in Google Search in the first place. Most of this may seem obvious, but it is a good read if you work with a lot of legal or sensitive information in your job.

Here is what Google wrote.

Edit and export images before embedding them

Google Search lists images that it finds across the web, both those on web pages and those embedded in various document formats. Embedded images are sometimes edited using only the containing document's editing tools, which can cause the redaction to fail when an image is indexed apart from the document. That is why it's best to edit images before embedding them into a document, not after. In particular:

  • Crop out unwanted information from images before embedding them into documents. Some document editing tools (such as word processors or slide creation tools) will maintain any uncropped images that you use in the public version of the document, so be sure to review the tool's documentation thoroughly.
  • Completely remove or obscure any text or other non-public parts of the image, as OCR systems may turn any image text seen into searchable text.
  • Remove any undesired metadata.

After following the suggestions above, export or save the updated images in a non-vector, flattened image file format such as PNG or WebP. This prevents the removed parts of the images from being inadvertently included in a public document.
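As a quick sanity check on exported files, you can scan a PNG for leftover textual metadata chunks with a few lines of standard-library Python. This is an illustrative sketch of my own, not something from Google's document:

```python
import struct

def png_text_chunks(path):
    """Return the names of metadata chunks (tEXt/zTXt/iTXt/eXIf) in a PNG file."""
    found = []
    with open(path, "rb") as f:
        if f.read(8) != b"\x89PNG\r\n\x1a\n":
            raise ValueError("not a PNG file")
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, ctype = struct.unpack(">I4s", header)
            if ctype in (b"tEXt", b"zTXt", b"iTXt", b"eXIf"):
                found.append(ctype.decode("ascii"))
            f.seek(length + 4, 1)  # skip the chunk data and its CRC
            if ctype == b"IEND":
                break
    return found
```

An empty result does not guarantee the image is clean, but a non-empty one is a clear signal to re-export the file before publishing.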

Edit or remove unwanted text before moving to a public file format

Before you generate the public document, remove any text that you don't want displayed in the final version of the file. Move to a public format that does not keep your previous change history. Here are more specific tips:

  • Use proper document redacting tools if a file needs to have information redacted. For example, avoid placing black rectangles over text as a redaction method, as this can result in the text still being included in the public document.
  • Double-check the document metadata in the public file.
  • Follow the document redaction best practices for the format that you are using (PDF, image, etc.).
  • Consider information in the URL or file name itself. Even if a part of a website is blocked by robots.txt, the URLs may be indexed in search (without their content). Use hashes in URL parameters instead of email addresses or names.
  • Consider using authentication to limit access to the redacted content. Serve the resulting login page with a noindex robots meta tag to block indexing.
  • When publishing, make sure that the website is verified in Google Search Console. This allows quick removal action, should it be needed.
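To illustrate the URL-parameter tip above: instead of putting an email address or name in a URL, derive an opaque token from it. Here is a minimal sketch in Python; the secret, helper name, and URL are my own assumptions, not from Google's guidance:

```python
import hashlib
import hmac

# Hypothetical site-wide secret; in practice, load this from configuration.
SECRET = b"replace-with-a-real-secret"

def opaque_token(identifier: str) -> str:
    """Derive a stable, non-reversible URL token from a private identifier."""
    digest = hmac.new(SECRET, identifier.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]

# The URL reveals nothing personal, even if robots.txt-blocked URLs get indexed.
url = "https://example.com/case?id=" + opaque_token("jane.doe@example.com")
```

Using an HMAC rather than a bare hash means the tokens cannot be brute-forced from a list of known email addresses without the secret.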

Why did he share this now if it is not new? Well, I think Google cleaned up the document a few weeks ago and then someone recently asked John about how some content got into search.

Forum discussion at Twitter & WebmasterWorld.
