Google Can Combine URLs Before Crawling

Jul 27, 2018 - 7:51 am

Google doesn't want to list the same content more than once in the search results for a single query, so it does a lot of duplicate detection to fold URLs and content together and avoid showing a messy set of results. One would assume that Google first needs to crawl the pages to see the content and meta information before consolidating those signals into a single page. But no, Google's John Mueller said they can often do this before they even crawl the URLs.

Google's John Mueller said in a webmaster hangout this morning, at the 24-second mark, that "we also do that essentially before crawling." Meaning, Google can also fold content together before it even crawls the URLs. "Where we look at the URLs that we see and based on the information that we have from the past we think well probably these URLs could end up being the same and then we'll fold them together," John Mueller added.

He said this usually makes sense "when we can recognize clear patterns." "Especially within a website where we can see well everything in this subdirectory here is the same thing as in this subdomain because of the way that the hosting is set up so we can just kind of like blindly assume that these URLs are the same without actually looking at them," he added.

I assume some of this is based on canonicals, redirects and basic CMS structures that show clear patterns and signals that Google can pick up on faster than crawling the whole site.
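To make the idea concrete, here is a minimal Python sketch of folding URLs by pattern before fetching anything. It is not Google's implementation, which Mueller did not describe in detail. The rule table, the `example.com` hostnames and the function names `predict_canonical` and `fold_before_crawl` are all hypothetical, and the sketch simply assumes that a rule such as "this subdomain mirrors that subdirectory" has already been learned from past crawls.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical rules "learned from the past": each entry says that every URL on
# duplicate_host is assumed to mirror canonical_host under path_prefix.
LEARNED_RULES = [
    # (duplicate_host, path_prefix, canonical_host)
    ("blog.example.com", "/blog", "www.example.com"),
]


def predict_canonical(url, rules=LEARNED_RULES):
    """Guess the canonical URL from the URL string alone, without fetching it."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # note: any explicit port is dropped
    path = parts.path or "/"
    for duplicate_host, path_prefix, canonical_host in rules:
        if host == duplicate_host:
            # Rewrite the subdomain URL to its assumed subdirectory equivalent.
            return urlunsplit(
                (parts.scheme, canonical_host, path_prefix + path, parts.query, "")
            )
    # No rule matched: the URL is its own predicted canonical (fragment removed).
    return urlunsplit((parts.scheme, host, path, parts.query, ""))


def fold_before_crawl(urls, rules=LEARNED_RULES):
    """Group discovered URLs by predicted canonical so each group is crawled once."""
    groups = {}
    for url in urls:
        groups.setdefault(predict_canonical(url, rules), []).append(url)
    return groups


if __name__ == "__main__":
    discovered = [
        "https://blog.example.com/post-1",
        "https://www.example.com/blog/post-1",
        "https://www.example.com/about",
    ]
    for canonical, members in fold_before_crawl(discovered).items():
        print(canonical, "<-", members)
```

In this sketch the first two discovered URLs land in the same group, so a crawler would only need to fetch one of them. A real system would presumably also verify such guesses against crawled content from time to time, since a "blind" assumption can be wrong.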

Here is the transcript, although the video was a bit choppy for me:

We try to do that when we look at the content. That's kind of like after indexing we've seen that these two pages are the same or almost the same, then we can fold them together.

But we also do that essentially before crawling. Where we look at the URLs that we see and based on the information that we have from the past we think well probably these URLs could end up being the same and then we'll fold them together.

That usually makes sense when we can recognize clear patterns. Especially within a website where we can see well everything in this subdirectory here is the same thing as in this subdomain because of the way that the hosting is set up so we can just kind of like blindly assume that these URLs are the same without actually looking at them.

Forum discussion at YouTube.

 
