Indexing and Crawling — Hints vs. Directives

A word or 300 about using “noindex,follow” and other indexing signal factors…
 
[Cartoon: how many things in SEO require a response of “it depends”]

Questions have come up steadily over the years from people in the SEO community about using “noindex,follow”, or canonical tags pointing to other pages, as ways to get Google to act a certain way. It’s a subject that’s come up from time to time recently in a group I’m an admin of over on Facebook, “Dumb SEO Questions”.

In providing some insight there today, I realized the topic is strong enough to deserve a blog post of its own.
 
Note that this is my perspective and experience regarding those lines of thinking. It doesn’t mean what I convey here is absolute, set in stone, or applicable to every site and every situation. With all things SEO, there are edge case scenarios where something else may be true. So take what I offer here and do with it what you will.

Google — The All-Knowing Decider of Indexing

Google, being the all-wise cataloger of the web (in their view), does not truly respect robots.txt, meta robots tags, X-Robots-Tag headers, or canonical tags, in spite of each of those having the purpose of being a directive. Google’s programmers, in their view of the world, consider each such signal only a “hint” as to what site owners intend.
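For anyone who wants to see what that full set of signals looks like on a given URL, here’s a minimal sketch, Python standard library only, of the kind of quick check I mean. The example.com URL is a placeholder, the inspect_signals name is mine, and the regexes are deliberately simplistic (they assume name= and rel= appear before content= and href=), so treat it as an illustration rather than audit tooling.

    import re
    import urllib.request
    import urllib.robotparser
    from urllib.parse import urljoin, urlparse

    def inspect_signals(url):
        """Report the four signals discussed above for a single URL."""
        parsed = urlparse(url)

        # Signal 1: robots.txt (a crawl directive Google treats as a strong hint)
        rp = urllib.robotparser.RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        rp.read()
        allowed = rp.can_fetch("Googlebot", url)

        # Signals 2-4 live on the page itself: header, meta tag, canonical link
        with urllib.request.urlopen(url) as resp:
            x_robots = resp.headers.get("X-Robots-Tag", "")
            html = resp.read().decode("utf-8", errors="replace")

        meta = re.search(r'<meta[^>]+name=["\']robots["\'][^>]+content=["\']([^"\']+)', html, re.I)
        canonical = re.search(r'<link[^>]+rel=["\']canonical["\'][^>]+href=["\']([^"\']+)', html, re.I)

        return {
            "robots.txt allows crawl": allowed,
            "X-Robots-Tag header": x_robots or "(none)",
            "meta robots": meta.group(1) if meta else "(none)",
            "rel=canonical": urljoin(url, canonical.group(1)) if canonical else "(none)",
        }

    if __name__ == "__main__":
        for signal, value in inspect_signals("https://www.example.com/some-page/").items():
            print(f"{signal}: {value}")

Seeing those four values side by side for even a handful of URLs is usually enough to show where a site is sending mixed messages.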
_______________________________________

Conflicts Within Site Signals Muddy The Process

Of course, many sites inject conflicts across that range of signals, which is the ostensibly altruistic reason Google decided long ago to treat them only as hints. The trouble is that this imperfect algorithmic process quite often makes a poor determination as to what the system “should” do when it overrides what those signals convey.
 
Because of all of that, even including a URL pattern in robots.txt, canonicalizing to a different URL, or applying a meta robots noindex can sometimes fail to prevent Google from crawling some URLs, and in certain cases from indexing them as well.
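As one concrete illustration of the kind of conflict I’m talking about, here’s a minimal, self-contained sketch that flags URLs listed in a sitemap XML file that are simultaneously disallowed in robots.txt. The sitemap URL is a made-up placeholder; the point is only to show how easily “please index me” and “don’t even crawl me” can be said about the same URL.

    import urllib.request
    import urllib.robotparser
    import xml.etree.ElementTree as ET
    from urllib.parse import urlparse

    SITEMAP_URL = "https://www.example.com/sitemap.xml"  # hypothetical placeholder

    def sitemap_urls(sitemap_url):
        """Pull every <loc> entry out of a standard sitemap XML file."""
        ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
        with urllib.request.urlopen(sitemap_url) as resp:
            tree = ET.parse(resp)
        return [loc.text.strip() for loc in tree.findall(".//sm:loc", ns)]

    def conflicting_urls(sitemap_url):
        """Return sitemap URLs that robots.txt tells Googlebot not to crawl."""
        parsed = urlparse(sitemap_url)
        rp = urllib.robotparser.RobotFileParser(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
        rp.read()
        # The sitemap says "index me"; a Disallow for the same URL says
        # "don't crawl me". Google is left to guess which signal to honor.
        return [u for u in sitemap_urls(sitemap_url) if not rp.can_fetch("Googlebot", u)]

    if __name__ == "__main__":
        for url in conflicting_urls(SITEMAP_URL):
            print("Conflicting signals:", url)

Run against a real site, a list like that is exactly the sort of thing that leaves Google’s systems guessing.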
 _______________________________________

Blocked, Yet Listed

“Most” of the time (a relative concept), if a URL is disallowed in the robots.txt file, Google won’t fetch its content, yet they will still list that URL in search results when they discover it through links. At least they will have honored the spirit of the robots.txt file by not indexing the content of those URLs. It’s pretty insane sometimes.
 _______________________________________

The “Noindex,Follow” Way to Wealth, Fame and Lost Value

As for “noindex,follow”, there’s never a valid reason to use that combination. Sure, Google will pass value THROUGH those pages initially. But if the noindex remains in place long enough, Google will eventually stop passing value through them entirely; the pages end up being dropped from the Google process altogether.
John Mueller confirmed this in a Webmaster Hangout; Barry Schwartz shared the video over on SEO Roundtable.
 
And from the perspective of consistency of signals, if a page deserves “noindex” status, it’s best to use “nofollow” as well, so you convey more clearly, in one statement, what you want indexed and what you want crawled. For larger sites this is even more important because of crawl budget considerations, where consistent signals become integral to having your intent correctly understood by algorithmic processing.
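If it helps, here’s a minimal sketch, using only Python’s built-in wsgiref server, of a page that states the same intent in both of the places Google looks: the X-Robots-Tag response header and the meta robots tag. The page and its content are hypothetical; the point is simply that the two statements match, so there’s nothing for Google’s systems to reconcile.

    from wsgiref.simple_server import make_server

    PAGE = b"""<!doctype html>
    <html><head>
      <meta name="robots" content="noindex,nofollow">
      <title>Internal search results (not for the index)</title>
    </head><body>Thin, duplicative content lives here.</body></html>"""

    def app(environ, start_response):
        # Header and meta tag carry the same directive, so there is no
        # conflicting signal for Google to second-guess.
        start_response("200 OK", [
            ("Content-Type", "text/html; charset=utf-8"),
            ("X-Robots-Tag", "noindex, nofollow"),
        ])
        return [PAGE]

    if __name__ == "__main__":
        with make_server("127.0.0.1", 8000, app) as server:
            server.serve_forever()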
 _______________________________________

But Wait — There’s More to It!

Then there’s the fact that other signals get factored into all of this as well. If enough pages are deemed, by Google’s systems, to be unworthy of indexing for other reasons (too much duplication, not enough unique value, not enough trust, among others), those pages won’t always be indexed in spite of other signals. Or they may be indexed, yet not be helpful. In fact, Google will sometimes index pages that don’t ultimately deserve to be indexed, and that alone can weaken the value of pages that do deserve to be indexed.
 _______________________________________

URL Parameters, Sitemap XML Files, and Inbound Links

Whenever discussing the crawl and indexation decision process, it’s also important to mention that URL parameter settings in Search Console, when set to “Representative URL” or “Let Googlebot decide”, can muck up that decision process as well. The same is true for inclusion in sitemap XML files and for the presence of enough high-value inbound links. All of these can influence, to varying degrees, how and what Google crawls and ultimately indexes, in spite of robots settings or canonical tags.
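To show why parameters matter so much here, a quick sketch: it groups a handful of crawled URLs by their parameter-stripped form, so you can see how many variants end up competing to be the “representative URL” for a single piece of content. The sample URLs are made up.

    from collections import defaultdict
    from urllib.parse import urlsplit, urlunsplit

    # Hypothetical URLs pulled from a crawl or log file
    crawled = [
        "https://www.example.com/widgets/?sort=price",
        "https://www.example.com/widgets/?sort=price&page=2",
        "https://www.example.com/widgets/?sessionid=abc123",
        "https://www.example.com/widgets/",
    ]

    groups = defaultdict(list)
    for url in crawled:
        parts = urlsplit(url)
        # Strip the query string and fragment to get the parameter-free base URL
        groups[urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))].append(url)

    for base, variants in groups.items():
        print(f"{base} -> {len(variants)} variant(s) competing for one spot in the index")
        for v in variants:
            print("   ", v)

Multiply that by every parameter across a large catalog and it’s easy to see how much of the crawl and indexing decision ends up outside your direct control unless every other signal is consistent.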

_______________________________________

The SEO Indexing Bottom Line — Consistency

Okay, that wasn’t the bottom line. It was just the last section label in this post. The best recommendation I have is one I repeat often in my audit work: never leave it to Google to “figure it all out” when you have the ability to control, through consistency of signals, what you want their systems to do and how you want them to behave regarding your site.

 

Published by

Alan Bleiweiss

Alan Bleiweiss is a professional SEO consultant specializing in forensic audits, related consulting, client and agency training, and speaking to audiences of all sizes on all things SEO.

6 thoughts on “Indexing and Crawling — Hints vs. Directives”

  1. It’s interesting to hear your comment about canonicals. Canonicals are something still new to me (only because I choose to focus my efforts on broader topics such as audits and consulting). But you’re the first authority I’ve heard mention Google does not truly respect canonicals.

    But come to think of it, that has to be true.

    Google has guidelines and recommendations, but that’s it. There’s no sure-fire way of performing many tasks when it comes to indexing a website properly. Google uses custom software, not a single human being to index websites.

    A simple post, Alan, but very valuable to me. Thank you.

    1. JL,

      Google first confirmed it’s a hint, officially, in their own documentation many years ago. Here’s an entry from Google’s webmaster blog going back to 2009.
      ___________________________
      Is rel=“canonical” a hint or a directive?

      It’s a hint that we honor strongly. We’ll take your preference into account, in conjunction with other signals, when calculating the most relevant page to display in search results.
      ___________________________
      So it’s just a matter of knowing where to look, and of paying attention to as much of what they communicate as is reasonable. Yet it’s also something I know from how many audits I’ve done where Google has completely ignored mass volumes of canonicals because other signals confused them.

      Quote reference from:
      https://webmasters.googleblog.com/2009/02/specify-your-canonical.html

  2. Then is it safe to say that, for anything a webmaster has control over, Google will take their directives as only recommendations or a “hint”, but can choose a different final outcome?

    Because it would not be in their best interest as a business to allow a person outside of their company to have control over how they display content.

    Have you come across an absolute process for anything within search engine marketing where you can predict the outcome of an action taken by an SEO, every time?

    1. Essentially, that is the situation we face. And that’s exactly why I always recommend people be consistent in their signals, since any one signal can either confirm or conflict with the others.

      As for predicting the outcome of any one thing in every case, it’s tricky. There are “most of the time” scenarios and “some of the time” scenarios for just about anything that can be done.

  3. I wholeheartedly agree on how Google treats robots.txt, meta robots tags, x‑robots tags and canonicals, Alan, as well as their reasons for failing to respect them when faced with conflicting or confusing signals. I’ve seen those conflicts cause serious crawl budget issues on sites with only a few thousand pages… I can only imagine how much worse it could become for a huge ecomm site. Great post!

    1. Thanks Doc. Yeah when it’s a half million pages that have conflicts, that takes a toll all over the place.
