Robots.txt Overrides Meta Tags

Ryuzaki

A heads up for the builders.

I'm in the process of doing massive behind-the-scenes work for my case study site here, and part of that is crafting a giant robots.txt.

Building a database'd site with templates means you're going to have one header file being pulled for every page. I hate plugin dependency and don't really trust myself to create a custom set of tables for the purpose of slapping customized meta tags in the header. The best I could do is create a sophisticated if/elseif/else chain to add tags into the right pages. That's dumb to me at this level of the game.
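For anyone who does go the template route, the branching doesn't have to be a long if/elseif/else chain. Here's a rough Python sketch of the idea, not what I'm running: the path prefixes and the function names `robots_meta` and `render_header` are made up for illustration, and your own template language will look different.

```python
from html import escape

# Hypothetical path prefixes that should stay out of the index.
NOINDEX_PREFIXES = ("/search/", "/tag/", "/print/")


def robots_meta(path: str) -> str:
    """Return the robots meta tag for a page, or "" if none is needed."""
    # str.startswith accepts a tuple, so one check covers every prefix.
    if path.startswith(NOINDEX_PREFIXES):
        return '<meta name="robots" content="noindex">'
    return ""


def render_header(path: str, title: str) -> str:
    """Build the shared <head> block, adding the robots tag where needed."""
    parts = [f"<title>{escape(title)}</title>"]
    tag = robots_meta(path)
    if tag:
        parts.append(tag)
    return "<head>\n" + "\n".join(parts) + "\n</head>"
```

Keeping the rules in one tuple means a new no-index section is a one-line change instead of another branch in the header file.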

So I thought, hey, I'll just tell the search engines not to crawl these pages I don't want indexed, and I'll do so in the robots.txt...

It's not going to cut it. For example:

[Image: dXfBdZi.gif]


About.com told Google specifically: don't crawl this folder. "Don't crawl" should mean "don't even look at it." Yet look at this:

[Image: K3ElNVK.gif]


Google went ahead and indexed the 2,760 URLs from that folder. They aren't showing the content, titles, descriptions, etc., but they are indexing the URLs.
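To make the crawl-versus-index distinction concrete, here is a minimal Python sketch using the standard library's `urllib.robotparser`. The folder name and URLs are invented for illustration and are not About.com's actual rules. All it shows is the one question a Disallow rule answers: may a polite crawler fetch this URL? Nothing in the file says whether the URL itself can be listed in an index.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt; the folder name is made up for illustration.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private-folder/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def may_fetch(url: str, agent: str = "Googlebot") -> bool:
    """True if the rules above allow this user agent to fetch the URL."""
    return parser.can_fetch(agent, url)


blocked = may_fetch("https://example.com/private-folder/page.html")
allowed = may_fetch("https://example.com/public/page.html")
```

With these rules, `blocked` comes out False and `allowed` comes out True. That is the full extent of what the file controls.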

Do they show up for any legit search terms? Probably not. Could our sites get caught in some stupid Panda crossfire due to something like this? Most likely.

So my heads up is to definitely make sure you use:

<meta name="robots" content="noindex">
if you want to make sure a page isn't indexed. And if you do this, don't give the same directives in robots.txt, or it will take precedence and Google will ignore the meta tags. My guess is that, since they are choosing to disobey robots.txt directives (guaranteed they crawl the pages for their own data, as the pictures above essentially prove), they see the meta tags and ignore them as well.

But they seem to not ignore the meta tags when the robots.txt isn't "confusing" their crawlers. They crawl and follow but will at least not index.
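One way to read this behaviour is that a crawler which obeys a Disallow rule never fetches the page, so it never gets the chance to read the meta tag on it. Under that assumption, the doubled-up case can be caught before it ships. The Python sketch below is only an illustration: `RobotsMetaFinder` and `noindex_is_hidden` are names invented here, and it checks a single page against a robots.txt string using only the standard library.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Collect the directives from any <meta name="robots"> tag."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "meta" and (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            for directive in content.split(","):
                if directive.strip():
                    self.directives.add(directive.strip().lower())


def noindex_is_hidden(robots_txt, url, html, agent="Googlebot"):
    """True when the page carries noindex but robots.txt blocks the fetch."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return "noindex" in finder.directives and not rules.can_fetch(agent, url)
```

A True result means the page is doubled up: either drop the Disallow rule so the tag can be read, or accept that the tag is doing nothing.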

TL;DR
Don't double up duties in the robots.txt and meta robots tags. Use meta tags when you can. Yoast's WordPress plugin makes this easy in WordPress, for instance. Find a solution and do it right, or you'll end up like About.com. Search engines don't have to obey robots.txt or meta tags, so try to figure out which they are choosing to respect and go with that. In this case, Google will respect meta robots tags as far as indexing goes, as long as you don't double up in the robots.txt.
 
The problem is that neither robots.txt nor meta tags have to be honored.
Honoring them is considered "good manners", but it is not binding on any robot or user agent.

 
Frustrating, and the last paragraph mentions it here on Google support (cached version; the live page is down for me). I've also found that 301-ing to a blocked URL slaps it in the index.
 
What you don't know, of course, is when that specific robots.txt directive was added. Google will index that shit until they find the file updated.

Pro tip: don't add a Google+ button to any pages you don't want indexed. That goes specifically for dev sites, for the devs who can't build their development sites properly!
 
Pro tip: don't add a Google+ button to any pages you don't want indexed. That goes specifically for dev sites, for the devs who can't build their development sites properly!
All it takes is a plugin or addon that uses any Google service in the background, like translation, a PageRank check, etc. Back in the day, my favorite trick to get new pages indexed without a hassle was to do a simple PageRank check on them. Google's toolbar was a mighty effective way to crowdsource index building.
 
My guess is that, since they are choosing to disobey robots.txt directives (guaranteed they crawl the pages for their own data, as the pictures above essentially prove), they see the meta tags and ignore them as well.

I've had sites where Google has ignored both the robots.txt file and the robots meta tag, together and individually. One other option is to make sure you exclude these pages from your sitemap, and also to specifically exclude them in Webmaster Tools. But in the end, Google does what it wants.
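The sitemap part is easy to automate if the site already knows which pages carry noindex. A rough Python sketch, assuming a hypothetical list of `(url, noindex)` pairs pulled from wherever the site keeps that flag; `build_sitemap` is a made-up name, not part of any plugin or tool:

```python
import xml.etree.ElementTree as ET

# Namespace from the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap(pages):
    """pages: iterable of (url, noindex) pairs. Returns sitemap XML as a string."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for url, noindex in pages:
        if noindex:
            continue  # never advertise a page we want kept out of the index
        loc = ET.SubElement(ET.SubElement(urlset, "url"), "loc")
        loc.text = url
    return ET.tostring(urlset, encoding="unicode")
```

This only keeps the sitemap from contradicting the meta tags. It doesn't force anything on Google's side.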
 