Sebastian's Pamphlets


Wednesday, February 28, 2007

Why proper error handling is important

Misconfigured servers can prevent search engines from crawling and indexing. I admit that's yesterday's news. However, standard setups and code copied from low-quality resources are underestimated, but very popular, points of failure. According to Google, a missing robots.txt file combined with amateurish error handling can result in invisibility on Google's SERPs. That's a very common setup, by the way.

Googler Jonathon Simon said:

This way [correct setup] when the Google crawler or other search engine checks for a robots.txt file, they get a 200 response if the file is found and a 404 response if it is not found. If they get a 200 response for both cases then it is ambiguous if your site has blocked search engines or not, reducing the likelihood your site will be fully crawled and indexed.

That's a very carefully written warning, so let me rephrase the message between the lines:

If you have no robots.txt and your server responds "Ok" (200) when Googlebot requests it, or answers with a 302 and then serves the error page with a 200, Googlebot might not be willing to crawl your stuff further, hence your pages will not make it into Google's search index.
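In other words, when a header checker requests http://www.example.com/robots.txt (example.com standing in for your domain), the status line should read

HTTP/1.1 200 OK

if the file exists, and

HTTP/1.1 404 Not Found

if it doesn't. Anything else, such as a 302 followed by a 200 from the error page, is the ambiguous case Google warns about.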

If you don't suffer from IIS (Windows hosting is a horrible nightmare that comes with more pitfalls than there are countable objects in the universe: go find a reliable host), here is a bullet-proof setup.

If you don't have a robots.txt file yet, create one and upload it today:

User-agent: *
Disallow:

This tells crawlers that your whole domain is spiderable. If you want to exclude particular pages, file types or areas of your site, refer to the robots.txt manual.
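For example, a robots.txt that keeps crawlers out of a script directory and a members area (made-up paths, adjust to your site) would look like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /members/

Each Disallow value is a path prefix; an empty value, as in the sample further above, disallows nothing.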

Next look at the .htaccess file in your server's Web root directory. If your FTP client doesn't show it, add "-a" to "external mask" in the settings and reconnect. If you find complete URLs in lines starting with "ErrorDocument", your error handling is screwed up: the server does a soft redirect to the given URL, which probably responds with "200-Ok", and the actual error code gets lost in cyberspace. Sending 401 errors to absolute URLs will slow your server down to the performance of a single IBM-XT hosting Google.com, and all other error directives pointing to absolute URLs result in crap. Here is a well-formed .htaccess sample:

ErrorDocument 401 /get-the-fuck-outta-here.html
ErrorDocument 403 /get-the-fudge-outta-here.html
ErrorDocument 404 /404-not-found.html
ErrorDocument 410 /410-gone-forever.html
Options -Indexes
<Files ".ht*">
deny from all
</Files>
RewriteEngine On
RewriteCond %{HTTP_HOST} !^www\.canonical-server-name\.com$ [NC]
RewriteRule (.*) http://www.canonical-server-name.com/$1 [R=301,L]
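For contrast, here is the broken variant described above (example.com stands in for any absolute URL). Because the target is a full URL, Apache answers the error with a 302 redirect, and the error page itself then responds with 200-Ok, so the actual 404 never reaches the crawler:

ErrorDocument 404 http://www.example.com/404-not-found.html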

With "ErrorDocument" directives you can capture other clumsiness as well, for example 500 errors with /server-too-buzzy.html or so. Or make the error handling comfortable using /error.php?errno=[insert err#]. In any case avoid relative URLs (src attribute in IMG elements, CSS/feed links, href attributes of A elements ...) on all landing pages. You can test actual HTTP response codes with online header checkers.

The other statements in the sample do different neat things. Options -Indexes disallows directory browsing, the <Files> block makes sure that nobody can read your server directives, and the last three lines 301-redirect requests for invalid server names (for example the domain without the "www" prefix) to your canonical server address.

.htaccess is a plain ASCII file; it can get screwed up when you upload it in binary mode or edit it with a word processor. Best edit it with a plain-text editor (vi, Notepad) as htaccess.txt on your local machine (most FTP clients choose ASCII mode for .txt files) and rename it to ".htaccess" on the server. Keep in mind that file names are case sensitive.



3 Comments:

  • At Friday, March 02, 2007, Anonymous Anonymous said…

    Articles about .htaccess are usually a mess; this one is an exception, really straightforward.

    Nice!
  • At Sunday, March 04, 2007, Blogger Mr. Apache said…

    Here is an article that shows every single Apache status code and the actual headers and source returned for each error: Force Apache to output any HTTP Status Code with ErrorDocument
  • At Sunday, March 04, 2007, Blogger Sebastian said…

    Thanks Mr. Apache, your article is a nice addition to these resources:
    http://www.w3.org/Protocols/rfc2616/rfc2616-sec10.html
    http://www.w3.org/Protocols/rfc2616/rfc2616.html
    http://httpd.apache.org/docs/1.3/mod/core.html
    http://httpd.apache.org/docs/2.0/custom-error.html
    More technical info here:
    http://www.google.com/search?num=100&hl=en&safe=off&q=http+1.1+error+codes+site:apache.org+-inurl:mail-archives

     
