The REP is not only a theatre; it also stands for Robots Exclusion Protocol (robots.txt and the robots meta tag). Everything you can shove into a robots meta tag on an HTML page can now be delivered in the HTTP header for any file type:
- INDEX|NOINDEX - Tells whether the page may be indexed or not
- FOLLOW|NOFOLLOW - Tells whether crawlers may follow links provided on the page or not
- ALL|NONE - ALL = INDEX, FOLLOW (default), NONE = NOINDEX, NOFOLLOW
- NOODP - Tells search engines not to use page titles and descriptions from the ODP on their SERPs
- NOYDIR - Tells Yahoo! search not to use page titles and descriptions from the Yahoo! directory on the SERPs
- NOARCHIVE - Google specific, used to prevent archiving (cached page copy)
- NOSNIPPET - Prevents Google from displaying text snippets for your page on the SERPs
- UNAVAILABLE_AFTER: RFC 850 formatted timestamp - Removes a URL from Google's search index a day after the given date/time
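To make the list above a bit more concrete, here is a minimal sketch of how such directives could be combined into one header line. The function name buildXRobotsTag is made up for this example, and the sketch assumes PHP 7 or later; UNAVAILABLE_AFTER is left out because it carries a timestamp value.

```php
<?php
// Hypothetical helper (not part of any library): builds an X-Robots-Tag
// header line from the simple directives listed above.
function buildXRobotsTag(array $directives): string
{
    // UNAVAILABLE_AFTER is deliberately omitted: it needs a timestamp value.
    $known = ['index', 'noindex', 'follow', 'nofollow', 'all', 'none',
              'noodp', 'noydir', 'noarchive', 'nosnippet'];
    $clean = [];
    foreach ($directives as $directive) {
        // Directives are treated as case-insensitive here.
        $directive = strtolower(trim($directive));
        if (!in_array($directive, $known, true)) {
            throw new InvalidArgumentException("Unknown directive: $directive");
        }
        $clean[] = $directive;
    }
    // Drop duplicates and join with ", " as in a robots meta tag.
    return 'X-Robots-Tag: ' . implode(', ', array_unique($clean));
}

echo buildXRobotsTag(['INDEX', 'NoArchive', 'nosnippet']), PHP_EOL;
// prints "X-Robots-Tag: index, noarchive, nosnippet"
```

The returned string could then be passed to PHP's header() function before any output is sent.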
So how can you serve X-Robots-Tags in the HTTP header of PDF files, for example? Here is one possible procedure that explains the basics; just adapt it to your needs:
Rewrite all requests for PDF documents to a PHP script that knows which files must be served with REP header tags. You could do an external redirect too, but this may confuse things. Put this code in your root's .htaccess:
RewriteEngine On
RewriteBase /pdf
RewriteRule ^(.*)\.pdf$ serve_pdf.php
In /pdf you store some PDF documents and serve_pdf.php:
...
$requestUri = $_SERVER['REQUEST_URI'];
...
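// Case-insensitive check: was my.pdf requested?
if (stristr($requestUri, "my.pdf")) {
    // Send the REP directives and the content type before any output
    header('X-Robots-Tag: index, noarchive, nosnippet', TRUE);
    header('Content-type: application/pdf', TRUE);
    // Stream the PDF stored next to this script, then stop
    readfile('my.pdf');
    exit;
}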
...
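If you have more than one PDF, the per-file check above could be generalized with a lookup table. The following is only a sketch under assumptions: the file names in $robotsMap, the function name resolvePdfRequest, and the 404 fallback are all made up for illustration, files are assumed to be stored with lowercase names next to the script, and the syntax needs PHP 7.1 or later.

```php
<?php
// Hypothetical map of PDF file names to REP directives (illustrative only).
$robotsMap = [
    'my.pdf'    => 'index, noarchive, nosnippet',
    'draft.pdf' => 'noindex, nofollow',
];

// Returns [file name, header lines] for a known PDF, or null otherwise.
function resolvePdfRequest(string $requestUri, array $robotsMap): ?array
{
    // Drop the query string, then keep only the file name so that "../"
    // sequences in the URI cannot point outside the script's directory.
    $path = parse_url($requestUri, PHP_URL_PATH);
    if (!is_string($path)) {
        return null;
    }
    $file = strtolower(basename($path));
    if (!isset($robotsMap[$file])) {
        return null;
    }
    return [$file, [
        'X-Robots-Tag: ' . $robotsMap[$file],
        'Content-Type: application/pdf',
    ]];
}

// Only serve a file when running under a web server (REQUEST_URI is set).
if (isset($_SERVER['REQUEST_URI'])) {
    $resolved = resolvePdfRequest($_SERVER['REQUEST_URI'], $robotsMap);
    if ($resolved === null) {
        http_response_code(404);
        exit;
    }
    [$file, $headers] = $resolved;
    foreach ($headers as $line) {
        header($line, true);
    }
    readfile(__DIR__ . '/' . $file);
    exit;
}
```

Keeping the lookup in a function without side effects makes it easy to try out from the command line before you wire it into the rewrite rule.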
This setup routes all requests for *.pdf files to /pdf/serve_pdf.php, which outputs something like this header when a user agent asks for /pdf/my.pdf:
Date: Tue, 31 Jul 2007 21:41:38 GMT
Server: Apache/1.3.37 (Unix) PHP/4.4.4
X-Powered-By: PHP/4.4.4
X-Robots-Tag: index, noarchive, nosnippet
Connection: close
Transfer-Encoding: chunked
Content-Type: application/pdf
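To check that the tag really arrives, you could pull the directives out of a captured response header. This is a rough sketch: extractRobotsDirectives is a made-up helper name, the sample text is copied from the header above, and it assumes simple directives without values (an UNAVAILABLE_AFTER timestamp containing a comma would be split incorrectly).

```php
<?php
// Hypothetical helper: extracts X-Robots-Tag directives from raw header text.
// Assumes simple comma-separated directives without values.
function extractRobotsDirectives(string $rawHeaders): array
{
    $directives = [];
    foreach (preg_split('/\r\n|\n|\r/', $rawHeaders) as $line) {
        // HTTP header names are case-insensitive.
        if (stripos($line, 'X-Robots-Tag:') === 0) {
            $value = substr($line, strlen('X-Robots-Tag:'));
            foreach (explode(',', $value) as $directive) {
                $directive = strtolower(trim($directive));
                if ($directive !== '') {
                    $directives[] = $directive;
                }
            }
        }
    }
    return $directives;
}

// Sample lines taken from the header shown above.
$sample = "Date: Tue, 31 Jul 2007 21:41:38 GMT\n"
        . "X-Robots-Tag: index, noarchive, nosnippet\n"
        . "Content-Type: application/pdf\n";

echo implode(' | ', extractRobotsDirectives($sample)), PHP_EOL;
// prints "index | noarchive | nosnippet"
```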
You can do that with all kinds of file types. Have fun and say thanks to Google :)
Sebastian - Great post.
I didn't have any idea about that - plan to write a post referencing you.
Cheers,
Matt
Thanks Matt :)
Here is another good link from Hamlet Batista:
serving X-Robots-Tags with SetEnvIf and Header add in .htaccess - neat :)
Sebastian - That is very clever. Good job!
I thought about doing something similar, but decided the .htaccess solution might be easier for non-programmers.
I thought about using ScriptAliasMatch to map files to a cgi script that would add the header and using PATH_INFO to find the file on disk.