The Robots Exclusion Protocol from 1994 gets used and abused; Lisa Barone, citing Google's Dan Crow, put it best: "everyone uses it but everyone uses a different version of it". De facto we have a Robots Exclusion Standard covering crawler directives in robots.txt as well as in robots meta tags, said Dan Crow. Besides non-standardized directives like "Allow:", Google's Sitemaps Protocol adds inclusion to the mix, now even closely bundled with robots.txt. And there are still more places to put crawler directives: unstructured (in the sense of independence from markup elements), as with Google's section targeting; on link level, applying the commonly disliked rel-nofollow microformat or XFN; plus related thoughts on block-level directives.
All in all that's a pretty confusing conglomerate of inclusion and exclusion, spread across many formats and markup elements, with lots of places to put crawler directives. Not really the sort of norm the webmaster community can successfully work with. No wonder that over 75,000 robots.txt files have pictures in them, that less than 35 percent of servers have a robots.txt file at all, that the average robots.txt file is 23 characters ("User-agent: * Disallow:"), and that gazillions of Web pages carry useless and unsupported meta tags like "revisit-after" ... for more funny stats and valuable information see Lisa's robots.txt summit coverage (SES NY 2007), also covered by Tamar (read both!).
How to structure a "Web-Robot Directives Standard"?
To handle redundancies as well as cascading directives properly, we need a clear and understandable chain of command. The following is just a first idea off the top of my head, and will likely be updated soon:
- Disallows directories, files/file types, and URI fragments like query string variables/values by user agent.
- Allows sub-directories, file names and URI fragments to refine Disallow statements.
- Gives general directives like crawl frequency or volume per day, perhaps even per folder, and restricts crawling during particular time frames.
- References general XML sitemaps accessible to all user agents, and specific XML sitemaps addressing particular user agents as well.
- Sets site-level directives like "noodp" or "noydir".
- Predefines page-level instructions like "nofollow", "nosnippet" or "noarchive" by directory, document type or URL fragments.
- Predefines block-level or element-level conditions like "noindex" or "nofollow" on class names or DOM IDs per markup element. For example "DIV.hMenu,TD#bNav 'noindex,nofollow'" could instruct crawlers to ignore the horizontal menu as well as the navigation at the very bottom of all pages.
- Predefines attribute-level conditions like "nofollow" on A elements. For example "A.advertising REL 'nofollow'" could tell crawlers to ignore links in ads, or "P#tos > A 'nofollow'" could instruct spiders to ignore links in TOS excerpts found on every page in a P element with the DOM-ID "tos".
- Since robots.txt deals with inclusion now, why not add an optional URL-specific "action" element allowing directives like "nocache" or "nofollow"? A "delete" directive to get outdated pages removed from search indexes would make sense as well.
- To make XML sitemap data reusable, and to allow centralized maintenance of page meta data, a couple of new optional URL elements like "title", "description", "document type", "language", "charset", "parent" and so on would be a neat addition. This way it would be possible to visualize XML sitemaps as native (and even hierarchical) site maps.
- Page meta data overrule directives and information provided in robots.txt and XML sitemaps. Empty contents in meta tags suppress directives and values given in upper levels. Non-existent meta tags implicitly apply data and instructions from upper levels. The same goes for everything below.
- Unstructured parenthesizing of chunks of code is certainly undoable with XMLish documents, but may be a pragmatic way to deal with legacy code. Markers in HTML comments flagging payload for contextual advertising purposes may be tolerated, but they're hard to standardize. Let's leave that for proprietary usage.
- Implementing a new attribute for messages to machines should be avoided for several good reasons. Classes are additive, so multiple values can be specified for most elements. That would allow putting standardized directives into class names, for example class="menu robots-noindex googlebot-nofollow slurp-index-follow" where the first class addresses CSS. Such inline robot directives come with the same disadvantages as inline style assignments and open a can of worms, so to speak. Using classes and DOM IDs merely as references to user-agent-specific instructions given in robots.txt is surely the preferable procedure.
- More or less this level is a playground for microformats utilizing the A element's REV and REL attributes.
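To make the chain of command above more tangible, here's what such an extended robots.txt might look like. To be clear: besides "User-agent", "Disallow", "Allow" and "Sitemap", every directive name below is made up for illustration purposes; none of them exist today.

```
# Hypothetical syntax -- only the first four directives are in use today
User-agent: *
Disallow: /private/
Allow: /private/annual-report.html
Sitemap: http://www.example.com/sitemap.xml

# General directives (hypothetical): crawl volume and time-frame restrictions
Crawl-volume: 1000/day
Crawl-window: 02:00-06:00

# Site-level directives (hypothetical)
Site-directive: noodp,noydir

# Page-level defaults by directory (hypothetical)
Page-directive: /archive/ "noarchive,nosnippet"

# Block-level conditions referencing class names and DOM IDs (hypothetical)
Element-directive: DIV.hMenu,TD#bNav "noindex,nofollow"

# Attribute-level conditions (hypothetical)
Attribute-directive: A.advertising REL "nofollow"
```

Again, this is just a sketch of how the pieces could live together in one file, not a proposed syntax.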
Besides the common values "nofollow", "noindex", "noarchive"/"nocache" etc. and their omissible positive defaults like "follow" and "index", we'd need a couple more, for example "unapproved", "untrusted", "ignore" or "skip". There's a lot of work to do.
In terms of complexity, a mechanism as outlined above should be as easy to use as CSS in combination with client-side scripting for visualization purposes.
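The cascade itself is simple to reason about. Assuming page meta data overrules XML sitemap data, which in turn overrules robots.txt defaults (the ordering of sitemap data is my assumption; the post above only fixes page meta data as the lowest, strongest level), and with empty meta contents suppressing inherited directives while missing ones inherit them, the resolution could be sketched like this. All function and key names are illustrative:

```python
# Rough sketch of the proposed directive cascade -- all names are illustrative.
# Assumed precedence: page meta tags > XML sitemap > robots.txt defaults.
# An empty value at a level suppresses the inherited directive entirely;
# a missing key at a level inherits from the level above.

def resolve_directives(robots_txt, sitemap, page_meta):
    """Merge crawler directives level by level, lower levels overruling upper ones."""
    resolved = dict(robots_txt)                 # start with site-wide defaults
    for layer in (sitemap, page_meta):
        for name, value in layer.items():
            if value == "":                     # empty content suppresses the directive
                resolved.pop(name, None)
            else:                               # explicit value overrules upper levels
                resolved[name] = value
    return resolved

directives = resolve_directives(
    robots_txt={"index": "yes", "follow": "yes", "archive": "yes"},
    sitemap={"archive": "no"},                  # sitemap narrows the site default
    page_meta={"follow": "", "snippet": "no"},  # empty "follow" suppresses it
)
print(directives)  # {'index': 'yes', 'archive': 'no', 'snippet': 'no'}
```

That's essentially the same mental model as CSS specificity: the most specific level wins, and anything not restated cascades down unchanged.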
However, whatever better ideas are out there, we need a widely accepted "Web-Robot Directives Standard" as soon as possible.
Tags: Search Engine Optimization (SEO) Robots.txt XML Sitemaps Meta Tags Crawler Directives