Sebastian's Pamphlets

If you've read my articles somewhere on the Internet, expect something different here.

MOVED TO SEBASTIANS-PAMPHLETS.COM

Please click the link above to read current posts; this archive will disappear soon!

Stay tuned...

Friday, April 13, 2007

Is XML Sitemap Autodiscovery for Everyone?

Referencing XML sitemaps in robots.txt was recently implemented by Google, following webmaster requests that go back to June 2005, shortly after the initial launch of sitemaps. Yahoo, Microsoft, and Ask support it as well, although nobody knows when MSN is going to implement XML sitemaps at all.

Some folks argue that robots.txt, introduced by the Robots Exclusion Protocol in 1994, should not be abused for inclusion mechanisms. Indeed, this may create confusion, but it has been done before, for example by the search engines supporting Allow: statements, introduced in 1996. Also, the de facto Robots Exclusion Standard covers robots meta tags (where inclusion is the default) too. I think dogmatism is not helpful when actual needs require evolution.
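
Just to illustrate how exclusion and inclusion already coexist in a plain robots.txt file, here's a tiny sketch (the paths are made up):

    User-agent: *
    Disallow: /private/
    Allow: /private/whitepaper.html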

So yes, the opportunity to reference sitemaps in robots.txt is a good thing, but certainly not enough. It simplifies the process: autodiscovery of sitemaps eliminates a few points of failure. Webmasters no longer need to monitor which engines have recently implemented the sitemaps protocol and submit accordingly. They can just add a single line to their robots.txt file and the engines will do their job. Fire and forget is a good concept. However, the good news comes with pitfalls.
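
For illustration, here is a minimal sketch of such a robots.txt, with example.com standing in for a real domain; the Sitemap: directive is independent of the User-agent sections and can sit anywhere in the file:

    User-agent: *
    Disallow:

    Sitemap: http://www.example.com/sitemap.xml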


But is this good thing actually good for everyone? Not really. Many publishers have no control over their server's robots.txt file, for example publishers using signup-and-instantly-start-blogging services or free hosts. Even when these platforms generate RSS feeds or other URL lists suitable as sitemaps, those publishers still must submit them to every search engine manually. Enhancing sitemap autodiscovery by looking at page meta data would be great: <meta name="sitemap" content="http://www.example.com/sitemap.xml" /> or <link rel="sitemap" type="application/rss+xml" href="http://www.example.com/sitefeed.rss" /> would suffice.
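
Just to be clear, that's my wish list, not anything the engines support today. Such page level autodiscovery could look like this in a template's head section (example.com is a placeholder):

    <head>
      <title>My blog on a free host</title>
      <!-- hypothetical sitemap autodiscovery, not (yet) supported by any engine -->
      <link rel="sitemap" type="application/rss+xml" href="http://www.example.com/sitefeed.rss" />
    </head>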


So much for the publishers who are explicitly locked out. Others are barred from sitemap autodiscovery by a lack of experience, technical skills, or a manageable environment, for example at way too restrictive hosting services. Example: the prerequisites for sitemap autodiscovery include the ability to fix canonical issues. An XML sitemap containing www.domain.tld URLs, referenced as Sitemap: http://www.domain.tld/sitemap.xml in http://domain.tld/robots.txt, is plain invalid. Crawlers following links without the "www" subdomain will request the robots.txt file without the "www" prefix. If a webmaster running this flawed but very common setup relies on sitemap autodiscovery, s/he will miss out on feedback and error alerts. On some misconfigured servers this may even lead to deindexing of all pages with relative internal links.
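
Fixing the canonical issue itself usually boils down to redirecting one hostname to the other. On an Apache server with mod_rewrite, a sketch along these lines (assuming www.example.com is the preferred host) does the trick in .htaccess:

    RewriteEngine On
    # redirect the bare domain to the www host with a permanent (301) redirect
    RewriteCond %{HTTP_HOST} ^example\.com$ [NC]
    RewriteRule ^(.*)$ http://www.example.com/$1 [R=301,L]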

Hence please listen to Vanessa Fox: webmasters should still register their autodiscovered sitemaps at Webmaster Central and Site Explorer to get alerted to errors which an XML sitemap validator cannot spot, and to monitor the crawling process!


I doubt that many SEO professionals and highly skilled webmasters managing complex sites will make use of the new feature. They prefer to have things under control, and automated third-party polls are hard to manipulate. Probably they want to maintain different sitemaps per engine to steer each engine's crawling accordingly. Although this can be accomplished with user agent based delivery of robots.txt, that additional complexity doesn't make the submission process easier to handle. Only über-geeks automate everything ;)
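
For the record, user agent based delivery of robots.txt can be sketched with mod_rewrite as well; the per-engine file names below are made up:

    RewriteEngine On
    # serve a tailored robots.txt to Google's and Yahoo's crawlers
    RewriteCond %{HTTP_USER_AGENT} Googlebot [NC]
    RewriteRule ^robots\.txt$ /robots-googlebot.txt [L]
    RewriteCond %{HTTP_USER_AGENT} Slurp [NC]
    RewriteRule ^robots\.txt$ /robots-slurp.txt [L]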

For example, it makes no sense to present a gazillion image or video clip URLs to a search engine that indexes textual content only. Google makes handling different content types extremely simple for the site owner: one can put HTML pages, images, movies, PDFs, feeds, office documents, and whatever else all in one sitemap, and Google's sophisticated crawling process delivers each URL to the indexer it belongs to. We don't know (yet) how other engines will handle that.
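
A sketch of such a mixed sitemap (with made-up URLs) is perfectly valid under the sitemaps protocol; Google sorts the content types out on its end:

    <?xml version="1.0" encoding="UTF-8"?>
    <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
      <url><loc>http://www.example.com/article.html</loc></url>
      <url><loc>http://www.example.com/whitepaper.pdf</loc></url>
      <url><loc>http://www.example.com/images/photo.jpg</loc></url>
      <url><loc>http://www.example.com/videos/clip.mov</loc></url>
    </urlset>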

Also, XML sitemaps are a neat instrument to improve crawling and indexing of particular content. One search engine may nicely index insufficiently linked stuff, whilst another fails to discover pages buried more than two link levels deep and badly needs the hints from a sitemap. There are more good reasons to give each engine its own sitemap.

Last but not least, there might be good reasons not to announce sitemap contents to the competition.


