Sebastian's Pamphlets

If you've read my articles somewhere on the Internet, expect something different here.

MOVED TO SEBASTIANS-PAMPHLETS.COM

Please click the link above to read actual posts; this archive will disappear soon!

Stay tuned...

Tuesday, April 24, 2007

Erol to ship a Patch Fixing Google Troubles

Background: read these four posts on Google penalizing or deindexing e-commerce sites. Long story short: recently Google's enhanced algos began to deindex e-commerce sites powered by Erol's shopping cart software. The shopping cart maintains a static HTML file which redirects user agents executing JavaScript to another URL. This happens with each and every page, so it's quite understandable that Ms. Googlebot was not amused. I got involved when a few worried store owners asked for help in Google's Webmaster Forum. After lots of threads and posts on the subject, Erol's managing director got in touch with me, and we agreed to team up to find a solution to help the store owners suffering from a huge traffic loss. Here's my report of the first technical round.

Understanding how Erol 4.x (and all prior versions) works:

The software generates an HTML page offline, which functions as an XML-like content source (called "x-page"; I use that term because all Erol customers are familiar with it). The "x-page" gets uploaded to the server and is crawlable, but not really viewable. Requested by a robot, it responds with 200-Ok. Requested by a human, it does a JavaScript redirect to a complex frameset, which loads the "x-page" and visualizes its contents. The frameset responds to browsers if called directly, but returns a 404-NotFound error to robots. Example:

"x-page": x999.html
Frameset: erol.html#999x0&&

To view the source of the "x-page" disable JavaScript before you click the link.

Understanding how search engines handle Erol's pages:

There are two major weak points with regard to crawling and indexing: the crawlable page redirects, and the destination does not exist if requested by a crawler. This leads to these scenarios:

  1. A search engine ignoring JavaScript on crawled pages fetches the "x-page" and indexes it. That's the default behavior of yesterday's crawlers, and it still works this way at several search engines.

  2. A search engine not executing JavaScript on crawled pages fetches the "x-page", analyzes the client-side script, and discovers the redirect (please note that a search engine crawler may change its behavior, so this can happen all of a sudden to properly indexed pages!). Possible consequences:

    • It tries to fetch the destination, gets the 404 response multiple times, and eventually deindexes the "x-page". That means that, depending on the crawling frequency and depth per domain, the pages disappear quite fast or rather slowly until the last page is phased out. Google would keep a copy in the supplemental index for a while, but this listing cannot return to the main index.

    • It's trained to consider the unconditional JavaScript redirect "sneaky" and flags the URL accordingly. This can result in temporary as well as permanent deindexing.

  3. A search engine executing JavaScript on crawled pages fetches the "x-page", performs the redirect (thus ignoring the contents of the "x-page"), and renders the frameset for indexing. Chances are it gives up on the complexity of the nested frames, indexes the noframes tag of the frameset and perhaps a few snippets from subframes, considers the whole conglomerate thin, hence assigns the lowest possible priority for the query engine and moves on.

Unfortunately the search engine delivering the most traffic began to improve its crawling and indexing, hence many sites formerly receiving a fair amount of Google traffic began to suffer from scenario 2 -- deindexing.

Outlining a possible workaround to get the deleted pages back in the search index:

In six months or so Erol will ship version 5 of its shopping cart, and this software dumps frames, JavaScript redirects and ugly stuff like that in favor of clean XHTML and CSS. By the way, Erol has asked me for my input on their new version, so you can bet it will be search engine friendly. So what can we do in the meantime to help legions of store owners running version 4 and below?

We've got the static "x-page", which should not get indexed because it redirects, and which cannot be changed to serve the contents itself. The frameset cannot be indexed because it doesn't exist for robots, and even if a crawler could eat it, it's not exactly easy-to-digest spider fodder.

Let's look at Google's guidelines, which are the strictest around, thus applicable for other engines as well:
  1. Don't [...] present different content to search engines than you display to users, which is commonly referred to as "cloaking."

  2. Don't employ cloaking or sneaky redirects.

If we find a way to suppress the JavaScript code on the "x-page" when a crawler requests it, the now more sophisticated crawlers will handle the "x-page" like their predecessors did, that is, they will fetch the "x-pages" and hand them over to the indexer without vicious remarks. Serving identical content under different URLs to users and crawlers does not contradict the first rule. And we'd comply with the second rule, because loading a frameset for human visitors but not for crawlers is definitely not sneaky.

Ok, now how do we tell the static page that it has to behave dynamically, that is, output different contents server-side depending on the user agent's name? Well, Erol's desktop software, which generates the HTML, can easily insert PHP tags too. The browser would not execute those on a local machine, but who cares, as long as it works after the upload to the server. Here's the procedure for Apache servers:

In the root's .htaccess file we enable PHP parsing of .html files:
AddType application/x-httpd-php .html
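Note that whether AddType alone triggers PHP parsing of .html files depends on the host's Apache/PHP setup; on some configurations an AddHandler line is needed instead. A hedged alternative for the .htaccess file, to be verified against the actual server:

```
# Some Apache/mod_php setups want AddHandler rather than AddType
# to route .html files through the PHP interpreter:
AddHandler application/x-httpd-php .html
```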

Next we create a PHP include file xinc.php which prevents crawlers from reading the offending JavaScript code:
<?php
$crawlerUAs = array("Googlebot", "Slurp", "MSNbot", "teoma", "Scooter", "Mercator", "FAST");
$isSpider = FALSE;
$userAgent = getenv("HTTP_USER_AGENT");
foreach ($crawlerUAs as $crawlerUA) {
    if (stristr($userAgent, $crawlerUA)) $isSpider = TRUE;
}
if (!$isSpider) {
    print "<script type=\"text/javascript\"> [a whole bunch of JS code] </script>\n";
}
if ($isSpider) {
    print "<!-- Dear search engine staff: we've suppressed the JavaScript code redirecting browsers to \"erol.html\", a frameset that serves this page's contents in a form more pleasant for human eyes. -->\n";
}
?>


Erol's HTML generator now puts <?php @include("xinc.php"); ?> instead of a whole bunch of JavaScript code.
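For testing the detection logic outside a live store, the user agent check from xinc.php can be pulled into a standalone function. A minimal sketch; the function name isCrawler is mine, not part of Erol's code:

```php
<?php
// Hypothetical refactor of xinc.php's user agent check into a
// testable function (names are illustrative, not Erol's actual code).
function isCrawler($userAgent) {
    $crawlerUAs = array("Googlebot", "Slurp", "MSNbot", "teoma",
                        "Scooter", "Mercator", "FAST");
    foreach ($crawlerUAs as $crawlerUA) {
        // stristr() matches case-insensitively anywhere in the UA string
        if (stristr($userAgent, $crawlerUA)) return TRUE;
    }
    return FALSE;
}

// Crawlers get the HTML comment; browsers keep the redirect script.
assert(isCrawler("Mozilla/5.0 (compatible; Googlebot/2.1)") === TRUE);
assert(isCrawler("Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)") === FALSE);
?>
```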

The implementation for other environments is quite similar. If PHP is not available, we can do it with SSI and Perl. On Windows we can tell IIS to process all .html extensions as ASP (App Mappings) and use an ASP include. That gives three versions of the patch, which should help 99% of all Erol customers until they can upgrade to version 5.
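For the SSI flavor, the generated page (served with SSI enabled for .html, or as .shtml) would pull in the output of a small CGI script mirroring the PHP logic above. The script name below is hypothetical:

```
<!--#include virtual="/cgi-bin/xinc.pl" -->
```

The Perl script would print the JavaScript redirect for browsers and the explanatory comment for crawlers, exactly like the PHP include does.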

This solution comes with two disadvantages. First, the cached page copies, clickable from the SERPs and toolbars, would render pretty ugly because they lack the JavaScript code. Second, automated tools searching for deceitful cloaking might red-flag the URLs for a human review. Hopefully the search engine executioner reading the comment in the source code will be fine with it and give it a go. If not, there's still the reinclusion request. I think store owners can live with that when they get their Google traffic back.

Rolling out the patch:

Erol thinks the above makes sense, and there is a chance of implementing it soon. While the developers are at work, please provide feedback if you think we didn't interpret Google's Webmaster Guidelines strictly enough. Keep in mind that this is an interim solution and that the new version will handle things in a more standardized way. Thanks.




Paid-Links-Disclosure: I do this pro bono job for the sake of the suffering store owners. Hence the links pointing to Erol and Erol's customers are not nofollow'ed. Not that I'd nofollow them otherwise ;)



2 Comments:

  • At Tuesday, April 24, 2007, Anonymous Anonymous said…

    Nice analysis, thanks. And another case of UA cloaking required to get along with search engines, and search engines requiring human judgement to check on quality. And the lesson, of course, re-learned: frames are bad. Which means AJAX is bad. Or not. Hoo boy.

  • At Tuesday, April 24, 2007, Blogger Sebastian said…

    Thanks for the compliment :)
    As for frames, well that's an evil technology, so use equals abuse. AJAX on the other hand is a pretty cool technology, very useful, but easy to abuse.

