Tuesday, March 26, 2013

Backbone.JS and SEO: Google Ajax Crawling Scheme


Most search engines hate client-side MVC, but luckily there are a few tools around to get your client-side routes indexed.

As most web bots (e.g. Googlebot) don't interpret JavaScript on the fly, they fail to index JavaScript-rendered content. To overcome this, Google (and now Bing) support the 'Google Ajax Crawling Scheme' (https://developers.google.com/webmasters/ajax-crawling/docs/getting-started), which basically states that if you want JS-rendered DOM content to be indexed (i.e. the results of ajax calls rendered into the page), you must be able to:
  1. Trigger a page state (JavaScript rendering) via the URL using a hashbang fragment, #! (e.g. http://www.mysite.com/#!my-state), and
  2. Serve a rendered DOM snapshot of your site, after JavaScript has modified it, on request (crawlers ask for an 'escaped' form of the URL; see the sketch after this list).
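Under the scheme, a crawler that sees a #! URL requests an escaped version of it instead, moving the fragment into an _escaped_fragment_ query parameter. The mapping is mechanical in both directions; here's a minimal sketch of reversing it server-side (the helper name is mine, not part of any library):

require 'uri'
require 'cgi'

# http://www.mysite.com/#!my-state  is requested by a crawler as
# http://www.mysite.com/?_escaped_fragment_=my-state
def unescaped_fragment_url(crawler_url)
  uri = URI.parse(crawler_url)
  fragment = CGI.parse(uri.query.to_s)['_escaped_fragment_'].first
  return crawler_url if fragment.nil? # not a scheme request - leave untouched

  uri.query = nil
  "#{uri}#!#{CGI.unescape(fragment)}"
end

puts unescaped_fragment_url('http://www.mysite.com/?_escaped_fragment_=my-state')
# => http://www.mysite.com/#!my-state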
If you're using a client-side MVC framework like Backbone.js, or simply have a JavaScript-heavy page, and wish to get its various states indexed, you'll need to provide this DOM snapshotting service server-side. Typically this is done using a headless browser (e.g. Qt WebKit, PhantomJS, Zombie.js, HtmlUnit).

For those using Ruby server-side, there's a gem that already handles this, google_ajax_crawler, available on RubyGems:

gem install google_ajax_crawler


It's used as Rack middleware: it intercepts a request made by a web bot adhering to the scheme, scrapes your site server-side, then delivers the rendered DOM back to the requesting bot as a snapshot.
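Conceptually, middleware like this boils down to a fork on the _escaped_fragment_ parameter. A rough sketch of the idea, purely illustrative (SnapshotMiddleware and render_snapshot are hypothetical names, not the gem's actual source or API):

require 'rack'

class SnapshotMiddleware
  def initialize(app)
    @app = app
  end

  def call(env)
    request = Rack::Request.new(env)
    if request.params.key?('_escaped_fragment_')
      # a bot following the scheme - render the page in a headless browser
      # and return the post-javascript DOM as plain html
      [200, { 'Content-Type' => 'text/html' }, [render_snapshot(request.url)]]
    else
      # a normal visitor - pass the request through untouched
      @app.call(env)
    end
  end
end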

A simple Rack app (config.ru) demonstrating how to configure and use the gem:

#
# to run:
#   $ rackup config.ru -p 3000
# then open a browser at http://localhost:3000/#!test
#
require 'bundler/setup'
require 'google_ajax_crawler'

use GoogleAjaxCrawler::Crawler do |config|
  config.driver = GoogleAjaxCrawler::Drivers::CapybaraWebkit
  config.poll_interval = 0.25 # how often to check if the page has loaded

  #
  # for the demo, the page is considered loaded when the loading mask has been
  # removed from the DOM - this could instead evaluate something like
  # $.active == 0 to ensure no jquery ajax calls are pending
  #
  config.page_loaded_test = lambda do |driver|
    driver.page.evaluate_script('document.getElementById("loading") == null')
  end
end

# a sample page using #! url fragments to seed page state
page_content = File.read('./page.html')
run lambda { |env| [200, { 'Content-Type' => 'text/html' }, [page_content]] }
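As the comment in the config notes, the loaded test is just a lambda, so it can be swapped for whatever readiness signal your page exposes. For a jQuery-heavy page, something along these lines (assuming jQuery is actually present on the page) would do:

config.page_loaded_test = lambda do |driver|
  # consider the page loaded once no jquery ajax requests remain in flight
  driver.page.evaluate_script('window.jQuery != null && jQuery.active == 0')
end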
And the sample page (page.html) served by the demo:
<!--
  To view this page's markup as a search engine would see it without the
  google_ajax_crawler gem, open it in a browser and view source.
  To see how the google_ajax_crawler gem delivers a rendered snapshot of the
  page, open /?_escaped_fragment_=test
-->
<html>
  <head></head>
  <body>
    <h1>A Simple State Test</h1>

    <!-- the url fragment (i.e. /#!something) will be rendered via js in the span -->
    <p>State: <span id='page_state'></span></p>

    <!-- will be removed by js on page load -->
    <div class='loading' id='loading'>Loading....</div>

    <script type='text/javascript'>
      var init = function() {
        var writeHash = function() {
          document.getElementById('page_state').innerHTML = "Javascript rendering complete for client-side route " + document.location.hash;

          // removing the loading mask signals (via page_loaded_test) that rendering is done
          var loadingMask = document.getElementById('loading');
          if (loadingMask) loadingMask.parentNode.removeChild(loadingMask);
          console.log('done...');
        };

        window.addEventListener("hashchange", writeHash, false);
        setTimeout(writeHash, 500);
      };

      //
      // only execute the js when the page is loaded via an unescaped (#!) url
      //
      if (/#.*$/.test(document.location.href)) init();
    </script>
  </body>
</html>
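To test the round trip locally, open http://localhost:3000/#!test in a browser to watch the client-side rendering happen, then request http://localhost:3000/?_escaped_fragment_=test to see the snapshot the middleware hands back to a crawler: the same markup, but with the state span populated and the loading mask already gone.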