
2013-04-20 17:28

How to make your Single Page Application Node.js website using URL hash routing more search engine friendly

I love Single Page Applications. Even though I know they have some flaws (such as the performance problems that recently made Twitter roll back their hash-bang solution), I really like that they enable the developer to create really fluid, user-friendly websites.

One of the more obvious challenges with SPAs is that they are not really search engine optimized. Since your website's content is most likely generated or added to the page on the fly with JavaScript, search engines have problems crawling and extracting information from it (search engine crawlers don't usually execute JavaScript when fetching a site's contents).

However, Google themselves have published some advice concerning this problem. One of their suggestions is a snapshot technique, which I am going to briefly demonstrate in this guide.

But let's start from the beginning.

Here is my Single Page Application website. It generates its content with JavaScript and is hence not currently very search engine friendly.

The Node.js webserver:

var express = require( "express" );

var app = express();

app.use( express.static( __dirname + '/public' ) );

app.listen( 8080 );

console.log( "Webserver started." );

and my single index.html file:

<!doctype html>

<html>
   <head>
       <script src="http://code.jquery.com/jquery-1.9.1.min.js"></script>
       <script type="text/javascript">

           function checkURL() {

               var myRegexp = /\/user\/(\w+)/;
               var match = myRegexp.exec(document.URL);

               if( match !== null ) {

                   $("body").html( "<p>" + match[1] + " has two cute cats!<p>" );

               }

           }

           $(document).ready(function () {

               checkURL();

           });

       </script>

   </head>

   <body onhashchange="checkURL();">

        <p><a href="/#/user/john/">You should really visit John's page.</a></p>

   </body>
</html>

So here I have a basic Single Page Application. If you visit http://localhost:8080/ you will see a simple page with a link on it - but if you visit http://localhost:8080/#/user/john/ you will learn that John has two cute cats.

The obvious problem here is that when Google crawls the URL http://localhost:8080/#/user/john/ it will not learn that John has two cute cats, since that content is generated by JavaScript (and the fragment part of the URL never even reaches the server). So now that the problem is identified, how do we solve it?
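
To see what such a crawler actually receives, you can fetch the page yourself without a browser. Below is a minimal sketch (the file name check_raw.js is just made up for this illustration) that assumes the webserver above is running on port 8080 and checks whether the generated cat content is present in the raw response:

// check_raw.js - fetches the page the way a simple, non-JavaScript crawler would.
// Note that the fragment (#/user/john/) is never sent to the server anyway,
// so the response is always the plain index.html.
var http = require( "http" );

http.get( "http://localhost:8080/", function( response ) {

    var body = "";

    response.on( "data", function( chunk ) { body += chunk; } );

    response.on( "end", function() {

        if ( body.indexOf( "two cute cats" ) === -1 ) {
            console.log( "No cat content in the raw HTML - the crawler only sees the empty shell." );
        } else {
            console.log( "Found the generated content." );
        }

    } );

} );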

Step 1 - adding the ! exclamation mark character

As suggested by Google, we add an exclamation mark (!) right after the hash character, turning it into the "hash bang" (#!) sequence that Google's AJAX crawling scheme recognizes.

So in our HTML page, we change the link "You should really visit John's page." so that it now points to:

http://localhost:8080/#!/user/john/

The reason we add this is that when Google finds links containing #!, it moves everything after the #! into a special _escaped_fragment_ query parameter when crawling the website. In other words, when the Google bot crawls our link it will actually fetch the contents of this URL instead:

http://localhost:8080/?_escaped_fragment_=/user/john/
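
Just to make the conversion concrete, here is a tiny sketch (not part of the website itself, and the function name is made up) of the mapping the crawler performs - the part after #! is URL-encoded and moved into the _escaped_fragment_ query parameter:

// Illustration of Google's #! to _escaped_fragment_ mapping.
function toEscapedFragmentUrl( hashBangUrl ) {

    var parts = hashBangUrl.split( "#!" );

    if ( parts.length < 2 ) {
        return hashBangUrl; // no hash bang, nothing to map
    }

    return parts[0] + "?_escaped_fragment_=" + encodeURIComponent( parts[1] );
}

console.log( toEscapedFragmentUrl( "http://localhost:8080/#!/user/john/" ) );
// -> http://localhost:8080/?_escaped_fragment_=%2Fuser%2Fjohn%2F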

However, our Node webserver does not do anything special with this query parameter yet - the static middleware will simply hand back the plain index.html again. We need to fix that.

Step 2 - Capturing the Google bot requests

Now we have to add special support for the requests performed by the Google bot. We do this by adding a small piece of middleware that looks for the _escaped_fragment_ query parameter. Note that it has to be registered before the static middleware, otherwise index.html would be served before our handler ever runs:

app.get( "/_escape_fragment_/*", function( request, response ) {

    response.writeHead( 200,
        {
            "Content-Type": "text/html; charset=UTF-8"
        } );

    response.end( "Hello Google bot!" );

} );

This will give the Google bot a "Hello Google bot!" greeting when it crawls this URL:

http://localhost:8080/#!/user/john/

(You can simulate the bot yourself by opening http://localhost:8080/?_escaped_fragment_=/user/john/ directly in your browser.)

Step 3 - Creating the snapshots

In order to show Google the actual contents of the URL (after JavaScript has generated the content, that is) we need to take a snapshot of the page and serve that instead, through our newly implemented request handler.

To achieve this we will use a headless browser - PhantomJS - through the phantomjs Node module.

This is my PhantomJS script, get_html.js (the script handed to PhantomJS with instructions on what to do):

var system = require( "system" );

var page = require( "webpage" ).create();

// The URL to snapshot is passed as the first command line argument.
var url = system.args[1];

page.open( url, function( status ) {

   // By now PhantomJS has executed the page's JavaScript, so we can grab
   // the fully generated markup.
   var pageContent = page.evaluate( function() {

       return document.getElementsByTagName( "html" )[0].innerHTML;

   } );

   console.log( pageContent );

   phantom.exit();

} );
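
If you want to try the script on its own (the file name test_snapshot.js is just made up), you can run it through the same phantomjs Node module that the webserver below uses - assuming the webserver is running and get_html.js is in the current directory:

// test_snapshot.js - renders a hash bang URL with PhantomJS and prints the result.
var childProcess = require( "child_process" );
var phantomjs = require( "phantomjs" );

childProcess.execFile(
    phantomjs.path,
    [ "get_html.js", "http://localhost:8080/#!/user/john/" ],
    function( err, stdout, stderr ) {

        // The fully rendered markup, cats and all.
        console.log( stdout );

    }
);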

And here is my new Node webserver:

var express = require( "express" );
var path = require( "path" );
var childProcess = require( "child_process" );
var phantomjs = require( "phantomjs" );
var binPath = phantomjs.path;
var app = express();

// Capture the Google bot's requests before the static middleware gets a
// chance to serve the plain index.html.
app.use( function( request, response, next ) {

    var fragment = request.query._escaped_fragment_;

    if ( fragment === undefined ) {
        return next();
    }

    var script = path.join( __dirname, "get_html.js" );

    // Rebuild the original hash bang URL so that PhantomJS renders the same
    // page a regular visitor would see.
    var url = "http://localhost:8080/#!" + fragment;

    var childArgs =
    [
        script, url
    ];

    childProcess.execFile( binPath, childArgs, function( err, stdout, stderr ) {

        response.writeHead( 200, {
            "Content-Type": "text/html; charset=UTF-8"
        } );

        response.end( "<!doctype html><html>" + stdout + "</html>" );

    } );

} );

app.use( express.static( __dirname + "/public" ) );

app.listen( 8080 );

console.log( "Webserver started." );

Wrapping everything up: when the Google bot now crawls http://localhost:8080/#!/user/john/, our server lets PhantomJS create a snapshot of the real page and delivers that snapshot to the search engine.

Future performance improvements

Please note that the example above is not really performance friendly, since it spawns a new PhantomJS request for every search engine request that comes in. There is plenty of room to improve performance, for example by caching the snapshots on disk or even in memory.
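
As a rough illustration of the in-memory variant, the snapshot middleware could keep the rendered HTML in a plain object keyed by the fragment. This is only a sketch - it assumes the app, path, childProcess and binPath variables from the webserver above, would replace its snapshot middleware, and has no cache expiry or size limit:

// A very naive in-memory snapshot cache.
var snapshotCache = {};

app.use( function( request, response, next ) {

    var fragment = request.query._escaped_fragment_;

    if ( fragment === undefined ) {
        return next();
    }

    // Serve a previously rendered snapshot straight from memory.
    if ( snapshotCache[ fragment ] !== undefined ) {
        response.writeHead( 200, { "Content-Type": "text/html; charset=UTF-8" } );
        return response.end( snapshotCache[ fragment ] );
    }

    var script = path.join( __dirname, "get_html.js" );
    var url = "http://localhost:8080/#!" + fragment;

    childProcess.execFile( binPath, [ script, url ], function( err, stdout, stderr ) {

        var html = "<!doctype html><html>" + stdout + "</html>";

        snapshotCache[ fragment ] = html;

        response.writeHead( 200, { "Content-Type": "text/html; charset=UTF-8" } );
        response.end( html );

    } );

} );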


