Friday, May 25, 2007

Searchable Ajax with JavaScript controllers and a headless Gecko

I'm back from a much appreciated vacation, and just wanted to jot down some thoughts that I want to expand on later. This is a continuation of how I want to build web applications on the RIA-end of the web scale, but an approach I could use for just about any data-driven web site.

I only want to use the server to serve files and to implement data storage. There should be no server-side work dealing with the UI controller. This means the "application server" part of the server (the PHP, Java, .Net, RoR part) should only be handling the API to save/modify/get the data. It should not be doing any presentation or UI flow. JavaScript, CSS and HTML should be used for the presentation and UI flow.

So, the app server implements the data API as a REST API. This gives a good boundary of what should be done in the app server. The data should use XML (preferrably ATOM) or JSON as the data markup. The only presentation part might be to apply something like an XSLT transform of the data if the GET request matches a browser or searchbot user agent. Alternatively, use the domain name of the GET request to know whether or not to apply the XSLT transform (sometimes it is nice to host the pure data API on a different domain for security and load reasons).

Progressive Enhancement should be used for adding the JavaScript controller to the page, so that hyperlinks to other resources can be followed by searchbots or non-JS enabled browsers. Editing of the resources via an HTML interface could be restricted to JavaScript-enabled browsers (mostly where a UI controller is needed anyway, and search bots only care about indexing GET resources anyway).

So that is a baseline. This baseline now cleanly separates the UI controller from the server side, and makes it less relevant what sort of app server technology you use. The UI controller is kept in the JavaScript realm, no more splitting it between JavaScript and [Struts, RoR, Django, etc...].

Now it is time to jump the shark:

Restrict the data API to its own domain, separate from the domain used for presenting the UI for the browser/search bots. For example, use api.domain.com for the data API and ui.domain.com for the browser/search bot interface.

The HTML files on ui.domain.com use Ajax-like techniques access the resources on api.domain.com to show the UI. The Ajax-based data loading would have to be a cross-domain ajax. If the API and UI domains share a common sub-domain, document.domain tricks could be used. Otherwise (and more interesting to me personally), JSONP or XMLHttpRequest IFrame Proxy could be used.

A nice benefit of this approach is that all of the UI pieces (HTML, JavaScript and CSS) are highly cacheable and servable from very different domains than the API domain.

So what about search bots? They cannot do the Ajax techniques, at least not yet. For them, in the web server config, route their user agents to a web server module that uses Gecko to fetch the page, and do the ajax loading. The HTML can send an event to Gecko (this custom, headless gecko can export a function, something like window.onAjaxFinished()) to tell it when it is done rendering. Then the Gecko module serializes the current state of the DOM and sends that HTML string back to the search bot. It could even keep a local cache if it liked, if the resources did not change that often.

This approach would allow the search bot to get a good representation of the page for their indexing, and since Progressive Enhancement was used to bind to resource hyperlinks in the document, the search bot can still navigate to other resources on the domain (and those requests would be funneled through the web server's headless Gecko module).

In this model, REST via HTTP becomes the new JDBC. And more importantly to someone like me who enjoys the front-end -- I don't need to learn about any of the app server flavors of the month to implement data access (the other app server-hugging developers on my team can do that).

So now I just need to do a custom build of a headless Gecko that I can use in a web server module. Any pointers are appreciated. Ideally someone should do this as an open source project so everyone can benefit.