Blog
Posted Sunday, November 01, 2009 06:59:05 PM by dfe

The other week I decided to write some blogging software and came up with WOBlog. WOBlog is rather different from most WebObjects apps as it actually has "friendly" URLs. And I don't mean friendly in the sense that it is using the .../wa/foo/bar/ direct actions but friendly in the sense that if I didn't tell you and you didn't look at the HTTP headers you'd probably never know that I'm using WebObjects.

WebObjects is vastly different from standard web server application software. In a typical IIS/ASP.NET, Apache/PHP, or Apache/Perl configuration the webserver software is responsible for resolving URLs to filesystem paths. Before we detail how WebObjects handles URLs we should first cover how an ordinary webserver does it.

Static Content in a Document Root

The simplest web server program is one that resolves an incoming URL to a file path on disk then sends the client the exact contents of that file. So if the client requests http://example.com/foobar.html the server looks for /var/www/html/foobar.html and returns its contents. There are still specialty webservers that do only this but the vast majority of them have several facilities for causing other code on the server to be executed in response to an HTTP request.
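As a rough illustration of that lookup, here is a minimal Python sketch. The function name resolve_static is made up for this example, and the document root is simply the /var/www/html path used above; a real server does considerably more (index files, MIME types, permission checks).

```python
import posixpath
from pathlib import PurePosixPath
from urllib.parse import unquote, urlsplit

# Assumed document root, matching the example path in the text.
DOCUMENT_ROOT = PurePosixPath("/var/www/html")


def resolve_static(url, root=DOCUMENT_ROOT):
    """Map a request URL to the file path a static server would read."""
    url_path = unquote(urlsplit(url).path)
    # Collapse "." and ".." segments so the result stays under the root.
    relative = posixpath.normpath("/" + url_path).lstrip("/")
    return root / relative


print(resolve_static("http://example.com/foobar.html"))
```

The server would then open that path and send the bytes back unchanged.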

CGI in a Document Root

The easiest example of dynamic content to understand is a CGI script. The webserver sees http://example.com/foobar.cgi and as before resolves this to a file on disk, /var/www/html/foobar.cgi. Instead of serving the file as-is, the webserver software is configured to execute any file with the extension .cgi by simply running it as a program. On a UNIX system the extension doesn't matter to the OS so the foobar.cgi program can be anything. It might be compiled C or C++ or even FORTRAN (I'm sure that someone has done this, if only as a joke). On many early websites, executing CGI was nearly synonymous with executing a Perl script.
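To make "running them as programs" concrete, here is a small Python sketch of what the server side of a CGI request roughly looks like: start a new process, describe the request through environment variables, and send whatever the program prints back to the client. The run_cgi function and STANDIN_PROGRAM are invented for this example, and a one-line Python program stands in for foobar.cgi so the sketch is self-contained. The variable names REQUEST_METHOD, SCRIPT_NAME, QUERY_STRING, and GATEWAY_INTERFACE are standard CGI ones.

```python
import os
import subprocess
import sys

# Stand-in for foobar.cgi: prints a header, a blank line, then a body.
STANDIN_PROGRAM = [
    sys.executable,
    "-c",
    "import os; print('Content-Type: text/plain'); print(); "
    "print('You asked for ' + os.environ['SCRIPT_NAME'])",
]


def run_cgi(command, script_name, query_string=""):
    """Run a CGI program in a new process and return its raw output."""
    env = dict(
        os.environ,
        GATEWAY_INTERFACE="CGI/1.1",
        REQUEST_METHOD="GET",
        SCRIPT_NAME=script_name,
        QUERY_STRING=query_string,
    )
    # A fresh process is started for every single request.
    result = subprocess.run(command, env=env, capture_output=True, check=True)
    return result.stdout


output = run_cgi(STANDIN_PROGRAM, "/foobar.cgi")
print(output.decode())
```

Note that the server neither knows nor cares what language the program was written in; it only sees a process and its output.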

On many webservers you could include additional components in the URL such as http://example.com/foobar.cgi/baz. Because a file on the filesystem cannot contain subfiles there is no ambiguity; the "baz" portion of that URL cannot refer to anything on the filesystem. With this scheme one could have a whole bunch of purely dynamic content all served behind the foobar.cgi program. That said, having periods in directory names usually looks odd to most people. Also having the webserver configured to directly execute any content with the extension .cgi and the executable flag set is seen by some administrators as a security risk because anyone with write access to the web site's document root could put any arbitrary program in the directory and cause it to be executed in the context of the server.
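Here is a hedged sketch of how a server might split such a URL. It walks the path one component at a time and stops at the first component that names a real file; everything after that is handed to the program as extra path information (CGI calls the two pieces SCRIPT_NAME and PATH_INFO). The function name and the set of known files are assumptions for the example; a real server would check the filesystem instead.

```python
from typing import Set, Tuple


def split_script_path(url_path: str, files: Set[str]) -> Tuple[str, str]:
    """Split a URL path into (script, extra path info)."""
    parts = [p for p in url_path.split("/") if p]
    candidate = ""
    for i, part in enumerate(parts):
        candidate += "/" + part
        if candidate in files:
            # A file cannot contain subfiles, so the rest is not a file path.
            rest = "/".join(parts[i + 1:])
            return candidate, ("/" + rest if rest else "")
    raise LookupError("no script found in " + url_path)


known_files = {"/foobar.cgi"}  # hypothetical contents of the document root
print(split_script_path("/foobar.cgi/baz", known_files))
```

The program behind foobar.cgi is then free to interpret "/baz" however it likes.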

CGI in its own directory

In response to these problems one alternative is to treat the web's document root (located at, say, /var/www/html/ on the filesystem) as containing only pure content. An additional directory such as /var/www/cgi-bin/ is then added to the webserver's configuration such that any URLs beginning with /cgi-bin/ go to that directory instead. This has the effect of hiding /var/www/html/cgi-bin/ since every such request is mapped to the other folder. In this scheme nothing in /var/www/html/ will ever be executed by the webserver. Thus the administrator can safely give any number of users access to change the static content portions of the site but only give trusted administrators rights to put new cgi-bin programs on the site.

As an added bonus, every file in cgi-bin that is executable by the system is considered executable by the webserver. So we can now have /cgi-bin/foobar instead of /foobar.cgi. Also, as above with http://example.com/foobar.cgi/baz we can have http://example.com/cgi-bin/foobar/baz and the /cgi-bin/foobar file will still be executed. This scheme offers a clue as to why WebObjects uses /cgi-bin/WebObjects/ at the start of its URLs.
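The decision the server makes under this scheme can be sketched in a few lines of Python. The route function and its return shape are invented for illustration, and the two directories are the example paths from the text; an actual server expresses this in its configuration rather than in code like this.

```python
from pathlib import PurePosixPath

# Example locations from the text; both are assumptions about the setup.
DOCUMENT_ROOT = PurePosixPath("/var/www/html")
CGI_ROOT = PurePosixPath("/var/www/cgi-bin")
CGI_PREFIX = "/cgi-bin/"


def route(url_path):
    """Return (action, file path, extra path info) for a request path."""
    if url_path.startswith(CGI_PREFIX):
        # The first component after the prefix names the program;
        # anything after it is passed along as extra path information.
        program, _, extra = url_path[len(CGI_PREFIX):].partition("/")
        return ("execute", CGI_ROOT / program, ("/" + extra) if extra else "")
    # Everything else is static content and is never executed.
    return ("serve", DOCUMENT_ROOT / url_path.lstrip("/"), "")


print(route("/cgi-bin/foobar/baz"))
print(route("/foobar.html"))
```

With WebObjects in place of foobar, the first call has the same shape as a /cgi-bin/WebObjects/... URL.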

Web Server Modules

At some point the fork/exec of a new process needed to run the script becomes a fair amount of overhead for the server so CGI has been mostly replaced by in-process modules. The most common of these for Apache is mod_php although there are many more. In this case the Apache module contains the interpreter code in an initialized state. The Apache server software is instructed to send any requests for a .php file to the built-in mod_php handler. The Apache software is still responsible for resolving the request URL to an on-disk file and passing the request URL, other server information, and the filename to the (now built in) interpreter.

In addition, because the mod_php (or other) interpreter stays loaded into the webserver process it becomes possible to maintain application state in process memory. With traditional CGI one would have to write the state to disk, a database, or some other form of persistent storage. Depending on the module, the persistence of state can be confined per site or even to particular directories within a site. So http://example.com/*.php and http://example.com/foo/*.php might share in-process data but http://example.com/bar/*.php might have its own separate set of state.

Windows with IIS and ASP.NET has a similar setup. The web server software is configured to use an ASP.NET application to handle any files (note: files, not URLs) with an aspx extension. The administrator then configures IIS such that all .aspx files for the http://example.com/ site are to be run by the ASP.NET application contained in that site's document root. One can also configure a subapplication (sometimes known as subweb) such that http://example.com/bar/ goes to a different folder with a different ASP.NET application.

In any case, URLs which don't end in one of the registered extensions (.aspx, .ashx, and others) do not get sent to ASP.NET. So all other content on the site like .jpg, .png, .js, and .css files is served directly by IIS. Also, IIS takes care to deny requests for any .aspx.vb or .aspx.cs files, the web.config file, and a few other files. Often the web.config file and/or the code will contain passwords for database servers so this is generally a good thing.

It should be noted that Global.asax in the root directory of the configured application can contain an override method to alter the processing of the request. But for the request to make it there in the first place IIS must have already known to send it to the ASP.NET application which generally means there must be an .aspx file on disk in the appropriate place.

Decoupling the URL from the implementation file

In recent years it has come to be considered somewhat uncouth to expose the implementation details of your site to the user in the form of direct file paths to the .aspx or .php files. It is much nicer for the user if he can see http://example.com/people/johndoe instead of http://example.com/LookupPerson.aspx?name=johndoe.

Enter URL rewriting. Rewriting is the process of turning something like http://example.com/tags/foobar into http://example.com/showtag.aspx?tagname=foobar. URL Rewriting generally occurs as a function of the webserver code. For Apache this is mod_rewrite which uses rules in .htaccess files or in certain places in the httpd.conf file. There are also modules for IIS which, like mod_rewrite, look for regular expressions in web.config and rewrite the URL to something that can be resolved to a file on disk.
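The idea behind rewriting can be shown with a small Python sketch. This is not mod_rewrite or web.config syntax; the rule table and the rewrite function are made up for the example, using the two URL pairs mentioned above.

```python
import re

# Hypothetical rule table: (pattern for the friendly URL, real target).
REWRITE_RULES = [
    (re.compile(r"^/tags/([^/]+)$"), r"/showtag.aspx?tagname=\1"),
    (re.compile(r"^/people/([^/]+)$"), r"/LookupPerson.aspx?name=\1"),
]


def rewrite(url_path):
    """Return the rewritten path, or the path unchanged if no rule matches."""
    for pattern, replacement in REWRITE_RULES:
        if pattern.match(url_path):
            return pattern.sub(replacement, url_path)
    return url_path


print(rewrite("/tags/foobar"))
print(rewrite("/people/johndoe"))
```

Whatever comes out of the rewrite step still has to name a file the server can find on disk.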

The main drawback, in this author's opinion, is that you still must have some .aspx file or .php file on disk. The rewriting occurs inside of IIS or Apache and ultimately results in it doing the same old dance of loading and executing a file or telling the in-process handler to process the file. There is no opportunity to execute other code before the request reaches the page because for the request to reach code it must first reach a page.

URL handlers within the application

Things are different with IIS 7 and its "integrated pipeline" where the ASP.NET code gets a shot at the URL early. But the integrated pipeline is only a first step towards decoupling URLs from the filesystem. The MVC feature actually begins to allow ASP.NET to load an arbitrary page based on an arbitrary URL. But like URL rewriting you have to register which particular URLs (by regex) will show which particular pages. Unlike URL rewriting you do get an opportunity to run some code and decide to return something else besides the page. But you cannot decide to return some arbitrarily different page.
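For contrast with rewriting, here is a generic Python sketch of URL handling inside the application. It is not the ASP.NET MVC API; the route decorator, dispatch function, and show_person handler are all invented for illustration. The point is only that the URL pattern maps to code rather than to a file, and that the code gets to decide what to return.

```python
import re

ROUTES = []  # (compiled pattern, handler) pairs, in registration order


def route(pattern):
    """Register the decorated function as the handler for a URL regex."""
    def register(handler):
        ROUTES.append((re.compile(pattern), handler))
        return handler
    return register


@route(r"^/people/(?P<name>[^/]+)$")
def show_person(name):
    # The handler runs code first and may return something other than the page.
    if name == "nobody":
        return "404 Not Found"
    return "page for " + name


def dispatch(url_path):
    """Find the first matching route and call its handler."""
    for pattern, handler in ROUTES:
        match = pattern.match(url_path)
        if match:
            return handler(**match.groupdict())
    return "404 Not Found"


print(dispatch("/people/johndoe"))
```

No file named after the URL has to exist anywhere on disk for this to work.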

In part 2 we'll explore how WebObjects handles URLs.