Ease of Installation: DokuWiki, PHP, files

In the past I’ve installed MediaWiki, ruwiki, git-wiki, and several other Wiki implementations (Perl, and Java implementations), with varying degrees of effort. For example, ruwiki required considerable gymnastics to get the right Ruby libraries in place on the machine I hosted it on, MediaWiki required a database, etc. Ruby libraries, databases, JVMs, and the like are all at the top of my toolbox so in most cases it’s just a few minutes and a few commands, which seems amply easy until compared wtih…

Yesterday I set up a DokuWiki instance (which stlruby.org might migrate to), and found that its underpinnings (PHP, plain text files) make for ridiculously easy installation:

  • wget
  • tar xzf
  • browse to install page, set a few settings
  • delete install page

Yet those underpinnings are very well suited to the task at hand. A typical Wiki does not need a database underneath it. As with many things, this reminded me of a general principle: use the least complex, most readily and commonly available, easiest to administrate technology appropriate for the task at hand.

Help! My Hierarchy is Slow – Faster Hierarchies with Nested Sets

A great many applications, including many that I’ve worked on, have a hierarchy of things: of parts, of people, of organizations, etc. The way most of us represent such hierarchy is with the first thing that generally comes to mind: make each Widget have a parent Widget, with a table like so:

create table widget (widget_id int, parent_widget_id int, other_fields_here);

This representation is called an “adjacency list”, and is simple and easy. You can readily build a tool to manipulate a hierarchy stored this way. Many off the shelf visual components, for both client-side and web applications, know how to manipulate hierarchies represented this way. Some reporting tools know how to report on hierarchies represented this way.

However, for answering common questions like “who all is under person X in the hierarchy”, the adjacency list approach is unwieldy and slow.

There are various other approaches to representing a hierarchy, most of them discussed in detail in Joe Celko’s articles and books, prominently in the book Trees and Hierarchies in SQL for Smarties. If you work with SQL and hierarchies, buy this book now.

One approach Celko is especially fond of is the “nested set” representation. You can read about it online here and here.

Of course, changing an entire application to use nested sets might be a very big deal in a mature application. That’s OK; in most cases we can get much of the benefit by building a nested-sets “cache” of the share of the hierarchy, with a table like so:

create table widget_hier_cache (widget_id int, left int, right int);

Each time the hierarchy has changed, or before each time we need to run complex queries, delete the rows in this cache table and repopulate them based on the current canonical adjacency data. Celko offers SQL code in his book to do that, which could be translated to work in the stored procedure language of the DBMS at hand. But what about DBMSs that don’t offer stored procedures, such as lightweight local databases (SQLite), MySQL, etc.? The translation must be done in application code instead.

I wrote such code in Delphi a while back, in the process of getting a full understanding of this problem; I’ve cleaned it up and now offer code for download here (DelphiAdjacencyNestedSets.zip), under an open source license (MIT license – use it all you want). I tested this today with Delphi 2007 Win32, but it should work fine at least back to Delphi 7. As far as I can tell with some searching, this is the only Delphi code for translating adjacency to nested sets available on the internet. This code doesn’t know about databases – it is a module to which you feed adjacency data, and from which you get back nested set data. It includes DUnit test cases.

I’ve also put the code on github, for easy browsing and forking.

(Update in August 2007: a new version (DelphiAdjacencyNestedSets2.zip) optionally tolerates “orphan” nodes and forests. Update in January 2008: a newer version (DelphiAdjacencyNestedSets3.zip) propagates an integer value down the hierarchy; it is named Tag as a nod to the .Tag property on VCL components. Both of these newer versions were tested with Delphi 2006/2007 also.)

The essence of the translation is a depth-first traversal of the hierarchy, and of course this can be easily implemented in other languages; the Delphi code is easy to understand, so don’t be afraid to take a look even if you need some Java or C# or PHP etc. I also stumbled across this PHP nested sets implementation, which offers a set of functions to maintain (insert, update, etc.) a hierarchy stored as nested sets, rather than only translate from adjacency to nested sets.

Another useful way to represent a hierarchy for fast querying is with a transitive closure table. I’ll write this up in a future post; it turns out to be especially useful (and necessary) to make arbitrary hierarchies work in the Mondrian OLAP server.

A/B Technique for Web Application Deployment

This description of my “A/B technique for web application deployment” was transcribed from audio, so it less tight, more verbose than my normal prose. I chose to post it in rough form, rather than leave it on the “back burner” until an unknown future date when I have time to rewrite it. I first explained this to a colleague around 1999, 8 years is long enough for an idea to wait.

The Problem / Context

At least a dozen times over the last decade, this scenario has come up at consulting client sites: you have a web application and you want to upgrade it with a new version. You could do so with a brute force cutover (stop the app, swap the code), but that’s not the scenario that I’m talking about. I’m talking about the upgrading to a new version, not quite compatible with the old one, without dropping current active users. For example, in the new version, you might have some different data that goes in the session. You might have some different pages, so you have some different URLs, you may be adding a new field so that once you put in this new version, there’ll be an additional field, and it stores that additional field in the session and in the database and the caching and so on. Yet you want the current users to keep working without interruption.

Solution

This technique is not language-specific – it applies equally in PHP, ASP, ASP.NET, Java servlets, JSP, CGI, etc.; with nearly any application or infrastructure.

Have the URL of your application, which I’ll call “/contact” here, be the URL of a proxy application (or “launching pad”). Then have two additional URLs for two specific instances of the actual application. For example, you might have “/contact” as your overall application URL and then “/contact/contactA” or “/contact/A” as one instance where you have that application installed.

At the “/contact” URL, install a simple proxy application, whose job is to take a newly arriving user, present them an intro/login screen, then redirect them to one of the specific instances of the application.

As a user I will point my browser to the “/contact” URL, and I bookmark that. I launch that, I see a login screen, I type in my name and password, I press the button to log in. The “/contact” launching pad redirects me to “/contact/A,”, an instance of real application. I’ll call the second instance “B”, perhaps at the URL “/contact/B”. In the normal steady state of the system, the user will be using A, they login to “/contact” and they end up in the “A” instance.

Then you want to do an upgrade, install a new version. Install the new version as “/contact/B.” Leave the existing application in place and working. (By the way, I’ve assumed you are using a technology where you can deploy and undeploy applications without bringing your web server down, but with mod_proxy you could make this work even without that capability) Deploy the new application version in a new and different path than what your current running users are on. Adjust some setting your proxy/launchpad application (perhaps as simple as a single line in a single config file). So, for example, you might install the new version as “/contact/B” and then you flip a switch (edit the config file) to make it so that new users that come to the “/contact” page don’t land in “/contact/A” anymore, they land in “/contact/B” as they login.

The current users already using the application in “/contact/A” stay there – they don’t know or care that you’ve deployed a new version. New users come in the come in the new version. So you want to have some sort of mechanism (likely provided by your application server if you’re using one, and not hard to build otherwise) for monitoring how many users are using each of these applications. So you might for example notice that you have 1000 users active on /contact/A. You deploy a new version as /contact/B and flip the switch. Then, depending on the usage characteristics of your application – over the next few minutes, next few hours, however it works out, the users, as they log out and log in and such, gradually all make it into the /B application. Some kind of maximum-login-time mechanism will ensure that this cutover happens in finite time.

Once the users have moved to /contact/B, you then declare it as your “current” version, and you take down /A, because no one’s using it and no one can get into it. So that next time you need to do an upgrade, you just do it in reverse – you deploy that new version as /contact/A, flip the switch back to make all new users’ logins land in the A… and again, after however many hours or minutes or whatever, you have all your users on the new version, and you can take down the old version.

You can easily implement with just the tools that already come with your Web application development system. You don’t need any kind of special hardware or special application server or HTTP server support. You don’t need any sort of special way of doing session affinity; you’re doing session affinity by simply handing out the URL of one of these other Web applications.

Bookmarks

Someone might bookmark a page of your application. So let’s say that you had directed them to /contact/A, and they were on the page /contact/A/lists.jsp. When they return to this bookmark later (you do want to support bookmarking, right?), you don’t want them to land there; you don’t want them to end up in the A application if you’re currently using the B application. This is actually pretty easy to handle also. You simply use some settings on your Web server to do a redirect, with a few lines of configuration in .htaccess or analogous mechanims. So based on your setting of which one is current, you make it so that if someone comes into the application without having a referrer from inside the application, you just redirect them over to whatever the current instance is. And that takes a little bit of thought, but only a little, and you can make it seamlessly solve that problem of users’ bookmarks working in spite of you switching back and forth between two instances.

Clustering

You might be deployed on a cluster. Perhaps you are using Websphere with 37 web servers. It turns out that this A/B approach works orthogonally to the clustering features of your Web application server. You could have the A application deployed across all 37 servers; you could deploy the B application, with a few clicks, across all 37 servers; you could flip that switch in some global way to kick people onto the B, and so on.

Override the launchpad for testing

You can permit users to enter a special URL to get to the “other side”. you could have some way of entering a URL that takes you past that launch application straight into the B side, so that you could click around, you could manually verify that the new B application works in the production environment before you flip the switch to make that the deployed production system. This is a very wise and useful type of testing to do, a great final stage of testing because the new code in actually in production. It’s obviously not a replacement for testing in a separate test environment, it’s an adjunct for even greater safety in deploment.

Performance

When a system is running in a steady state, its caches are fully populated with relevant data, so many requests can be answered with data from the cache (RAM). But when a system is freshly started, its caches are empty, so more requests require (slower) disk access, during the first few minutes of operations. This is sometimes called the “empty cache” problem, and is responsible for the poor performance sometimes seen in the first few minutes after a busy system is restarted.

The technique described here prevents this problem, because with it you avoid ever shutting down and restarting your whole Web application with your full user population on it. Instead, since the switch only brings newly-logging-in users to the new version – the new instance – you gradually have people start using it, so you never take a big hit all at once in terms of cache population.

Schema Changes

Hani asked, in a comment, about schema changes. A simple answer is that you won’t be able to make a transition like the one described here (where both the old and new code versions run in parallel for a while), if you make schema changes such that the old code no longer works. A more complex answer, which I have used with great results, is that this is a programmable computer and you are a programmer – with some effort, you can make the software tolerate both the old and new schemas. So the process works like this:

  1. Decide on the schema change, but don’t deploy it
  2. Modify your software to tolerate the old or new schema, whichever is present
  3. Deploy the new software, transition all users to it (as described above)
  4. Make the schema change; you may need to momentarily quiesce the software, but hopefully not kill user sessions

(There are a few tools out there to help with the schema-change-in-a-live-app problem. One of then is ChronicDB who wrote me to point this out.)

Of course this is a lot more work than just stopping the server, making your change, and restarting. Whether it’s worth it depends on your situation. If you have an overnight non-usage window, consider using it instead of the long path described here.

I hope this is helpful for someone out there. Comments are welcome.

Joel, you have got to be kidding

Joel seems to “play it safe” … then goes off the deep end of irony in his final paragraph:

“FogBugz is written in Wasabi, a very advanced, functional-programming dialect of Basic with closures and lambdas and Rails-like active records that can be compiled down to VBScript, JavaScript, PHP4 or PHP5. Wasabi is a private, in-house language written by one of our best developers that is optimized specifically for developing FogBugz; the Wasabi compiler itself is written in C#.”

Fortunately DHH saved me some minutes of typing about it, with a scathing commentary.

Over at Oasis Digital we use both common tools (.NET, Java, PHP, C, Delphi, etc.) and more unusual ones (Lua, Prolog, Ruby, sorry no Lisp yet), so I believe that puts us in the DHH and Paul Graham camp: If you want to win, you must be willing to do something different from the pack… such as, in an extreme case, creating your own language optimized for the task at hand, whether in the form of Lisp macros or a C# compiler for Wasabi.

Comparing PHP and ASP

PHP and ASP (Active Server Pages) have many ideas in common, and I tend to lump both in to the same category: page-based server-side web scripting tools. I use that category for small to medium sized application; I usually don’t choose it for large or complex web applications.

Here are a few points of comparison between the two:

  • PHP is more seamless across platforms. I can develop on my Windows box and deploy on Linux. (I know about ChiliSoft’s ASP implementation for Unix, but I don’t see much point in bringing my ASP code, which tends to use COM objects, over to Linux.)
  • PHP has a great number of functions that a web developer needs “in the box” – read through the manual to see the list. I found that I had to go looking around to third parties and COM objects to get some things in ASP which
    were already there in PHP. For example, the htmlentities() function.
  • PHP is very often available on external web hosts, if they use Unix/Linux servers, at very low cost.
  • ASP is not tied to one language. It uses the same infrastructure to work with VBScript, JScript, PerlScript, PyhonScript, and more! This is an excellent approach, and Microsoft should be applauded for it.
  • In ASP, the dominant model for DB access is to use ADO, which provides a degree of database independence; PHP has seperate functions for each database type, and requires an additional layer to acheive that degree of independence.
  • Code modules in a COM object can be plugged in to ASP or 100 other places.
    Code modules in C with appropriate PHP linkage are only for PHP, and are potentially more difficult to write.
  • ASP is almost always available on external web hosts, if they use Windows servers. Windows-based hosting tends to cost a bit more than Linux-based hosting, especially with SQL Server is provided.

Update (2007): PHP and ASP (now ASP.NET) have changed considerably since I wrote this; I suggest looking around for a more current comparison if you need to select between them now.

Under The Hood – PHP and MySQL

(This introductory article was written in 2001 to help explain to clients why LAMP (Linux, Apache, PHP and MySQL) were chosen as the infrastructure for for certain kinds of web sites. We generally choose other tools now.)


Many of our dynamic web sites are built using PHP and MySQL. Although these products are frequently used together, they do not have to be, and each plays a separate role in the dynamic web content generation process. The two major components needed for that process are:

  1. A database system to hold the dynamic web site’s data. Although it is possible to build small sites on flat-file storage systems, a database server is a much more scalable, reliable foundation for a dynamic web site. The database server is not web-specific per se, although it does need to potentially support a heavy load of queries, slanted heavily towards “SELECT” (data reading) queries rather than data updating queries.
  2. A web-database integration tool (scripting language or equivalent) for dynamically generating HTML pages, email messages, etc. based on data stored in the database, and processing user interactions to update that database.

Often for the database, we choose MySQL for several reasons:

  • Reliability – based on reading comments from many users, MySQL appears to be a reliable product. We have experienced no MySQL failures of any kind ourselves.
  • Price – MySQL is free, keeping the budget to a minimum for smaller sites.
  • Speed – MySQL is designed to operate very quickly and efficiently for simple SELECT statements, at the expense of lacking some features that “heavier” databases have, such as transaction support, stored procedure, correlated subqueries, etc. Although these are very valuable features, most dynamic web sites do not need these features because of the nature of the features they provide.

We are also familiar with a number of other database choices for other projects:

  • Microsoft SQL Server – this product, in it’s 7.0 and higher versions, is a very nice database server with excellent integration (of course) in to the Windows NT environment. It would be an obvious choice to consider when working on a project which was required to be NT-based.
  • Oracle – Oracle is the market leading for large database servers, and runs behind the scenes and many of the internet’s busiest web sites. It has a mildly painful but extremely powerful stored procedure capability, and can be tuned for high performance under extreme load, if you can find a sufficiently skilled DBA. We are happy to bulid an Oracle-based solution where it is appropriate. Oracle is however quite expensive and require much more hardware to run well than MySQL, for example.
  • PostgreSQL – This is the “other” free database server out there. It has only recently begun to approach the stability and speed of MySQL, and it offer a much richer set of features, more similar to the top-tier commercial database servers.
  • Interbase – Released as one point as open source from Borland, Interbase is also very fully featured. It seems to not be as fast as MySQL for simple SELECTs (the most common case in web apps), but can handle complex tasks effectively and does not require much DBA attention.
  • Various others.

For our programming / web-database integration tool, we are using PHP, a web-specific scripting language. There are literally hundreds of web-database tools to choose from, so not surprisingly there are many excellent choices. Some of the reasons we use PHP are:

  • Tight integration with Apache, the most popular web server on the internet.
  • Good performance, because of that tight integration. PHP is an excellent compromise between power and efficiency; it provide a flexible, expressive language yet it is simple enough to teach to new team member quickly
  • A very rich array of built-in features. In particular, PHP has many capabilities “in the box” that have to be added on to other popular solutions like ASP.
  • Built in database access, with connection pooling.
  • It is Open Source, and under very active development; any bugs that appear get fixed quickly.
  • There are commercial performance enhancement mechanisms available if needed.

PHP is not without weaknesses, however. For example, its basic design encourages mixing the scripts in with the site’s HTML. This works well for small sites, but becomes very troublesome on large site where different people are responsible for the HTML and the programming. Another weakness is that the language relies mostly on programming technique to remain manageable and structured; it has almost no support for modularization. Some other web-database integration tools that address these issues and others are:

  • Java Servlets, which leverage the increasingly dominant Java language.
  • a Java-based application server, such as WebLogic, WebSphere, etc. These can be used to build systems which include servlets, JSPs,EJB, JMS, and other powerful Java technologies.
  • mod_perl integrates the remarkably expressive Perl language into Apache, and the wide array of accompanying Perl modules can implement isolation between the application code and HTML if desired.
  • Zope, an application server based on the Python language, has a strong following and provide the desired seperation between logic and HTML. It also offers total web-based management, scalability, and a vibrant developer community.

We are pleased and satisfied with the usability and performance of PHP and MySQL. They are an excellent combination for small to medium-scale read-intensive web application projects.