Map-Reduce in the Small: an Array of Talks

At Strange Loop 2010, Guy Steele gave a wide-ranging, excellent talk with one key point.

In essence, his notion is to use a divide-and-conquer approach, which he described as “map-reduce in the small” (or some similar phrase): the techniques used to partition work in large distributed systems, applied inside a single program.

I heartily agree with all of this. Massive multicore will be a dominant factor in software design in the coming decade. In 2010, most of us are happily waiting in the calm before the storm, because our multicore machines don’t have very many cores yet. For most applications, we get by with very coarse parallelism (such as one thread per concurrent user request, in an application server or web server). This won’t last – when cheap PCs have 50+ cores, most software will need to harness parallelism in a much more fine-grained way. Allocating only one core per concurrent operation will become ridiculous.

You can download Steele’s slides from the Strange Loop site, or watch this video of his earlier talk at ICFP 2009, which covered some of the same material. At Strange Loop, Steele showed how to solve a particular problem (counting words in a string) in a manner amenable to parallel processing. His sample code was written in Fortress, a Sun-Oracle research language. Fortress didn’t bother me, but I heard several attendees describe the language choice (and the rapid pace of presentation) as an obstacle to detailed understanding.
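
To make the idea concrete, here is a minimal sketch in Java (the language of the first talk proposed below) of a divide-and-conquer word count in the spirit of Steele’s example. This is my own illustration, not a port of his Fortress code; the class names and the sequential cutoff are arbitrary, and it assumes the fork/join framework (jsr166y, headed into Java 7).

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Divide-and-conquer word count: each text segment reduces to a small
    // "partial" value, and two adjacent partials combine associatively,
    // merging a word that straddles the split point.
    public class ParallelWordCount {

        static final class Partial {
            final boolean beginsInWord, endsInWord;
            final int words;
            Partial(boolean begins, int words, boolean ends) {
                this.beginsInWord = begins; this.words = words; this.endsInWord = ends;
            }
            // If the left piece ends inside a word and the right piece begins
            // inside one, that is one word counted twice: subtract one.
            Partial combine(Partial right) {
                int overlap = (endsInWord && right.beginsInWord) ? 1 : 0;
                return new Partial(beginsInWord, words + right.words - overlap, right.endsInWord);
            }
        }

        static final class CountTask extends RecursiveTask<Partial> {
            static final int CUTOFF = 10000;  // below this, just count sequentially
            final char[] text; final int lo, hi;
            CountTask(char[] text, int lo, int hi) { this.text = text; this.lo = lo; this.hi = hi; }

            protected Partial compute() {
                if (hi - lo <= CUTOFF) return countRun();
                int mid = (lo + hi) >>> 1;
                CountTask left = new CountTask(text, lo, mid);
                CountTask right = new CountTask(text, mid, hi);
                left.fork();                             // left half runs in another worker
                Partial rightResult = right.compute();   // this thread does the right half
                return left.join().combine(rightResult);
            }

            private Partial countRun() {
                int words = 0; boolean inWord = false;
                for (int i = lo; i < hi; i++) {
                    boolean wordChar = !Character.isWhitespace(text[i]);
                    if (wordChar && !inWord) words++;    // a new word begins here
                    inWord = wordChar;
                }
                return new Partial(!Character.isWhitespace(text[lo]), words, inWord);
            }
        }

        public static int countWords(String s) {
            if (s.isEmpty()) return 0;
            return new ForkJoinPool().invoke(new CountTask(s.toCharArray(), 0, s.length())).words;
        }

        public static void main(String[] args) {
            System.out.println(countWords("map reduce  in the small"));  // prints 5
        }
    }

Because the combine operation is associative, the same code is correct no matter where the splits land; that algebraic property is the heart of Steele’s argument.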

I propose to elaborate these ideas by walking through sample code, in a series of talks.

Talk Proposal: Map-Reduce in the Small

First, I will briefly summarize the need for fine-grained parallelism.

Then, I will present three code walkthrough examples in a widely used language.

  1. The word-count problem from Guy Steele’s Strange Loop talk.
  2. Another simple algorithmic / computer-science-flavored example.
  3. As time allows, a third example from a thoroughly practical, enterprise-app-flavored problem space.

This talk will use very few slides; instead it will be all about the code. Each example will show how (not just why) to parallelize algorithms.

An Array[0..N] of Talks

This idea is worthy of deep understanding and practice, so I have in mind several talks, each using a different language, and with sufficient differences to be interesting even for someone who happens to be present for all of them:

  1. Examples in Java, at the St. Louis Java user group
  2. Examples in Clojure, at the St. Louis Clojure Lunch Cljub (there are already some good Clojure examples out there, making this one easy, but still worthwhile).
  3. Examples in another language (perhaps more esoteric, to be determined later), at Lambda Lounge
  4. A repeat of one of those languages, with updated examples, at Strange Loop 2011 (no web site yet!)

Putting these together will take a while, so I have in mind spreading these over the next year or so. Of course, it is quite possible that only a subset of these groups/events, possibly an empty set, will accept this talk offer. In that case, I’ll take it as a sign that the St. Louis community already understands micro-parallelism in depth, and celebrate!

Update: Schedule

As these talks are scheduled, I will put the information here:

  1. Java code at the St. Louis JUG: Jan 13, 2011.

Refactoring some Factor code

Most of the software I work with is very practical. At Oasis Digital we mostly create line-of-business enterprise software, and even when I step away from that, I usually pick up a tool or language that has a good likelihood of mainstream adoption.

Sometimes, though, I like to really stretch my mind. For that, it’s hard to beat Factor. Factor is fascinating in that it combines a goal of efficiency and practicality with a syntax and computation model that are quite alien even to a software polyglot. Don’t let the stack-ness deceive you; it’s a big leap even if you’ve used FORTH and grown up with an HP RPN programmable calculator.

So I set about this evening to work with some Factor code: a simple GUI calculator posted a few days ago by John Benediktsson. I bit off an apparently small piece of work: remove the “code smell” of a global variable, and in the process make it so multiple calcs each have their own model (rather than global shared state).

Original version from John

My finished version

The two most important pieces of updated code:

The changes consist approximately of:

  • Change all the button words to accept a model input
  • Change the <row> word to accept a model and use map instead of output>array
  • Remove the calc variable
  • Change the calc-ui word to shuffle things around and use make rather than output>array

In case it isn’t obvious from my text above or the source code: I am not a Factor programmer, so please do not use this as example code. On the other hand, I learned a bunch of little things about Factor, and perhaps implicitly about concatenative programming in general, in the process of making this work.

What is the Best Git GUI (Client) for Windows?

I adopted Git as my primary source control tool a couple of years ago, when I was using Windows as my primary (90%) desktop OS. Since then I’ve switched to 75% Mac OSX, but I still use Git on Windows for a few projects, and I get a lot of questions about Git on Windows.

I use msysgit (and its included GUI) most often myself, but I don’t have a clear answer as to which is the “best” Git GUI for Windows. I can offer this list of choices, though, along with some thoughts about them.

There is also a very long list of Git tools on the main Git wiki, but that page is just a list, without any other information.

msysgit

msysgit is the main project that ships a Windows port of Git. It is based on MSYS, so it fits in the Windows ecosystem a bit better than the cygwin Git port.

msysgit includes the same Tk-based GUI tools as Git on Linux: a commit tool and a repo-browse tool, plus a bit of shell integration to activate the GUI by right-clicking in Windows Explorer, plus a new thing called git-cheetah, which appears to be heading toward Tortoise-style integration. These tools are a bit ugly, but have good and useful functionality. I don’t mind the ugly (I get my fix of stylish software over on my Mac…), and I find the features ample for most work.

If you don’t know where to start, or if you want a Linux-like Git experience, start with msysgit and learn to use its tools.

msysgit is free open source software. It is under active development, and keeps up with the upstream Git versions reasonably well. There is even a portable (zero-install) version available.

My biggest gripe with msysgit (and its GUI) is that I had to figure out how to use it effectively myself. I could have really used a video walkthrough of how to be productive with it, back when I was starting out. That was a long time ago for me, but might be Right Now for people reading this post. Mike Rowe (a reader) kindly suggested this msysgit tour, which is very helpful, though a bit dated.

TortoiseGit

This is an attempt to port TortoiseSVN to Git, yielding TortoiseGit. If you like and use TortoiseSVN, you’ll probably find this worth a try. I haven’t tried it yet myself.

TortoiseGit is free open source software, and is under active development.

Git Extensions

This Git GUI has a shell extension (like the Tortoise family) and also a plugin for Visual Studio. From the screen shots, it appears to be feature-rich and complete.

Git Extensions is free open source software, and is under active development.

SmartGit

Unlike the other tools listed here, SmartGit is a commercial product (from a German company), starting at around $70. It appears to be more polished than the others, as is often the case with commercial products. It also appears to be quite feature-rich.

I don’t know how SmartGit fits in with the Git licensing; Git is licensed GPL (v2), so I assume (hope?) SmartGit has found some way to use it under the hood without linking to it in a way that would cause license trouble.

SmartGit requires a Java runtime, implying that it is written in Java. Five years ago I thought of that as a caveat, but today Java-based GUIs can be extremely attractive and fast, so I don’t see it as a problem at all.

Is IDE Integration Vital?

I know people who swear by their IDE experience, and are aghast at the thought of any daily-use dev tool that is not integrated with their IDE. It is almost as though for this group, multitasking does not exist, and any need to run more than one piece of software at the same time is a defect.

Now I love a good IDE as much as anyone (I’ve urged and coached many developers to move from an editor to an IDE), but I don’t agree with the notion that source control must always live in the IDE. IDE-integrated source control can be very useful, but there are cases where non-integrated source control wins.

The most common example for me is when using Eclipse on a large, complex system. There are two annoyances I see regularly:

  • Eclipse assumes that one Eclipse project is one source control project, an assumption that is sometimes helpful and sometimes painful. In the latter case, simply ditch the Eclipse integration, and use a whole workspace (N projects) as a single source-control project, outside of Eclipse.
  • Sometimes Eclipse source control integration bogs down performance. Turn it off, and things speed up.

Therefore, when I use Eclipse, I sometimes manage the files from outside, using msysgit, the command line, etc. When I have a complex “real-life project” made up of many Eclipse “projects”, I set up a separate Eclipse workspace for it, apart from other unrelated Eclipse projects.

Feedback Wanted

I’d love to hear about:

  • More Windows Git GUIs to list here
  • Anything else I’ve missed

… via the contact page (link at the top of the page). I try to reply to all email within a few days.

Write your whole stack in JavaScript with Node.JS

Node is a combination of Google’s V8 JavaScript implementation, and various plumbing and libraries. The result is an unusual and clever server programming platform. Node is in a fairly early development phase, and already has a remarkably active community: ~9000 mailing list messages (as of June 2010) and many dozens of projects and libraries. I’ve spent some time digging through Node code and writing small bits of it, and was pleased with what I found.

Why is Node worthy of attention?

  • JavaScript is a Next Big Language: it is everywhere. It is probably the most widely used programming language ever.
  • I know a few things about asynchronous server programming, having done a lot of it in 1990s IVR software; it is very well suited to serving a large user population.
  • Node is accumulating libraries at an impressive rate, indicating momentum.
  • There are significant advantages in developing a whole application stack (server and client code) in a single language. For example, this makes code and business logic sharing work across tiers. Using Node, a JavaScript-HTML tool, a JavaScript-CSS tool, JSON, etc., it is possible to develop a complex web application using only JavaScript.

Node is not all unicorns and roses, though; my most serious misgiving about it is that it does not (yet) have a great strategy for making straightforward use of many-core servers. We’ll have to see how that develops over time.

Node Knockout

The team at Fortnight Labs is putting together Node Knockout, a 48-hour Node programming contest. I am a fan of such contests. I’ve offered to help out by being a judge, and I’ve also signed up Oasis Digital as a sponsor.

As a judge, I can’t be on a team; I’d like to see a team or two form here in St. Louis, though.

I’m Dreaming of a Better Social Media Client

I’m not a big social media guy. I’m certainly not a social media consultant, nor a maven. I never used MySpace at all, and I was not among the first to use Facebook, LinkedIn, Twitter, etc. But I do find all of those useful for keeping in touch with a bunch of people, and I’ve grown quite frustrated with the sorry state of the client applications I’ve tried. Even those whose features work well and look good don’t really go after the core problem we all either have now or will hit soon: information overload.

Here is what I really want in a social media client application for “power users” who receive a lot on their feeds: people who follow a lot of others on Twitter, have a lot of friends on Facebook, 500+ connections on LinkedIn, etc. Today, these are power users. Over the next few years, this will be “everybody”. Most of these features also make a lot of sense for a business managing its presence.

Table Stakes – The Basics

Support the Big Three (Twitter, LinkedIn, Facebook)

… and hopefully several more. But don’t even come to the party without the big three. I’m looking at you, Twitterrific on the iPad, which I otherwise enjoy (and use every day, and paid for).

Ideally, RSS feeds would also flow in, and perhaps email and SMS too. But I don’t want this to be a “unified inbox” to replace an email client; this information would appear here as context for smart reading.

Run On Many Platforms

Mac, PC, iPhone, iPad, Android, Linux, maybe even BlackBerry. It’s not necessary to start with all of these, but the target should be to end up with all of them and more, with the core features present everywhere. I’m not looking for crappy ports, though. Native, good citizens.

Keep Track of What I’ve Seen

Keep track of what I’ve seen, automatically. Don’t show it to me again unless I ask. But the act of closing the app should be meaningless: it should not mark all data as seen. An example of what not to do is TweetDeck, which has various settings for this, but no combination of them that I can find does the Right Thing.

Next, the less common ideas:

I Paid for a Lot of Pixels – Use Them

Single-column feed display GUIs? Great idea for a phone. Silly on a PC.

Like most PC users, I have a wide, high-resolution screen. Like many power users, I have two screens on some computers. I paid good money for all these pixels because I want to use them. Therefore, when I’m trying to catch up with all these data/tweet/etc. feeds, I want software that makes good use of those pixels. Show me a rich, dense screenful of information at once. Make it look like a stock trader’s screen (or screens).

Our Eyes are All Different – Give Me Knobs

I don’t want extensive customization. I don’t want a whole slew of adjustments. I don’t want a Preferences dialog with 82 tabs. I don’t even want themes. I want a good, clean default design… but with a few well-considered knobs. Perhaps knobs like these:

  • font/size knob – because my eyes might work a bit better or worse than yours, and my screen might be higher or lower resolution than yours.
  • information density knob – because sometimes I want to admire a beautiful, well-spaced layout, and sometimes I just want to pack more information on there.

Aggregate Across Networks

Many of the people I follow post the same data to at least three social media outlets; then a bunch of other people copy/paste or retweet it. Please stop showing me all that duplication!

Instead, aggregate it all together, like Google News does for news sites. Show me each core message once, and then show a (dense, appropriate) display of who/how the information came in. Include a sparkline and other charts to show the continued re-arrival of that same data. This way, I won’t have to endure the duplication directly, and I can more clearly see how information traverses the (underlying, human) social network.
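
Here is a minimal sketch of that aggregation idea, in Java. The Item shape and the normalization rules are hypothetical placeholders I made up for illustration; a real client would need much smarter matching:

    import java.util.*;

    // Sketch: collapse cross-network duplicates by grouping items on a
    // normalized "core message" key, so each story appears once along with
    // the list of ways it arrived.
    public class FeedAggregator {
        record Item(String network, String author, String text) {}

        // Strip the parts that differ between copies of one message:
        // "RT @user:" prefixes, URLs, punctuation, case, extra whitespace.
        static String coreKey(String text) {
            return text.toLowerCase()
                       .replaceAll("^rt @\\w+:\\s*", "")
                       .replaceAll("https?://\\S+", "")
                       .replaceAll("[^a-z0-9 ]", "")
                       .replaceAll("\\s+", " ")
                       .trim();
        }

        static Map<String, List<Item>> aggregate(List<Item> incoming) {
            Map<String, List<Item>> grouped = new LinkedHashMap<>();
            for (Item item : incoming)
                grouped.computeIfAbsent(coreKey(item.text()), k -> new ArrayList<>()).add(item);
            return grouped;
        }

        public static void main(String[] args) {
            List<Item> feed = List.of(
                new Item("twitter",  "alice", "Big news: Foo 2.0 is out! http://foo.example"),
                new Item("twitter",  "bob",   "RT @alice: Big news: Foo 2.0 is out! http://foo.example"),
                new Item("facebook", "alice", "Big news: Foo 2.0 is out!"));
            aggregate(feed).forEach((key, items) ->
                System.out.println(items.size() + "x  " + items.get(0).text()));
        }
    }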

Some Tweets are More Equal than Others

In an ideal world, every Facebook update, every Tweet, would be a precious flower, to be admired in depth. We don’t live there. Instead, there is a lot of noise; an example fresh in my mind as I write this is the TV show Lost. It may be a great show, but it’s not one I watch, so to me all the Lost chatter is noise. I’ve probably scanned/scrolled past a couple hundred of them (some of them duplicates) over the last few days.

Therefore, a good social media client will make it trivial (one click) for me to tell it which bits I am interested in and which I’m not. I’m not talking about a scoring system, just a simple up/down arrow, for a total of three bins:

  • Important
  • Bulk / default
  • Junk

Apply some automatic classification mechanism (like the naive Bayesian classification that’s been common for several years now in email spam filtering) to learn from my votes, and apply that learning to future data. By default, highlight the Important, show the Bulk, and hide the Junk.
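
For concreteness, here is a minimal sketch in Java of the kind of classifier I mean: a three-bin naive Bayes over message words, trained from the user’s votes. It is a made-up illustration of the mechanism, not anyone’s shipping code:

    import java.util.*;

    // Sketch: a tiny three-bin naive Bayes text classifier, trained from
    // the user's up/down votes, used to pre-sort future items.
    public class TriageClassifier {
        enum Bin { IMPORTANT, BULK, JUNK }

        private final Map<Bin, Map<String, Integer>> wordCounts = new EnumMap<>(Bin.class);
        private final Map<Bin, Integer> wordTotals = new EnumMap<>(Bin.class);
        private final Map<Bin, Integer> messageTotals = new EnumMap<>(Bin.class);
        private final Set<String> vocabulary = new HashSet<>();
        private int messages = 0;

        public TriageClassifier() {
            for (Bin b : Bin.values()) {
                wordCounts.put(b, new HashMap<>());
                wordTotals.put(b, 0);
                messageTotals.put(b, 0);
            }
        }

        // Learn from one vote: the user filed this message in this bin.
        public void train(String message, Bin bin) {
            for (String w : words(message)) {
                wordCounts.get(bin).merge(w, 1, Integer::sum);
                wordTotals.merge(bin, 1, Integer::sum);
                vocabulary.add(w);
            }
            messageTotals.merge(bin, 1, Integer::sum);
            messages++;
        }

        // Pick the bin with the highest log-posterior score.
        public Bin classify(String message) {
            if (vocabulary.isEmpty()) return Bin.BULK;   // nothing learned yet
            Bin best = Bin.BULK;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Bin bin : Bin.values()) {
                if (messageTotals.get(bin) == 0) continue;
                double score = Math.log((double) messageTotals.get(bin) / messages);
                for (String w : words(message)) {
                    int seen = wordCounts.get(bin).getOrDefault(w, 0);
                    // Laplace smoothing: an unseen word must not zero out a bin.
                    score += Math.log((seen + 1.0) / (wordTotals.get(bin) + vocabulary.size()));
                }
                if (score > bestScore) { bestScore = score; best = bin; }
            }
            return best;
        }

        private static List<String> words(String s) {
            List<String> out = new ArrayList<>();
            for (String w : s.toLowerCase().split("\\W+"))
                if (!w.isEmpty()) out.add(w);
            return out;
        }
    }

One click per vote is all the training input required; everything else is bookkeeping the client can do for me.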

I Have Several Devices – Sync Them Now

I might look at this river of news on my Mac in the morning, then on my iPad at lunch, then on my Linux netbook in the evening, then sneak an iPhone peek at bedtime. Keep all that “what I’ve seen” and “what’s important” data in sync across them. This means that my dream social media client needs a backend service behind it. It is not necessary for the data feeds to flow through the backend system (though it might be useful); just the user’s attention metadata.


I believe that most or all of those features will be common in a few years. But I’m annoyed by the tsunami of social media feeds now. Is something like this out there? Where?

I could build such an application (with some help!). I’ve worked with APIs of all flavors. I’ve done mobile. I’ve created GUIs that elicit a “Wow”. I understand servers, and asynchronous operations, and scalability, and SaaS. But if I built it, would anyone *buy* it?

The Prolog Story

I’ve told this story in person dozens of times; it’s time to write it down and share it here. I’ve again experimentally recorded a video version (below), which you can view on a non-Flash device here.

The Prolog Story from Kyle Cordes on Vimeo.

I know a little Prolog, which I learned in college – just enough to be dangerous. Armed with that, and some vigorous just-in-time online learning, I used Prolog in a production system a few years ago, with great results. Two stories are woven together here: one about the technical reasons for choosing this particular tool, and the other about the business payoff for taking a road less travelled.

In 2004 (or so), I was working for an Oasis Digital customer on a client/server application with SQL Server behind it. This application worked (and still works) very well for the customer, who remains quite happy with it. This is the kind of project where there is an endless series of enhancements and additions, some of them to attack a problem-of-the-moment and some of them to enrich and strengthen the overall application capabilities.

The customer approached us with a very unusual feature request – pardon my generic description here; I don’t want to accidentally reveal any of their business secrets. The feature was described to us declaratively, in terms of a few rules and a bunch of examples of those rules. The wrinkle is that these were not “forward” rules (if X, do Y). Rather, these rules described scenarios: if such a scenario happens, then something else should happen. Moreover, the rules concerned complex transitive/recursive relationships, the sort of thing that SQL is not well suited for.

An initial analysis found that we would need to implement a complex depth/breadth search algorithm, either in the client application or in SQL. This wasn’t a straightforward graph search, though; the graph search was just the tip of the iceberg. I’m not afraid of algorithmic programming, and Oasis Digital is emphatically not an “OnClick-only” programming shop, so I dug in. After spending a couple of days attacking the problem this way, I concluded that this would be a substantial block of work, at least several person-months to get it working correctly and efficiently. That’s not a lot in the grand scheme of things, but for this particular customer it would use up their reasonable-but-not-lavish budget for months, even ignoring their other feature needs.
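
To make “depth/breadth search” concrete: the simplest core of what we were contemplating looks like the hypothetical Java sketch below, a breadth-first walk of one transitive relationship. The real rules layered several interacting relationships and conditions on top of this, which is where the person-months would have gone; in Prolog, by contrast, a transitive relationship like this is a two-clause definition.

    import java.util.*;

    // Hypothetical sketch: find everything transitively related to a starting
    // item, via a breadth-first search over one relation. The real problem
    // interleaved several relations and extra conditions at each step.
    public class TransitiveSearch {
        public static Set<String> reachable(Map<String, List<String>> relatedTo, String start) {
            Set<String> seen = new HashSet<>();
            Deque<String> queue = new ArrayDeque<>();
            queue.add(start);
            while (!queue.isEmpty()) {
                String current = queue.remove();
                for (String next : relatedTo.getOrDefault(current, Collections.emptyList()))
                    if (seen.add(next)) queue.add(next);   // enqueue each item once
            }
            return seen;  // everything reachable from start (excluding start
                          // itself, unless a cycle leads back to it)
        }
    }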

We set this problem aside for a few days, and upon further thought I realized that:

  • this would be a simple problem to describe in Prolog
  • the Prolog runtime would then solve the problem
  • the Prolog runtime would be responsible for doing it correctly and efficiently, i.e. our customer would not foot the bill to achieve those things.

We proceeded with the Prolog approach.

….

It actually took one day of work to get it working, integrated, and into testing, then a few hours a few weeks later to deploy it.

The implementation mechanism is pretty rough:

  • The rules (the fixed portions of the Prolog solution) are expressed in a Prolog source file, a page or two in length.
  • A batch process runs every N minutes, on a server with spare capacity for this purpose.
  • The batch process executes a set of SQL queries (in stored procs), returning a total of tens or hundreds of thousands of rows of data. SQL is used to format that query output as Prolog terms. These stored procs are executed using SQL Server BCP, making it trivial to save the results in files.
  • The batch process runs a Prolog interpreter, passing the data and rules (both are code, both are data) as input. This takes up to a few minutes.
  • The Prolog rules are set up, with considerable hackery, to emit the output data we needed in the form of CSV data. This output is directed to a file.
  • SQL Server BCP imports this output data back into the production SQL Server database.
  • The result of the computation is thus available in SQL tables for the application to use.

This batch process is not an optimal design, but it has the advantage of being quick to implement, and robust in operation. The cycle time is very small compared to the business processes being controlled, so practically speaking it is 95% as good as a continuous calculation mechanism, at much less cost.

There are some great lessons here:

  • Declarative >>> Imperative. This is among the most important and broad guidelines to follow in system design.
  • Thinking Matters. We cut the cost/time of this implementation by 90% or more, not by coding more quickly, but by thinking more clearly. I am a fan of TDD and incremental design, but you’re quite unlikely to ever get from a hand-coded solution to this simply-add-Prolog solution that way.
  • The Right Tool for the Job. Learn a lot of them, don’t be the person who only has a hammer.
  • A big problem is a big opportunity. It is quite possible that another firm would not have been able to deliver the functionality our customer needed at a cost they could afford. This problem was an opportunity to win, both for us and for our customer.

That’s all for now; it’s time for LessConf.