When I checked my email this morning, I saw an announcement for Amazon Elastic MapReduce. This is built on an open-source project called Hadoop, which I’ve been following off and on for the last couple of years. It recently made the New York Times, and there’s a nice picture of the relevant geeks at the top of the article.
Hadoop provides an implementation of a simple distributed computing algorithm called Map/Reduce. Map/Reduce lets a user split a computation job into smaller units that can be distributed across a large number of servers and run in parallel, with the partial results then combined into a final answer. Google pioneered the technique to build, well, Google. The Hadoop project also includes a distributed file system (allowing management of large quantities of data) and a distributed database system. The latter is similar to Google’s BigTable system, which underlies most of Google’s applications, including (to my understanding) Google Health.
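To make the split-then-combine idea concrete, here is a minimal sketch of the Map/Reduce pattern as a word count, run in a single Python process. It is not Hadoop's API; the function names (map_phase, shuffle, reduce_phase, word_count) are my own for illustration, and a real framework would run the map and reduce steps on different servers rather than in one loop.

```python
from collections import defaultdict
from itertools import chain


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values for one key into a single result."""
    return key, sum(values)


def word_count(chunks):
    # Each chunk stands in for a unit of work that could go to its own server.
    mapped = chain.from_iterable(map_phase(chunk) for chunk in chunks)
    return dict(reduce_phase(key, values) for key, values in shuffle(mapped).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog", "The fox"]))
```

The point of the shape is that no map call needs to know about any other chunk, and no reduce call needs to know about any other key, which is what lets the work spread across machines.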
What’s new today is ease of access. People have been running Hadoop on Amazon for a while, and Amazon has just taken some of the configuration hassles away, including integrating Hadoop into the AWS Console. That’s an interesting decision in and of itself, since Amazon hasn’t gotten around to implementing console management for S3 (their storage product, the first cloud offering they made) or some of their other, more established, features. I think it means they’re signaling to third-party developers that they’re trying hard not to stomp on external innovation.
In the informatics world, Hadoop means scale. Genetic analysis algorithms are amenable to the Map/Reduce framework. So is number crunching for claims analysis and public health reporting, and data management for large personal health applications. Right now I’m working with a couple of very innovative projects as part of a Translational Informatics on the Cloud “palaver” organized by Professors Peter Tonellato and Dennis Wall at Harvard Medical School. Peter and Dennis, in particular, are doing some very interesting projects, and you can watch along on the weekly webcasts. We haven’t talked about how their work would change with these algorithms and this infrastructure, but if the problems are amenable, a lot of the purely logistical problems they’ve been grappling with will go away.
All of this blends into a larger theme – tools that put researchers and developers closer to solving a problem, with less time spent on plumbing. Hadoop was vaguely scary technology to get running, particularly for people whose primary interest isn’t infrastructure. Now it isn’t.