As those of you who know me are aware, I am a big fan of distributed computing on commodity hardware as a way of solving large-scale problems. I first became aware of this type of solution when I read Google’s Map-Reduce presentation, which started much of this ball rolling. Using cheap PCs in large (> 1000 node) clusters to solve data-intensive processing problems gave Google a significant competitive advantage, particularly in its earlier years. I think it’s noteworthy that Yahoo and Microsoft both elected to start using huge server farms for their search efforts in response to this innovation from Google: Microsoft is using Windows Server deployed over thousands of PCs, while Yahoo has decided to support an open-source Apache project called Hadoop.
Hadoop is pretty cool, in that it builds out much of the functionality seen in Google’s proprietary Map-Reduce implementation using only open-source components. It typically runs on Linux, and the code itself is written in Java, which gives Hadoop pretty good cross-platform compatibility. Yahoo’s Web 2.0 maven Jeremy Zawodny has been blogging about Hadoop use at Yahoo, and the technobloggers have been taking note (this is a recent one from Tim O’Reilly).
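For anyone who has not seen the Map-Reduce programming model before, here is a minimal single-machine sketch of the classic word-count example in Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are my own, chosen for illustration. A real framework would spread the map and reduce calls over many nodes and handle the grouping step itself.

```python
from collections import defaultdict


def map_phase(document):
    """Map step: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group values by key; a real framework does this between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce step: collapse all counts for one word into a single total."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence on one machine."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

The point of the split is that each `map_phase` call only needs its own document and each `reduce_phase` call only needs the values for its own key, so in principle both steps can be run on separate machines.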
So what does this mean for me? I’m not sure yet, but it is clear that my initial experiments in distributed computing using ClusterKnoppix (a Debian-based Linux distribution) and Mosix are really just a small part of a total research effort. Some of my questions are:
- What is the largest cluster that I can build?
- What would its speed or performance rating be?
- Power consumption? Heat output? MTBF?
- What can I DO with it? Really?
I can probably put together a 2-3 node Hadoop cluster pretty easily using my existing hardware, but this still raises questions: what is the overall efficiency of a sub-1000-node distributed computing cluster, and what problems would be readily solvable on such a platform? As part of my efforts to understand (and hopefully deploy) a usable, useful instance of this technology, I’ve been looking at a number of study and training courses on Map-Reduce from Google. Of course, all this has to be done after I finish working on my other distributed computing effort: my Master’s thesis, which, I’m pleased to say, has been approved for project work.
If you are interested in this stuff, feel free to contact me. I’d love to hear some perspectives from other folks.