MapReduce - pipeline-processing no longer a pipe-dream
I gave the Google MapReduce paper a quick read today. It outlines the approach Google uses to abstract away the parallelization of computation, the distribution of data, and the fault tolerance involved in processing large data sets on large clusters of commodity hardware. Bottom line: any web startup with significant scale aspirations should raise MapReduce, or the open source Hadoop, as a brown-bag topic for the engineering team.
Thoughts:
- Adopting a functional programming model and the discipline of atomic task execution is a smart way to reduce complexity for many internet-scale information processing problems.
- The MapReduce library significantly reduces the knowledge acquisition phase for new Google engineers, and hence helps scale people as much as it helps to scale code.
- MapReduce is already being used in some machine-learning applications. It is likely that the converse also holds, with machine learning applied to MapReduce-style processing. Numenta’s hierarchical temporal memory (HTM) model is very interesting in its addition of cascading expectation optimization.
- Prioritization of tasks belonging to different business applications is likely to emerge as an issue. If the end-game is building an internet operating system to be made available to all developers, à la Amazon’s Elastic Compute Cloud (and it should be!), then resource management and task isolation are key.
- Could the master-worker communication be replaced with P2P between workers?
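
To make the functional-model point concrete, here is a minimal single-process sketch in Python of the map and reduce steps, using word counting (the paper's own running example). The names `map_words`, `reduce_counts` and `run_mapreduce` are my own illustrative assumptions, not Google's API or Hadoop's. A real deployment would spread the map and reduce tasks across many machines and re-run any that fail, which this sketch does not attempt.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_words(doc_name: str, text: str) -> Iterator[Tuple[str, int]]:
    """Map step: emit an intermediate (word, 1) pair for every word."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word: str, counts: Iterable[int]) -> int:
    """Reduce step: merge all values emitted for one key."""
    return sum(counts)


def run_mapreduce(
    inputs: Dict[str, str],
    mapper: Callable[[str, str], Iterable[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], int],
) -> Dict[str, int]:
    """Hypothetical in-memory driver: map, group by key, then reduce."""
    groups: Dict[str, List[int]] = defaultdict(list)
    # Each mapper call depends only on its own input, so the calls could
    # run on separate workers and be safely re-executed after a failure.
    for key, value in inputs.items():
        for out_key, out_value in mapper(key, value):
            groups[out_key].append(out_value)
    # Likewise, each reducer call sees only the values for a single key.
    return {key: reducer(key, values) for key, values in groups.items()}


docs = {
    "a.txt": "the quick brown fox",
    "b.txt": "the lazy dog and the fox",
}
counts = run_mapreduce(docs, map_words, reduce_counts)
```

Because the mapper and reducer are side-effect-free functions of their inputs, the driver is free to decide where and how often each call runs. That is the property that lets the library hide distribution and fault tolerance from the engineer writing the two functions.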