Category Archives: Hadoop

Hadoop YARN

There is widespread discussion of the new MapReduce release, called YARN (Yet Another Resource Negotiator), which ships with Hadoop 2.0. Curt Monash tries to give a clear perspective on the multitude of releases in his post: Hadoop YARN – beyond MapReduce.

The new MapReduce YARN promises significant improvements in reliability, availability, scalability, backward (and forward) compatibility, predictable latency and cluster utilization. These goals drive architectural and design changes, as depicted in Arun C Murthy's YARN Architecture:

The major difference is that the JobTracker is divided into:

  • ResourceManager, which manages the global assignment of compute resources to applications.
  • ApplicationMaster, which manages the application's scheduling and coordination.
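
The split of responsibilities can be illustrated with a small toy model. This is only a sketch and does not use the Hadoop API: the class names mirror the two YARN components, but every method, field and task name here is hypothetical, and the "containers" are just a counter. It assumes that the ResourceManager only hands out and reclaims capacity, while each ApplicationMaster decides which of its own tasks run in the capacity it was granted.

```python
from dataclasses import dataclass, field


@dataclass
class ResourceManager:
    """Toy stand-in for the global resource arbiter (not the Hadoop API)."""
    free_containers: int
    allocations: dict = field(default_factory=dict)

    def allocate(self, app_id: str, requested: int) -> int:
        # Grant as many containers as are still free, up to the request.
        granted = min(requested, self.free_containers)
        self.free_containers -= granted
        self.allocations[app_id] = self.allocations.get(app_id, 0) + granted
        return granted

    def release(self, app_id: str) -> None:
        # Return all containers held by an application to the pool.
        self.free_containers += self.allocations.pop(app_id, 0)


@dataclass
class ApplicationMaster:
    """Toy per-application coordinator that schedules its own tasks."""
    app_id: str
    tasks: list

    def run(self, rm: ResourceManager) -> list:
        completed = []
        pending = list(self.tasks)
        while pending:
            granted = rm.allocate(self.app_id, len(pending))
            if granted == 0:
                raise RuntimeError("no containers available")
            # Run one task per granted container, then hand containers back.
            batch, pending = pending[:granted], pending[granted:]
            completed.extend(f"{task}:done" for task in batch)
            rm.release(self.app_id)
        return completed


# Hypothetical cluster with two containers and one application with three tasks.
rm = ResourceManager(free_containers=2)
am = ApplicationMaster(app_id="app-1", tasks=["map-0", "map-1", "reduce-0"])
results = am.run(rm)
```

In this sketch the ResourceManager never sees individual tasks, only container requests, which is the point of dividing the old JobTracker: application-specific scheduling logic lives in the ApplicationMaster.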

Also, the communication between the different nodes is simplified, which allows greater scalability. A prototype built on YARN that clearly demonstrates its advantages is described at length in PaaS on Hadoop Yarn – Idea and Prototype. Despite many issues and failures in the current implementation, the framework will open up application fields that were not possible with the old version.

Big Data Architectures

At QCon London 2012, Ashish Thusoo presented the data scalability issues at Facebook and the evolution of its data architecture from EDW to Hadoop to Puma. Video of the talk and the slides are available on the infoq.com page.

Talking about Big Data architectures, it is exciting to look at the Klout architecture and how they combine different technologies to achieve their goal. Slide 4 of the presentation embedded below best shows their ecosystem of technologies:

  • MySQL
  • MongoDB
  • HBase (Hadoop)
  • Pig and Hive (Hadoop)
  • Node.js
  • Objective-C
  • Scala
  • and more

The Big Data World

Dave Feinleib published the great picture below, which very accurately summarizes the current state of the Big Data Landscape. It shows how the big companies are catching up with the trend and adapting their technologies to address Big Data problems. In the meantime, a multitude of new startup companies are emerging to fill the missing parts of the landscape. And there is much more to come in the next years, as the dynamics in the Big Data World will definitely increase along with the development and standardization of basic technology drivers such as Hadoop and Cassandra.