
Big Data Architectures

At QCon London 2012, Ashish Thusoo presented the data scalability challenges at Facebook and the evolution of its data architecture from EDW to Hadoop to Puma. The video of the talk and the slides are available on the infoq.com page.

Speaking of Big Data architectures, it is exciting to look at the Klout architecture and how they combine different technologies to achieve their goal. Slide 4 of the presentation embedded below best shows their ecosystem of technologies (a small sketch of the kind of batch job such a stack runs follows the list):

  • MySQL
  • MongoDB
  • HBase (Hadoop)
  • Pig and Hive (Hadoop)
  • Node.js
  • Objective-C
  • Scala
  • and more
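
To make the Hadoop layer of this stack a bit more concrete, below is a minimal Python sketch of the kind of per-user aggregation that Pig and Hive jobs typically express (in Pig Latin this would be a LOAD, GROUP and COUNT). The tab-separated (user_id, action) log schema is invented for illustration; this shows the general map/reduce pattern, not Klout's actual code.

```python
# A hypothetical sketch of the kind of batch aggregation a Pig or Hive job
# on Hadoop would express: counting interactions per user from log records.
# The (user_id, action) schema is invented for illustration.
from collections import defaultdict

def map_phase(lines):
    """Emit a (user_id, 1) pair for every well-formed log record."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 2:
            yield fields[0], 1

def reduce_phase(pairs):
    """Sum the counts per user_id, like GROUP BY user_id ... COUNT(*)."""
    totals = defaultdict(int)
    for user, n in pairs:
        totals[user] += n
    return dict(totals)

if __name__ == "__main__":
    sample = ["alice\tretweet", "bob\tmention", "alice\treply"]
    print(reduce_phase(map_phase(sample)))  # {'alice': 2, 'bob': 1}
```

On a real cluster the map and reduce phases run distributed over HDFS blocks, with Hadoop sorting the mapper output by key in between; Pig and Hive generate exactly this kind of job behind the scenes.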

The Big Data World

Dave Feinleib published the great picture below that accurately summarizes the current state of the Big Data landscape. It shows how the big companies are catching up with the trend and adapting their technologies to address Big Data problems. In the meantime, a multitude of emerging startups are filling in the missing parts of the landscape. And there is much more to come in the next few years, as the dynamics of the Big Data world will only increase along with the development and standardization of basic technology drivers such as Hadoop and Cassandra.

Virtualization-aware Hadoop

Recently VMware started a new open source project called Serengeti, which aims to improve Hadoop usage and performance in virtual environments. It is no surprise that VMware is going in this direction, as it announced Spring for Hadoop just a few months ago. This is a clear sign that the company takes Hadoop very seriously and is pushing it to become a standard enterprise platform running on top of its vSphere cloud platform.

“Hadoop must become friendly with the technologies and practices of enterprise IT if it is to become a first-class citizen within enterprise IT infrastructure. The resource-intensive nature of large Big Data clusters makes virtualization an important piece that Hadoop must accommodate,” said Tony Baer, Principal Analyst at OVUM. “VMware’s involvement with the Apache Hadoop project and its new Serengeti Apache project are critical moves that could provide enterprises the flexibility that they will need when it comes to prototyping and deploying Hadoop.” [source]

Another company, Atlantis, just announced a new solution called Atlantis ILIO FlexCloud that boosts the performance of virtualized data-intensive applications such as Hadoop by caching IO requests, or even the entire application, in RAM. The main features of FlexCloud are:

  • Application Characterization – Identifies and maps storage IO traffic characteristics of the application and responds intelligently based on patterns.
  • Inline IO Deduplication – Eliminates duplicate Write and Read IO traffic to reduce the amount of storage traffic (see the sketch after this list).
  • IO Processing – Processes IO requests from the hypervisor so that processing occurs in memory instead of being serviced by storage, resulting in improved overall performance.
  • Scatter/Gather Coalescing – Transforms smaller randomized IO traffic into larger, easier-to-consume sequential blocks to further boost network and storage efficiency.
  • Fast Clone – Creates new virtual machine clones on demand using Atlantis ILIO without copying data from storage or introducing performance overhead.
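
To illustrate the general technique behind inline write deduplication, here is a minimal Python sketch: each written block is content-hashed, and duplicate content is kept once in RAM instead of generating another storage write. This is an invented toy model of the technique, not Atlantis ILIO's actual (proprietary) implementation.

```python
# An illustrative toy model of inline write deduplication: duplicate block
# content is detected by hash and never reaches backend storage.
import hashlib

class DedupWriteCache:
    def __init__(self):
        self.blocks = {}          # content hash -> block data, kept in RAM
        self.physical_writes = 0  # writes that would actually hit storage

    def write(self, block: bytes) -> str:
        """Store a block; duplicate content never reaches backend storage."""
        digest = hashlib.sha256(block).hexdigest()
        if digest not in self.blocks:
            self.blocks[digest] = block
            self.physical_writes += 1  # only unique content goes to storage
        return digest                  # the logical layer keeps this reference

    def read(self, digest: str) -> bytes:
        """Serve reads straight from RAM instead of from storage."""
        return self.blocks[digest]

if __name__ == "__main__":
    cache = DedupWriteCache()
    refs = [cache.write(b) for b in (b"AAAA", b"BBBB", b"AAAA", b"AAAA")]
    # Four logical writes, but only two distinct blocks reach "storage":
    print(cache.physical_writes, cache.read(refs[0]))  # 2 b'AAAA'
```

Run on four logical writes with two distinct contents, only two writes reach the simulated backend, which is the kind of IO reduction FlexCloud targets at the hypervisor IO path.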

Marvell also offers similar solutions that improve virtualized IO-intensive applications. DragonFly is one such virtual storage hyper-accelerator, built as a hardware system-on-chip device, that aims to improve the scalability and performance of NAS/SAN arrays and enterprise servers.