Author Archives: Todor

SQL-on-Hadoop Engines


The dynamics of Big Data have pushed the development of many new SQL-like engines on top of Hadoop in the recent years. They were built for different applications and with different purposes in mind. This post summarizes the most popular engines, which are currently available  or in development with both open source and commercial licenses.

Apache Hive

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.


Hive-on-Tez uses the Tez application framework to optimize the execution of the MapReduce jobs.


Hive-on-Spark enables Hive to run on top of Spark. (Still under development)

Apache Pig


Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.


Pig-on-Tez runs Pig on top of the Tez framework.


Pig-on-Spark runs Pig on top of Spark.

Apache Spark SQL

Apache Spark is a fast and general engine for large-scale data processing. Spark SQL is Spark’s module for working with structured data.

Cloudera Impala

Impala is a fully integrated, state-of-the-art analytic database architected specifically to leverage the flexibility and scalability strengths of Hadoop – combining the familiar SQL support and multi-user performance of a traditional analytic database with the rock-solid foundation of open source Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise.

Apache Drill

Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores. Users can query the data using standard SQL and BI tools without having to create and manage schemas. Some of the key features are:

  • Schema-free JSON document model similar to MongoDB and Elasticsearch
  • Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
  • Extremely user and developer friendly
  • Pluggable architecture enables connectivity to multiple datastores

Apache Tajo

Apache Tajo

Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities.

Apache Phoenix

Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.


The phoenix-spark plugin extends Phoenix’s MapReduce support to allow Spark to load Phoenix tables as RDDs or DataFrames, and enables persisting them back to Phoenix.

Facebook Presto

Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.

Apache Flink

Apache Flink

Apache Flink is an open source platform for scalable batch and stream data processing. Includes Table API with a SQL-like expression language embedded in Java and Scala. (still in development)

Apache Kylin

Apache Kylin is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets, original contributed from eBay Inc.

Apache MRQL

MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, Spark, and Flink. MRQL (pronounced miracle) is a query processing and optimization system for large-scale, distributed data analysis. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in four modes:

  • in Map-Reduce mode using Apache Hadoop,
  • in BSP mode (Bulk Synchronous Parallel mode) using Apache Hama,
  • in Spark mode using Apache Spark, and
  • in Flink mode using Apache Flink.


Big SQL leverages IBM’s strength in SQL engines to provide ANSI SQL access to data across any system from Hadoop, via JDBC or ODBC – seamlessly whether that data exists in Hadoop or a relational data base. This means that developers familiar with the SQL programming language can access data in Hadoop without having to learn new languages or skills.

Pivotal HAWQ

HAWQ is an advanced enterprise SQL on Hadoop analytic engine built around a robust and high-performance massively-parallel processing (MPP) SQL framework evolved from Pivotal Greenplum Database?. HAWQ runs natively on Apache Hadoop? clusters by tightly integrating with HDFS and YARN. HAWQ supports multiple Hadoop file formats such as Apache Parquet, native HDFS, and Apache Avro. HAWQ is configured and managed as a Hadoop service in Apache Ambari. HAWQ is 100% ANSI SQL compliant (supporting ANSI SQL-92, SQL-99, and SQL-2003, plus OLAP extensions) and supports open database connectivity (ODBC) and Java database connectivity (JDBC), as well. Most business intelligence, data analysis and data visualization tools work with HAWQ out of the box without the need for specialized drivers. (Proposal for Apache Incubator)

Microsoft PolyBase

PolyBase allows you to use T-SQL statements to access data stored in Hadoop or Azure Blob Storage and query it in an adhoc fashion. It also lets you query semi-structured data and join the results with relational data sets stored in SQL Server. PolyBase is optimized for data warehousing workloads and intended for analytical query scenarios.

Teradata Aster SQL-MapReduce

SQL-MapReduce is a framework created by Teradata Aster to allow developers to write powerful and highly expressive SQL-MapReduce functions in languages such as Java, C#, Python, C++, and R and push them into the discovery platform for high performance analytics. Analysts can then invoke SQL-MapReduce functions using standard SQL or R through Aster Database, the first discovery platform that allows applications to be fully embedded within the database engine to enable ultra-fast, deep analysis of massive data sets.


Extract maximum value from your Big Data by combining the in-memory processing power of SAP HANA with Hadoop’s ability to store and process huge amounts of data, regardless of structure.

SAP HANA Vora – an in-memory query engine for Apache Spark and Hadoop.

Oracle Big Data SQL

Oracle Big Data SQL

Extends Oracle SQL to Hadoop and NoSQL and the security of Oracle Database to all your data. It also includes a unique Smart Scan service that minimizes data movement and maximizes performance.

Cascading Lingual

Cascading Lingual is a powerful extension to Cascading that simplifies application development and integration by providing an ANSI SQL interface for Apache Hadoop. Now you can connect existing business intelligence (BI) tools, optimize computing costs, and accelerate application development with Hadoop.


RainStor takes it a step further and provides an end-to-end application that runs on Hadoop 2.0, enabling organizations to get up and running quickly and achieve faster time to business value. RainStor has supported Hadoop since 2011 and in fact was the first to announce native SQL on Hadoop and building upon these capabilities also provided enterprise-grade security since mid 2013. Extending upon that, RainStor 6 delivers full management capabilities that include YARN integration making it a first class citizen on Hadoop.

Splout SQL

Splout allows serving an arbitrarily big datasetwith high QPS rates and at the same time providesfull SQL query syntax. Splout is appropriated for web page serving, many low latency lookups, scenarios such as mobile or web applications with demanding performance.


Jethro logo_jethrois a SQL engine that super-powers your BI Tool software by taking a unique approach to accessing your big data. Our approach empowers you to analyze and interact with your data in real time.

Splice Machine

Splice Machine

Splice Machine delivers a database solution that incorporates the scalability of Hadoop, the standard ANSI SQL and ACID transactions of an RDBMS and the distributed computing power of HBase.


This week Facebook open sourced a project called osquery, which offers the ability to access low-level operating system information through simple SQL queries (more precisely SQL as understood by SQLite). More information for how to navigate through the tables can be found in the github page.

Installing/building osquery in Linux (in my case Ubuntu 14.04 LTS) is as follows:

git clone
cd osquery
make deps

Testing the project:   make test

Deploying and running it:  make install

make deps will take care of installing everything you need to compile osquery.

If you have any errors in your source list make deps will end with errors and osquery will not be installed, because the used packages are not available. Therefore make sure that you have the latest packages and don’t get any errors in the source.list:  sudo apt-get update  (also  sudo apt-get upgrade).  In case of errors, you can fix the source.list by editing:  sudo gedit /etc/apt/sources.list

Here is another good tutorial on installing and using osquery.

Upgrade VMware vSphere

I just managed to upgrade VMware vSphere 5.1 to 5.5. It was really easy and very effective process. First I upgraded the vCenter Server by downloading the latest vCenter Server Appliance and exchanging it with the old one. The step by step process is described here. Then I upgraded the ESXi Server 5.1 to 5.5 from the console by just following the few steps described here. After dowloading the update in zip file, copy it to the ESXi host, connect through sshand execute in the console :

esxcli software vib install -d /vmfs/volumes/<your_volume>/