The dynamics of Big Data have pushed the development of many new SQL-like engines on top of Hadoop in the recent years. They were built for different applications and with different purposes in mind. This post summarizes the most popular engines, which are currently available or in development with both open source and commercial licenses.
The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
Hive-on-Tez uses the Tez application framework to optimize the execution of the MapReduce jobs.
Hive-on-Spark enables Hive to run on top of Spark. (Still under development)
Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turns enables them to handle very large data sets.
Pig-on-Tez runs Pig on top of the Tez framework.
Pig-on-Spark runs Pig on top of Spark.
Apache Spark SQL
Impala is a fully integrated, state-of-the-art analytic database architected specifically to leverage the flexibility and scalability strengths of Hadoop – combining the familiar SQL support and multi-user performance of a traditional analytic database with the rock-solid foundation of open source Apache Hadoop and the production-grade security and management extensions of Cloudera Enterprise.
Drill is an innovative distributed SQL engine designed to enable data exploration and analytics on non-relational datastores. Users can query the data using standard SQL and BI tools without having to create and manage schemas. Some of the key features are:
- Schema-free JSON document model similar to MongoDB and Elasticsearch
- Industry-standard APIs: ANSI SQL, ODBC/JDBC, RESTful APIs
- Extremely user and developer friendly
- Pluggable architecture enables connectivity to multiple datastores
Apache Tajo is a robust big data relational and distributed data warehouse system for Apache Hadoop. Tajo is designed for low-latency and scalable ad-hoc queries, online aggregation, and ETL (extract-transform-load process) on large-data sets stored on HDFS (Hadoop Distributed File System) and other data sources. By supporting SQL standards and leveraging advanced database techniques, Tajo allows direct control of distributed execution and data flow across a variety of query evaluation strategies and optimization opportunities.
Apache Phoenix is a relational database layer over HBase delivered as a client-embedded JDBC driver targeting low latency queries over HBase data. Apache Phoenix takes your SQL query, compiles it into a series of HBase scans, and orchestrates the running of those scans to produce regular JDBC result sets. The table metadata is stored in an HBase table and versioned, such that snapshot queries over prior versions will automatically use the correct schema. Direct use of the HBase API, along with coprocessors and custom filters, results in performance on the order of milliseconds for small queries, or seconds for tens of millions of rows.
The phoenix-spark plugin extends Phoenix’s MapReduce support to allow Spark to load Phoenix tables as RDDs or DataFrames, and enables persisting them back to Phoenix.
Presto is an open source distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes.
Apache Flink is an open source platform for scalable batch and stream data processing. Includes Table API with a SQL-like expression language embedded in Java and Scala. (still in development)
Apache Kylin is an open source Distributed Analytics Engine designed to provide SQL interface and multi-dimensional analysis (OLAP) on Hadoop supporting extremely large datasets, original contributed from eBay Inc.
MRQL is a query processing and optimization system for large-scale, distributed data analysis, built on top of Apache Hadoop, Hama, Spark, and Flink. MRQL (pronounced miracle) is a query processing and optimization system for large-scale, distributed data analysis. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in four modes:
- in Map-Reduce mode using Apache Hadoop,
- in BSP mode (Bulk Synchronous Parallel mode) using Apache Hama,
- in Spark mode using Apache Spark, and
- in Flink mode using Apache Flink.
IBM Big SQL
Big SQL leverages IBM’s strength in SQL engines to provide ANSI SQL access to data across any system from Hadoop, via JDBC or ODBC – seamlessly whether that data exists in Hadoop or a relational data base. This means that developers familiar with the SQL programming language can access data in Hadoop without having to learn new languages or skills.
HAWQ is an advanced enterprise SQL on Hadoop analytic engine built around a robust and high-performance massively-parallel processing (MPP) SQL framework evolved from Pivotal Greenplum Database?. HAWQ runs natively on Apache Hadoop? clusters by tightly integrating with HDFS and YARN. HAWQ supports multiple Hadoop file formats such as Apache Parquet, native HDFS, and Apache Avro. HAWQ is configured and managed as a Hadoop service in Apache Ambari. HAWQ is 100% ANSI SQL compliant (supporting ANSI SQL-92, SQL-99, and SQL-2003, plus OLAP extensions) and supports open database connectivity (ODBC) and Java database connectivity (JDBC), as well. Most business intelligence, data analysis and data visualization tools work with HAWQ out of the box without the need for specialized drivers. (Proposal for Apache Incubator)
PolyBase allows you to use T-SQL statements to access data stored in Hadoop or Azure Blob Storage and query it in an adhoc fashion. It also lets you query semi-structured data and join the results with relational data sets stored in SQL Server. PolyBase is optimized for data warehousing workloads and intended for analytical query scenarios.
Teradata Aster SQL-MapReduce
SQL-MapReduce is a framework created by Teradata Aster to allow developers to write powerful and highly expressive SQL-MapReduce functions in languages such as Java, C#, Python, C++, and R and push them into the discovery platform for high performance analytics. Analysts can then invoke SQL-MapReduce functions using standard SQL or R through Aster Database, the first discovery platform that allows applications to be fully embedded within the database engine to enable ultra-fast, deep analysis of massive data sets.
SAP HANA Vora – an in-memory query engine for Apache Spark and Hadoop.
Extends Oracle SQL to Hadoop and NoSQL and the security of Oracle Database to all your data. It also includes a unique Smart Scan service that minimizes data movement and maximizes performance.
Cascading Lingual is a powerful extension to Cascading that simplifies application development and integration by providing an ANSI SQL interface for Apache Hadoop. Now you can connect existing business intelligence (BI) tools, optimize computing costs, and accelerate application development with Hadoop.
RainStor takes it a step further and provides an end-to-end application that runs on Hadoop 2.0, enabling organizations to get up and running quickly and achieve faster time to business value. RainStor has supported Hadoop since 2011 and in fact was the first to announce native SQL on Hadoop and building upon these capabilities also provided enterprise-grade security since mid 2013. Extending upon that, RainStor 6 delivers full management capabilities that include YARN integration making it a first class citizen on Hadoop.
Splout allows serving an arbitrarily big datasetwith high QPS rates and at the same time providesfull SQL query syntax. Splout is appropriated for web page serving, many low latency lookups, scenarios such as mobile or web applications with demanding performance.
Jethro is a SQL engine that super-powers your BI Tool software by taking a unique approach to accessing your big data. Our approach empowers you to analyze and interact with your data in real time.
Splice Machine delivers a database solution that incorporates the scalability of Hadoop, the standard ANSI SQL and ACID transactions of an RDBMS and the distributed computing power of HBase.