SQL on Hadoop


CitusDB: The Scalable Analytics Database

CitusDB is a distributed database that lets you run SQL queries over very large data sets. Designed for analytical queries, CitusDB enables real-time responsiveness.

Drill (Apache)

Apache Drill is a distributed system for interactive analysis of large-scale datasets, based on Google's Dremel. Its goal is to process nested data efficiently. Its design goals are to scale to 10,000 servers or more and to process petabytes of data and trillions of records in seconds.

Giraph follows the bulk-synchronous parallel (BSP) model for graph processing, in which vertices can send messages to other vertices during a given superstep. The Giraph infrastructure initiates checkpoints at user-defined intervals and uses them to restart the application automatically when any worker fails. Any worker in the application can act as the application coordinator, and one will automatically take over if the current coordinator fails.
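The superstep-and-message idea can be illustrated with a minimal sketch in plain Python. This is not Giraph's API (Giraph is a Java framework); the function name `bsp_max_value` and its parameters are hypothetical, and checkpointing and coordinator failover are not modeled. Each vertex holds a value, and in every superstep a vertex reads the messages sent to it in the previous superstep, keeps the largest value it has seen, and notifies its neighbors only when its own value changed.

```python
from collections import defaultdict


def bsp_max_value(values, edges, max_supersteps=50):
    """Spread the largest vertex value through each connected component.

    values: dict mapping vertex id -> initial value
    edges:  iterable of (u, v) pairs, treated as undirected;
            every endpoint is assumed to be a key of `values`
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)

    state = dict(values)

    # Superstep 0: every vertex announces its value to its neighbors.
    inbox = defaultdict(list)
    for vertex, value in state.items():
        for n in neighbors[vertex]:
            inbox[n].append(value)

    supersteps = 1
    # Stop when no messages are in flight (or at the safety limit).
    while inbox and supersteps < max_supersteps:
        outbox = defaultdict(list)
        for vertex, messages in inbox.items():
            best = max(messages)
            if best > state[vertex]:
                # The value changed, so tell the neighbors next superstep.
                state[vertex] = best
                for n in neighbors[vertex]:
                    outbox[n].append(best)
        # Barrier: messages sent now become visible only in the next superstep.
        inbox = outbox
        supersteps += 1

    return state, supersteps
```

For example, `bsp_max_value({1: 3, 2: 6, 3: 2, 4: 1}, [(1, 2), (2, 3), (3, 4)])` leaves every vertex of the chain holding the value 6. In a real BSP system the per-vertex loop runs in parallel across workers; the sequential loop here only stands in for that.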

Pivotal Advanced Database Services, powered by HAWQ, add SQL’s expressive power to Hadoop. By adding rich, mature SQL processing, Pivotal HD leverages existing BI and analytics products and your workforce’s SQL skills to simplify development, expand Hadoop’s capabilities, increase productivity, and cut costs.

Hadapt unifies SQL and Hadoop, enabling customers to analyze all of their data (structured, unstructured, and multi-structured) in a single platform – no connectors, complexities, or rigid structure.

The most powerful query engine, built on the proven parallel database technology of Greenplum Database, for analyzing massive amounts of data in Hadoop using industry-standard SQL constructs.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop compatible file systems. Hive provides a mechanism to project structure onto this data and query the data using a SQL-like language called HiveQL. At the same time this language also allows traditional map/reduce programmers to plug in their custom mappers and reducers when it is inconvenient or inefficient to express this logic in HiveQL.
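One way Hive lets programmers plug in a custom mapper is its TRANSFORM clause, which streams rows to an external script as tab-separated lines on standard input and reads tab-separated rows back from standard output. The sketch below shows what such a script's logic might look like. The table, columns, and script name (`page_views`, `user_id`, `url`, `host_mapper.py`) are hypothetical, as is the host-extraction logic; a query using it could look like `SELECT TRANSFORM(user_id, url) USING 'python host_mapper.py' AS (user_id, host) FROM page_views`.

```python
def transform_line(line):
    """Turn one tab-separated input row (user_id, url) into (user_id, host)."""
    user_id, url = line.rstrip("\n").split("\t")
    if "://" in url:
        # "http://example.com/a" splits into ["http:", "", "example.com", "a"].
        host = url.split("/")[2]
    else:
        # No scheme: the host is everything before the first slash.
        host = url.split("/")[0]
    return f"{user_id}\t{host}"


def run(stream_in, stream_out):
    """Read rows from stream_in and write transformed rows to stream_out."""
    for line in stream_in:
        if line.strip():  # skip blank lines
            stream_out.write(transform_line(line) + "\n")
```

When used as an actual script, the file would end by calling `run(sys.stdin, sys.stdout)` after `import sys`. Keeping the row logic in `transform_line` makes it easy to test without a cluster.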

Hive: Originally developed by Facebook, Hive is basically a data warehouse. It sits on top of your Hadoop cluster and provides a SQL-like interface to the data stored there. You can write SQL-style queries in Hive's query language, HiveQL, to perform operations such as store, select, join, and much more. It makes processing a lot easier because you don't have to do lengthy, tedious coding: write simple Hive queries and get the results. Isn't that cool? RDBMS folks will definitely love it. Simply map HDFS files to Hive tables and start querying the data. You can also map HBase tables and operate on that data.

Tip: Use Hive when you have warehousing needs, are good at SQL, and don't want to write MapReduce jobs. One important point, though: under the hood, Hive queries are converted into corresponding MapReduce jobs that run on your cluster and return the result. Hive does the trick for you. Not every problem can be solved with HiveQL, however. If you need really fine-grained, complex processing, you may have to fall back on MapReduce.
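To make the "converted into MapReduce under the hood" point concrete, here is a rough sketch in plain Python of how a grouping query such as `SELECT page, COUNT(*) FROM page_views GROUP BY page` decomposes into map, shuffle, and reduce phases. It illustrates the idea only and is not Hive's actual query planner; the table name `page_views`, the column `page`, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit a (group key, 1) pair for every input row."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Shuffle: collect all values that share a key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: sum the ones for each key, which yields COUNT(*) per group."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Chain the three phases for the GROUP BY query sketched above."""
    return reduce_phase(shuffle_phase(map_phase(rows)))
```

Calling `count_by_page` on three rows with pages `/home`, `/about`, and `/home` gives a count of 2 for `/home` and 1 for `/about`. Writing this by hand for every query is the "lengthy, tedious coding" that a one-line HiveQL statement avoids.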

Cloudera Impala™ provides fast, interactive SQL queries directly on your Apache Hadoop data stored in HDFS or HBase. In addition to using the same unified storage platform, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax) as Apache Hive. This provides a familiar and unified platform for batch-oriented or real-time queries.

Lingual executes ANSI SQL queries as Cascading applications on Apache Hadoop clusters.

MRQL (pronounced miracle) is a query processing and optimization system for large-scale, distributed data analysis. MRQL (the MapReduce Query Language) is an SQL-like query language for large-scale data analysis on a cluster of computers. The MRQL query processing system can evaluate MRQL queries in two modes: in MapReduce mode on top of Apache Hadoop or in Bulk Synchronous Parallel (BSP) mode on top of Apache Hama.

Phoenix is a new and relatively unknown open source project that comes out of Salesforce.com and aims to allow fast SQL queries of data stored in HBase, the NoSQL database built atop HDFS. Its stated mission: “Become the standard means of accessing HBase data through a well-defined, industry standard API.” Users interact with it through JDBC interfaces, and its developers claim sub-second response times for small queries and response times of seconds for queries over tens of millions of rows.

The RainStor database is designed for massive scalability and continuous access for fast query and analytics, with built-in enterprise-grade security. RainStor is built for efficiency and therefore provides significant cost savings, and it can run on premises, across multiple data centers, or in the cloud. RainStor is the only enterprise database that magnifies the cost and performance benefits you gain when managing Big Data: it enables you to load data very fast, achieve significant data compression, and continuously query and analyze data with the most flexible approaches. RainStor’s patented technology enables you to store, manage, and analyze extreme data volumes much more efficiently and cost-effectively than standard databases and data warehouses.

Enabling Hive to answer human-time use cases (i.e., queries in the 5-30 second range), such as big data exploration, visualization, and parameterized reports, without resorting to yet another tool to install, maintain, and learn can deliver a lot of value to the large community of users with existing Hive skills and investments.

Giraph (Apache)

Greenplum (EMC)

Impala (Cloudera)

Stinger Initiative