
Operational Databases

 
Accumulo

http://accumulo.apache.org/


The Apache Accumulo™ sorted, distributed key/value store is a robust, scalable, high-performance data storage and retrieval system. Apache Accumulo is based on Google's BigTable design and is built on top of Apache Hadoop, ZooKeeper, and Thrift. Apache Accumulo adds a few novel improvements to the BigTable design in the form of cell-based access control and a server-side programming mechanism that can modify key/value pairs at various points in the data management process.
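
Example (a minimal, illustrative sketch of cell-level access control using the Accumulo 1.x Java client; the instance name, ZooKeeper host, table, credentials, and visibility expression below are placeholders, not taken from the Accumulo docs):

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.security.ColumnVisibility;

    public class AccumuloVisibilityExample {
        public static void main(String[] args) throws Exception {
            // Placeholder instance, ZooKeeper, and credential values.
            Connector connector = new ZooKeeperInstance("instance", "zkhost:2181")
                    .getConnector("user", new PasswordToken("secret"));

            BatchWriter writer = connector.createBatchWriter("records", new BatchWriterConfig());

            // Each cell carries its own visibility expression; only scanners whose
            // authorizations satisfy "admin&pii" will see this value.
            Mutation m = new Mutation("row1");
            m.put("attributes", "ssn", new ColumnVisibility("admin&pii"), "123-45-6789");
            writer.addMutation(m);
            writer.close();
        }
    }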

Avro

http://avro.apache.org/


 Apache Avro™ is a data serialization system.


Avro provides functionality similar to systems such as Protocol Buffers and Thrift, along with several other significant features: rich data structures; a compact, fast, binary data format; a container file for persistent data; an RPC mechanism; and simple integration with dynamic languages. Best of all, Avro can easily be used with MapReduce, Hive, and Pig. Avro uses JSON for defining data types (sketched below).

Tip : Use Avro when you want to serialize your Big Data with good flexibility.
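
Example (a minimal sketch of Avro's generic Java API; the schema and file name are invented for illustration):

    import java.io.File;
    import org.apache.avro.Schema;
    import org.apache.avro.file.DataFileWriter;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;

    public class AvroExample {
        public static void main(String[] args) throws Exception {
            // The schema is ordinary JSON: a record with two typed fields.
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"User\",\"fields\":["
              + "{\"name\":\"name\",\"type\":\"string\"},"
              + "{\"name\":\"age\",\"type\":\"int\"}]}");

            GenericRecord user = new GenericData.Record(schema);
            user.put("name", "Ada");
            user.put("age", 36);

            // Write a self-describing container file: the schema travels with the data.
            DataFileWriter<GenericRecord> writer =
                new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema));
            writer.create(schema, new File("users.avro"));
            writer.append(user);
            writer.close();
        }
    }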

Cassandra

http://cassandra.apache.org/

 
The Apache Cassandra database is the right choice when you need scalability and high availability without compromising performance. Linear scalability and proven fault-tolerance on commodity hardware or cloud infrastructure make it the perfect platform for mission-critical data. Cassandra's support for replicating across multiple datacenters is best-in-class, providing lower latency for your users and the peace of mind of knowing that you can survive regional outages.

Cassandra's data model offers the convenience of column indexes with the performance of log-structured updates, strong support for denormalization and materialized views, and powerful built-in caching. 
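
Example (an illustrative sketch of that data model through the DataStax Java driver, assumed 3.x; the keyspace, table, index, and contact point are invented):

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class CassandraExample {
        public static void main(String[] args) {
            // Contact point is a placeholder for a real cluster node.
            try (Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
                 Session session = cluster.connect()) {

                session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                    + "{'class': 'SimpleStrategy', 'replication_factor': 1}");

                session.execute("CREATE TABLE IF NOT EXISTS demo.users ("
                    + "user_id uuid PRIMARY KEY, email text, name text)");

                // A secondary index gives column-index convenience on the
                // log-structured store described above.
                session.execute("CREATE INDEX IF NOT EXISTS ON demo.users (email)");

                session.execute("INSERT INTO demo.users (user_id, email, name) "
                    + "VALUES (uuid(), 'ada@example.com', 'Ada')");
            }
        }
    }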


Drawn to Scale

http://drawntoscale.com/


 Spire is the first database for large, user-facing applications built on Hadoop. Supporting SQL and MongoDB queries in addition to MapReduce, Spire is built to power large-scale websites, mobile apps, and machine-to-machine data — without sacrificing analytics.

Unlike any other Hadoop and SQL solution, Spire scales to tens of thousands of reads and writes per second, with full ANSI SQL and intuitive management tools.

[Company Closed May 2013]


HBase

http://hbase.apache.org/

 

Apache HBase™ is the Hadoop database, a distributed, scalable, big data store.

When Would I Use Apache HBase?

Use Apache HBase when you need random, realtime read/write access to your Big Data. This project's goal is the hosting of very large tables -- billions of rows X millions of columns -- atop clusters of commodity hardware. Apache HBase is an open-source, distributed, versioned, column-oriented store modeled after Google's Bigtable: A Distributed Storage System for Structured Data by Chang et al. Just as Bigtable leverages the distributed data storage provided by the Google File System, Apache HBase provides Bigtable-like capabilities on top of Hadoop and HDFS. 


HBase is a distributed, scalable, big data store modelled after Google's BigTable. It stores data as key/value pairs. It is essentially a NoSQL database, and like any other database its biggest advantage is that it provides random read/write capabilities. As mentioned earlier, Hadoop itself is not very good for real-time needs, so HBase can serve that purpose: if you have data that must be accessed in real time, you can store it in HBase. HBase also has its own very good API, sketched below, which can be used to push and pull data. Not only that, HBase integrates seamlessly with MapReduce, so you can run bulk operations such as indexing and analytics.

Tip : You could use Hadoop as the repository for your static data and HBase as the datastore for data that is likely to change over time after some processing.
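
Example (a minimal sketch of that push/pull API using the HBase 1.x Java client; the table "metrics" and column family "d" are invented and must already exist):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("metrics"))) {

                // Random, real-time write: one cell in column family "d".
                Put put = new Put(Bytes.toBytes("row-001"));
                put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
                table.put(put);

                // Random, real-time read of the same cell.
                Result result = table.get(new Get(Bytes.toBytes("row-001")));
                byte[] value = result.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"));
                System.out.println(Bytes.toString(value));
            }
        }
    }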

Lily

http://www.lilyproject.org/lily/index.html

 
 
Lily is Smart Data, at Scale, made Easy.

Lily is a data management platform combining planet-scale data storage, indexing, and search with online, real-time usage tracking, audience analytics, and content recommendations. It is a one-stop platform for any organization confronted with Big Data challenges that seeks rapid implementation, rock-solid performance at scale, and efficient management.

Lily unifies Apache HBase, Hadoop, and Solr into a comprehensively integrated, interactive data platform with easy-to-use access APIs, a high-level data model and schema language, flexible real-time indexing, and the expressive search power of Apache Solr. Best of all, Lily is open source, allowing anyone to explore and learn what Lily can do.

 
Parquet

http://parquet.io/

 

Parquet is a columnar storage format for Hadoop.

We created Parquet to make the advantages of compressed, efficient columnar data representation available to any project in the Hadoop ecosystem, regardless of the choice of data processing framework, data model, or programming language. 
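
Example (a hedged sketch writing Parquet through the parquet-avro binding, one of several data-model bindings; the schema and file path are invented, and the org.apache.parquet package names assume a newer release than the original parquet.io code line):

    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.hadoop.fs.Path;
    import org.apache.parquet.avro.AvroParquetWriter;
    import org.apache.parquet.hadoop.ParquetWriter;

    public class ParquetExample {
        public static void main(String[] args) throws Exception {
            Schema schema = new Schema.Parser().parse(
                "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
              + "{\"name\":\"id\",\"type\":\"long\"},"
              + "{\"name\":\"source\",\"type\":\"string\"}]}");

            // Records go in row by row; Parquet lays them out column by column on disk.
            try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                    .<GenericRecord>builder(new Path("events.parquet"))
                    .withSchema(schema)
                    .build()) {
                GenericRecord event = new GenericData.Record(schema);
                event.put("id", 1L);
                event.put("source", "sensor-a");
                writer.write(event);
            }
        }
    }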


Splice Machine

http://www.splicemachine.com/


Splice Machine enables you to build Big Data applications with all the benefits of NoSQL databases while still leveraging standard SQL. The result: massive scalability, flexible schema, fault tolerance and high availability along with easy integration to BI tools.

Sqrrl

http://sqrrl.com/
 

sqrrl is the creator of sqrrl enterprise, which is a secure, massively scalable operational data store powered by Apache Accumulo. sqrrl enterprise is the only solution that offers fine-grained security controls, petabyte scalability for multi-structured data, native Hadoop integration, and diverse analytic capabilities, including full-text search, statistics, and graph analysis. 

TempoDB

https://tempo-db.com/
 

Time series data storage

- Simple REST API offers easy data storage and retrieval
- No downsampling, so data is stored at full resolution (1 ms maximum)
- Automated scaling stores as much data as you need
- 3x data replication guarantees data availability
- Hosted or deployed solutions with multiple providers

Time series data analysis

- Constant query times enable predictable performance
- Rollups (15 min, 7 day, etc.) provide the right level of detail
- Range summaries generate statistical answers automatically
- Timezone support enables flexible reporting and analysis
- Tags and attributes make it easy to filter and add context to data


Thrift

http://wiki.apache.org/hadoop/Hbase/ThriftApi
 
 
Apache Thrift is a cross-language RPC and serialization framework; in the Hadoop ecosystem it is most commonly encountered as a client API gateway for HBase.

Trevni

http://avro.apache.org/docs/current/trevni/spec.html
 
 

Trevni is a column file format; its specification is maintained as part of the Apache Avro project.

 
 
Voldemort

http://www.project-voldemort.com/voldemort/
 

Voldemort is a distributed key-value storage system, originally developed at LinkedIn.
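
Example (a minimal sketch following the project's quickstart; the bootstrap URL and the "test" store come from Voldemort's sample configuration and may differ in a real deployment):

    import voldemort.client.ClientConfig;
    import voldemort.client.SocketStoreClientFactory;
    import voldemort.client.StoreClient;
    import voldemort.client.StoreClientFactory;

    public class VoldemortExample {
        public static void main(String[] args) {
            // Bootstrap against a running Voldemort node (single-node sample config).
            StoreClientFactory factory = new SocketStoreClientFactory(
                new ClientConfig().setBootstrapUrls("tcp://localhost:6666"));

            // "test" is the store defined in the sample stores.xml.
            StoreClient<String, String> client = factory.getStoreClient("test");

            client.put("hello", "world");            // versioned put
            String value = client.getValue("hello"); // fetch current value
            System.out.println(value);
        }
    }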

 
