
Transfer

 
Chukwa

http://incubator.apache.org/chukwa/
 

Chukwa is an open source data collection system for monitoring large distributed systems. Chukwa is built on top of the Hadoop Distributed File System (HDFS) and Map/Reduce framework and inherits Hadoop’s scalability and robustness. Chukwa also includes a flexible and powerful toolkit for displaying, monitoring and analyzing results to make the best use of the collected data.


Flume/Chukwa : Both Flume and Chukwa are data aggregation tools that let you aggregate data in an efficient, reliable and distributed manner. You pick data up from some source and dump it into your cluster. Since you are handling big data, it makes sense to do this in a distributed, parallel fashion, which both tools are very good at. You simply define your flows and feed them to the tool; the rest is handled automatically (a sample flow definition appears under the Flume entry below).

Tip : Go for Flume/Chukwa when you have to aggregate huge amounts of data into your Hadoop environment in a distributed and parallel manner.
 
Flume

http://flume.apache.org/
 

Flume is a distributed, reliable, and available service for efficiently collecting, aggregating, and moving large amounts of log data. It has a simple and flexible architecture based on streaming data flows. It is robust and fault tolerant with tunable reliability mechanisms and many failover and recovery mechanisms. It uses a simple extensible data model that allows for online analytic application.


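To illustrate what "defining a flow" looks like, here is a minimal sketch of a Flume NG agent configuration that tails an application log into HDFS. The agent name, log path and namenode address are hypothetical; the property keys follow Flume's standard source/channel/sink configuration format.

    # Hypothetical agent "agent1" with one source, one channel and one sink
    agent1.sources  = src1
    agent1.channels = ch1
    agent1.sinks    = sink1

    # Source: tail an application log (path is an example)
    agent1.sources.src1.type     = exec
    agent1.sources.src1.command  = tail -F /var/log/myapp/app.log
    agent1.sources.src1.channels = ch1

    # Channel: buffer events in memory (a file channel gives stronger durability)
    agent1.channels.ch1.type     = memory
    agent1.channels.ch1.capacity = 10000

    # Sink: write events into HDFS, bucketed by day (namenode host is an example)
    agent1.sinks.sink1.type    = hdfs
    agent1.sinks.sink1.channel = ch1
    agent1.sinks.sink1.hdfs.path     = hdfs://namenode:8020/flume/events/%Y-%m-%d
    agent1.sinks.sink1.hdfs.fileType = DataStream
    agent1.sinks.sink1.hdfs.useLocalTimeStamp = true

The agent is then started by pointing the flume-ng launcher at this file and the agent name, e.g. flume-ng agent --conf conf --conf-file agent1.conf --name agent1.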
  
Kafka

http://kafka.apache.org/

Kafka provides a publish-subscribe solution that can handle all activity stream data and processing on a consumer-scale web site. This kind of activity (page views, searches, and other user actions) is a key ingredient in many of the social features on the modern web. This data is typically handled by "logging" and ad hoc log aggregation solutions because of the throughput requirements. Such ad hoc solutions are viable for feeding logging data to an offline analysis system like Hadoop, but are very limiting for building real-time processing. Kafka aims to unify offline and online processing by providing a mechanism for parallel load into Hadoop as well as the ability to partition real-time consumption over a cluster of machines.
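As a sketch of how activity events are published, here is a minimal Java example using Kafka's producer client. The broker address, topic name and event payload are hypothetical, and the exact client API differs between Kafka versions.

    import java.util.Properties;
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;

    public class PageViewPublisher {
        public static void main(String[] args) {
            // Broker address and serializers for the producer (host/port are examples)
            Properties props = new Properties();
            props.put("bootstrap.servers", "broker1:9092");
            props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
            props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
                // Publish one page-view event to a hypothetical "page-views" topic;
                // the key (a user id here) picks the partition, which is what lets
                // real-time consumption be spread over a cluster of machines.
                producer.send(new ProducerRecord<>("page-views", "user-42", "/products/123"));
            }
        }
    }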

Loom

http://www.revelytix.com/?q=content/loom


Loom provides dynamic dataset management for Hadoop.
  
Lucene

http://lucene.apache.org/

Apache Lucene provides Java-based indexing and search technology, as well as spellchecking, hit highlighting and advanced analysis/tokenization capabilities.
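A minimal sketch of indexing and searching with the Lucene Java API. The field name, sample text and query string are made up, and classes such as ByteBuffersDirectory apply to recent Lucene versions (older releases use RAMDirectory instead).

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            // Index one document in an in-memory directory
            Directory dir = new ByteBuffersDirectory();
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
                Document doc = new Document();
                doc.add(new TextField("body", "Hadoop moves and processes large data sets", Field.Store.YES));
                writer.addDocument(doc);
            }

            // Search the index with a parsed query
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Query query = new QueryParser("body", new StandardAnalyzer()).parse("hadoop");
                TopDocs hits = searcher.search(query, 10);
                System.out.println("matches: " + hits.totalHits);
            }
        }
    }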
  
Nutch

http://nutch.apache.org/

Apache Nutch is an open source web-search software project.
   
RushLoader

http://bigdata.pervasive.com/Products/RushLoader-for-Hadoop.aspx


Pervasive RushLoader for Hadoop is a free, quick and easy way to get your data into the Hadoop Distributed File System (HDFS).
Hadoop was created to store and process massive amounts of data, but first you have to get that data into HDFS. RushLoader for Hadoop solves that problem.
Powered by the parallel performance of the Pervasive DataRush engine, and built into the easy-to-use open source KNIME interface, Pervasive RushLoader for Hadoop gets your data into Hadoop fast, at no cost.
  • Accesses data from a variety of data sources - standard RDBMSs, text files from many file systems (including the Amazon file system), ARFF formats, log files, PMML, etc.
  • Executes on any platform with a JVM - Windows, Mac, UNIX, ...
  • Automatically scales up with no redesign and no coding - from laptop to server to cluster
  • Loads data into the Hadoop Distributed File System (HDFS) at high parallel speeds
  • Free of charge
  
Scribe

https://github.com/facebook/scribe/wiki

Scribe is a server for aggregating streaming log data. It is designed to scale to a very large number of nodes and be robust to network and node failures.
 
SharePlex

http://www.quest.com/shareplex-for-oracle/
   

SharePlex is a replication and data integration solution for Oracle 10g and 11g and other databases that ensures business continuity while maximizing operational efficiencies. It provides a near real-time copy of production data without impacting your OLTP system’s performance and availability. You can replicate multiple copies of data on premises, remotely, or in the cloud to:

  • Ensure high availability and fast disaster recovery
  • Minimize risks associated with migrations
  • Improve performance of OLTP systems
  • Harness real-time reporting and data warehousing
  • Optimize the use of business intelligence applications
This high-performance, cost-effective alternative to other Oracle 10g and 11g replication tools delivers maximum availability. And only SharePlex provides data compare and repair, in-flight data integrity, plus monitoring and alerting functionality - all in one package.
  
Solr

http://lucene.apache.org/solr/

Solr is the popular, blazing-fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, near real-time indexing, dynamic clustering, database integration, rich document (e.g., Word, PDF) handling, and geospatial search. Solr is highly reliable, scalable and fault tolerant, providing distributed indexing, replication and load-balanced querying, automated failover and recovery, centralized configuration and more.
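A minimal sketch using the SolrJ Java client to index a document and run a full-text query against it. The Solr URL, core name and field names are hypothetical, and the builder-style client shown is from SolrJ 6 and later.

    import org.apache.solr.client.solrj.SolrQuery;
    import org.apache.solr.client.solrj.impl.HttpSolrClient;
    import org.apache.solr.client.solrj.response.QueryResponse;
    import org.apache.solr.common.SolrInputDocument;

    public class SolrSketch {
        public static void main(String[] args) throws Exception {
            // Point the client at a hypothetical local Solr core named "articles"
            try (HttpSolrClient solr = new HttpSolrClient.Builder("http://localhost:8983/solr/articles").build()) {
                // Index a single document
                SolrInputDocument doc = new SolrInputDocument();
                doc.addField("id", "1");
                doc.addField("title", "Moving data into Hadoop");
                solr.add(doc);
                solr.commit();

                // Full-text query against the title field
                QueryResponse response = solr.query(new SolrQuery("title:hadoop"));
                System.out.println("found: " + response.getResults().getNumFound());
            }
        }
    }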
  
Sqoop

http://sqoop.apache.org/


Apache Sqoop(TM) is a tool designed for efficiently transferring bulk data between Apache Hadoop and structured datastores such as relational databases.


Sqoop : Sqoop is a tool that allows you to transfer data between relational databases and Hadoop. It supports incremental loads of a single table or a free-form SQL query, as well as saved jobs that can be run multiple times to import updates made to a database since the last import. Imports can also be used to populate tables in Hive or HBase. Sqoop can likewise export data from the cluster back into the relational database.

Tip : Use Sqoop when you have lots of legacy data that you want stored and processed on your Hadoop cluster, or when you want to incrementally add data to your existing storage.
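For illustration, a sketch of an incremental import and a corresponding export from the Sqoop command line; the JDBC URL, credentials, tables and HDFS paths are hypothetical.

    # Incremental import: pull only rows added since the last recorded id
    sqoop import \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password-file /user/etl/.dbpass \
      --table orders \
      --target-dir /data/orders \
      --incremental append --check-column order_id --last-value 0 \
      -m 4

    # Export results from HDFS back into a relational table
    sqoop export \
      --connect jdbc:mysql://dbhost/sales \
      --username etl --password-file /user/etl/.dbpass \
      --table order_summaries \
      --export-dir /data/order_summaries

Adding --hive-import to the import command populates a Hive table as part of the same job.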

BigMemory

http://www.terracotta.org/



BigMemory is Terracotta's distributed in-memory data management solution, built for very low, predictable latency at large scale. A free edition is available, and the full edition can be trialed.


