The Apache Ambari project is aimed at making Hadoop management simpler by developing software for provisioning, managing, and monitoring Apache Hadoop clusters. Ambari provides an intuitive, easy-to-use Hadoop management web UI backed by its RESTful APIs.

The set of Hadoop components that are currently supported by Ambari includes:

HDFS, MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, Sqoop

Ambari enables System Administrators to:

  • Provision a Hadoop Cluster
    • Ambari provides a step-by-step wizard for installing Hadoop services across any number of hosts.
    • Ambari handles configuration of Hadoop services for the cluster.
  • Manage a Hadoop Cluster
    • Ambari provides central management for starting, stopping, and reconfiguring Hadoop services across the entire cluster.
  • Monitor a Hadoop Cluster
    • Ambari provides a dashboard for monitoring health and status of the Hadoop cluster.
    • Ambari leverages Ganglia for metrics collection.
    • Ambari leverages Nagios for system alerting and will send emails when your attention is needed (e.g., a node goes down, remaining disk space is low, etc.).

Ambari enables Application Developers and System Integrators to:

  • Easily integrate Hadoop provisioning, management, and monitoring capabilities into their own applications with the Ambari REST APIs.
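
As a rough illustration of the REST integration point above, the following minimal sketch builds (but does not send) an authenticated request for a cluster listing. The host name, the credentials, and the helper name build_ambari_request are placeholders invented for this example; the /api/v1 prefix and port 8080 are assumed to match a default Ambari server setup and should be checked against your installation.

```python
import base64
import urllib.request


def build_ambari_request(host, path, user, password, port=8080):
    """Build (but do not send) a GET request for an Ambari REST endpoint.

    The helper name, host, and credentials are hypothetical; only the
    standard library is used.
    """
    url = f"http://{host}:{port}/api/v1/{path.lstrip('/')}"
    token = base64.b64encode(f"{user}:{password}".encode("utf-8")).decode("ascii")
    req = urllib.request.Request(url, method="GET")
    # HTTP Basic authentication header.
    req.add_header("Authorization", f"Basic {token}")
    # Ambari expects this header on modifying requests; it is harmless on GET.
    req.add_header("X-Requested-By", "ambari")
    return req


# Placeholder host and credentials, for illustration only.
req = build_ambari_request("ambari.example.com", "clusters", "admin", "admin")
print(req.full_url)
```

Passing the returned object to urllib.request.urlopen would send the request. The sketch stops short of that so it can be read and run without a live server.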

A batch job scheduling system with a friendly UI, Azkaban aims to make batch programming easy and visually appealing.

Bigtop is a project for the development of packaging and tests of the Apache Hadoop ecosystem.

The primary goal of Bigtop is to build a community around the packaging and interoperability testing of Hadoop-related projects. This includes testing at various levels (packaging, platform, runtime, upgrade, etc.) developed by a community with a focus on the system as a whole, rather than individual projects.

Corona is a new scheduling framework that separates cluster resource management from job coordination.[1] Corona introduces a cluster manager whose only purpose is to track the nodes in the cluster and the amount of free resources.
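
The idea of a cluster manager that only tracks nodes and free resources can be sketched as a toy model. This is not Corona's code or API: the class name, the method names, and the use of "slots" as the resource unit are all assumptions made for illustration.

```python
class ClusterManager:
    """Toy cluster manager: tracks nodes and their free slots, nothing else.

    Job coordination is deliberately absent; callers only ask for and
    return resources. All names here are hypothetical.
    """

    def __init__(self):
        self.free_slots = {}  # node name -> number of free slots

    def register_node(self, node, slots):
        self.free_slots[node] = slots

    def total_free(self):
        return sum(self.free_slots.values())

    def grant(self, wanted):
        """Hand out up to `wanted` slots; return a list of (node, count) grants."""
        grants = []
        for node in sorted(self.free_slots):
            if wanted == 0:
                break
            take = min(self.free_slots[node], wanted)
            if take > 0:
                self.free_slots[node] -= take
                wanted -= take
                grants.append((node, take))
        return grants

    def release(self, node, count):
        """Return previously granted slots to a node."""
        self.free_slots[node] += count


manager = ClusterManager()
manager.register_node("node-1", 4)
manager.register_node("node-2", 2)
grants = manager.grant(5)
print(grants)                # slots handed out, per node
print(manager.total_free())  # slots still free across the cluster
```

In this model a separate per-job component would request slots through grant and hand them back through release, which is the separation of concerns the paragraph above describes.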

The Apache Crunch Java library provides a framework for writing, testing, and running MapReduce pipelines. Its goal is to make pipelines that are composed of many user-defined functions simple to write, easy to test, and efficient to run.

Falcon is a data processing and management solution for Hadoop designed for data motion, coordination of data pipelines, lifecycle management, and data discovery. Falcon enables end consumers to quickly onboard their data and its associated processing and management tasks on Hadoop clusters.

Ganglia is a scalable distributed monitoring system for high-performance computing systems such as clusters and Grids. It is based on a hierarchical design targeted at federations of clusters. It leverages widely used technologies such as XML for data representation, XDR for compact, portable data transport, and RRDtool for data storage and visualization. It uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency. The implementation is robust, has been ported to an extensive set of operating systems and processor architectures, and is currently in use on thousands of clusters around the world. It has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000 nodes.

Apache HCatalog is a table and storage management service for data created using Apache Hadoop.

The Hadoop Development Tools (HDT) is a set of plugins for the Eclipse IDE for developing against the Hadoop platform.

Helix, built on top of Apache ZooKeeper, is a generic cluster management framework for automatic management of partitioned, replicated and distributed resources hosted on a cluster of nodes.

Hue is an open source UI for making it easier to use Apache Hadoop.

Hue features a File Browser for HDFS, a Job Designer/Browser for MapReduce, query editors for Hive, Pig and Cloudera Impala, an Oozie Application for creating workflows, various Shells, and a collection of Hadoop APIs.

Apache Mesos is a cluster manager that provides efficient resource isolation and sharing across distributed applications, or frameworks.

Apache MRUnit™ is a Java library that helps developers unit test Apache Hadoop MapReduce jobs.

Norbert, implemented in Scala, wraps ZooKeeper and Netty and uses Protocol Buffers to provide easy cluster management and workload distribution. It is claimed to be capable of “quickly distribut(ing) a simple client/server architecture to create a highly scalable architecture capable of handling heavy traffic”.

Oozie is a workflow scheduler system to manage Apache Hadoop jobs.

Once everything is in place and you want to start processing, launching jobs and managing the workflow by hand quickly becomes impractical, especially when multiple MapReduce jobs must be chained together to achieve a goal. Oozie automates this. It is a scalable, reliable, and extensible workflow scheduler system. You define your workflows (which are directed acyclic graphs) once, and Oozie takes care of the rest. You can schedule MapReduce jobs, Pig jobs, Hive jobs, Sqoop imports, and even your own Java programs using Oozie.

Tip: Use Oozie when you have a lot of jobs to run and want an efficient way to automate them based on time (frequency) and data availability.
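
To make the "workflow as a directed acyclic graph" idea concrete, here is a minimal sketch that models a workflow as a dependency map and derives a valid execution order. It does not use Oozie or its XML workflow format; the job names and helper functions are hypothetical, and the ordering relies on Python's standard graphlib module (Python 3.9+).

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical workflow: each job maps to the set of jobs it depends on.
workflow = {
    "sqoop_import": set(),
    "pig_clean": {"sqoop_import"},
    "hive_aggregate": {"pig_clean"},
    "mapreduce_report": {"pig_clean"},
    "java_publish": {"hive_aggregate", "mapreduce_report"},
}


def execution_order(dag):
    """Return one order in which every job runs after its dependencies."""
    return list(TopologicalSorter(dag).static_order())


def is_valid_workflow(dag):
    """A workflow is valid only if its dependency graph has no cycle."""
    try:
        execution_order(dag)
        return True
    except CycleError:
        return False


order = execution_order(workflow)
print(order)
print(is_valid_workflow({"a": {"b"}, "b": {"a"}}))  # cyclic, so not a DAG
```

Jobs with no dependency between them (here hive_aggregate and mapreduce_report) may appear in either order, which is why a scheduler is free to run them in parallel.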

StackIQ Enterprise Data brings enterprise-class management to Big Data clusters. It was designed from the ground up to deploy and manage large-scale cluster infrastructure. It combines StackIQ’s cluster management solution with its Apache Hadoop™ management software, providing everything you need to install, configure, deploy, and manage your Apache Hadoop™ cluster, starting from bare metal. StackIQ Enterprise Data is intended to simplify building a robust, production-grade big data cluster that can reside in any enterprise data center.

A performance diagnostic tool for map/reduce jobs.

WANdisco's patented replication technology turns the NameNode into an active-active shared-nothing cluster that delivers optimum performance, scalability and availability on a 24-by-7 basis without any downtime or data loss.

Apache Whirr is a set of libraries for running cloud services.

Whirr provides:

  • A cloud-neutral way to run services. You don't have to worry about the idiosyncrasies of each provider.
  • A common service API. The details of provisioning are particular to the service.
  • Smart defaults for services. You can get a properly configured system running quickly, while still being able to override settings as needed.

You can also use Whirr as a command line tool for deploying clusters.

White Elephant is used to parse Hadoop logs and provide a visualization dashboard for Hadoop cluster statistics, including total task time, slots used, CPU time, and failed job counts. White Elephant’s server is a JRuby application that is also deployable on Tomcat; the data is stored in a HyperSQL in-memory database, and charts are rendered with Rickshaw.

Zettaset Orchestrator is a management platform built on enterprise software that automates, accelerates, and simplifies Hadoop installation and cluster management for Big Data deployments, and delivers faster time to value. Zettaset Orchestrator™ is not a Hadoop distribution, but operates as an independent management layer that sits on top of any open-source-based Hadoop distribution. Orchestrator meets the exacting requirements of enterprises for security, high availability, and performance within the Hadoop cluster environment.

  • Hardened to meet enterprise security, high availability, and performance requirements
  • Software automation simplifies Hadoop deployment and eliminates unnecessary dependencies on professional services
  • Dramatically lowers operational expenses by reducing IT resource requirements
  • Accelerates time to value from weeks to hours
  • Designed to manage any Apache Hadoop distribution and cluster environment

ZooKeeper is a centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services. All of these kinds of services are used in some form or another by distributed applications. Each time they are implemented, a lot of work goes into fixing the bugs and race conditions that are inevitable. Because of the difficulty of implementing these kinds of services, applications initially usually skimp on them, which makes them brittle in the presence of change and difficult to manage. Even when done correctly, different implementations of these services lead to management complexity when the applications are deployed.