Framework / Languages

Cascading is a Java application framework that enables developers to quickly and easily build rich, enterprise-grade data processing and machine learning applications that can be deployed and managed across private or cloud-based Apache Hadoop clusters and API-compatible distributions.

DataFu is a collection of Pig UDFs (user-defined functions) for data analysis on Hadoop.

Decomposer is a collection of matrix decomposition algorithms for extremely large matrices, implemented in Java. It currently contains a Singular Value Decomposition (SVD) implementation, and the library is in the process of being absorbed into the Apache Mahout project.
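To make the core operation concrete, here is a minimal dense SVD sketch using NumPy. This is illustrative only and is not Decomposer's API; Decomposer exists precisely because matrices of its target scale cannot be decomposed in memory like this.

```python
import numpy as np

# Toy matrix; Decomposer targets matrices far too large for this approach.
A = np.array([[3.0, 1.0],
              [1.0, 3.0]])

# Factor A into U * diag(s) * Vt, where s holds the singular values.
U, s, Vt = np.linalg.svd(A)

# Multiplying the factors back together recovers the original matrix.
A_rec = U @ np.diag(s) @ Vt
print(np.allclose(A, A_rec))  # True
```

The singular values come back in descending order; for this symmetric example they are 4 and 2.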

Glue is a job execution engine written in Java and Groovy. Workflows are written in a Groovy DSL (simple statements) and use pre-developed modules to interact with external resources, e.g. databases, Hadoop, Netezza, FTP, etc.

The Apache Gora open source framework provides an in-memory data model and persistence for big data. Gora supports persisting to column stores, key-value stores, document stores, and RDBMSs, and analyzing the data with extensive Apache Hadoop MapReduce support.

Apache Hama is a pure BSP (Bulk Synchronous Parallel) computing framework on top of HDFS (Hadoop Distributed File System) for massive scientific computations such as matrix, graph and network algorithms.
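The BSP model Hama implements proceeds in supersteps: each peer computes locally, sends messages, and waits at a barrier before the next superstep. The following is a toy illustration in plain Python (not Hama's API; the function and data are invented for the example).

```python
# One-superstep BSP sketch: every "peer" learns the global sum of all values.
def bsp_sum(values_per_peer):
    # Superstep 0: each peer computes a local sum and sends it to every peer.
    local_sums = [sum(v) for v in values_per_peer]
    # Barrier: all messages are delivered before the next superstep begins.
    inboxes = [list(local_sums) for _ in values_per_peer]
    # Superstep 1: each peer combines the messages it received.
    return [sum(inbox) for inbox in inboxes]

print(bsp_sum([[1, 2], [3, 4], [5]]))  # every peer reports 15
```

Real Hama jobs express the same compute / communicate / synchronize cycle across a cluster, which suits iterative matrix, graph, and network algorithms better than chained MapReduce passes.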

Mortar is an open source framework for open languages: Pig, Java, and real Python (including NumPy, SciPy, NLTK, etc.). Use the skills you already have, on Hadoop.

Pattern is a free, open source, standards-based scoring engine that enables analysts and data scientists to quickly deploy machine-learning applications on Apache Hadoop™. Leveraging the power and broad platform support of the Cascading application framework, Pattern lowers the barrier to Hadoop adoption by enabling companies to leverage existing intellectual property (IP) in predictive models, existing investments in software tooling, and the core competencies of existing analytics staff to run Big Data applications from existing machine-learning models, using Predictive Model Markup Language (PMML) or a simple programming interface.

Apache Pig is a platform for analyzing large data sets that consists of a high-level language for expressing data analysis programs, coupled with infrastructure for evaluating these programs. The salient property of Pig programs is that their structure is amenable to substantial parallelization, which in turn enables them to handle very large data sets.

Pig is a dataflow language that lets you process enormous amounts of data easily and quickly by transforming it in a series of steps. It has two main parts: the Pig interpreter and the language itself, Pig Latin. Pig was originally developed at Yahoo, which uses it extensively. Like Hive queries, Pig Latin queries are converted into MapReduce jobs that produce the result. You can use Pig for data stored in both HDFS and HBase very conveniently. Like Hive, Pig is also really efficient at what it is meant to do: it saves a lot of effort and time by letting you express operations as straightforward Pig queries instead of writing MapReduce programs.

Tip: Use Pig when you want to do a lot of transformations on your data and don't want to take the pain of writing MapReduce jobs.
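To give a feel for the dataflow style, here is a load / filter / group / count pipeline sketched in plain Python, with comments noting the Pig Latin statement each step mirrors. The data and field names are invented for illustration; this is not Pig Latin itself.

```python
# Each step mirrors a Pig statement: LOAD, FILTER, GROUP, FOREACH ... COUNT.
records = [("alice", 34), ("bob", 17), ("carol", 52), ("dave", 17)]  # LOAD

adults = [r for r in records if r[1] >= 18]          # FILTER records BY age >= 18

by_age = {}                                          # GROUP adults BY age
for name, age in adults:
    by_age.setdefault(age, []).append(name)

counts = {age: len(names) for age, names in by_age.items()}  # FOREACH ... COUNT
print(counts)  # {34: 1, 52: 1}
```

In Pig, each of these transformations is a single statement, and the interpreter compiles the whole chain into MapReduce jobs for you.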


Scalding is a Scala API for Cascading. Cascading is a thin Java library and API that sits on top of Apache Hadoop's MapReduce layer. Scalding comprises two main components:

  • a DSL that makes MapReduce computations look very similar to Scala's collections API
  • a wrapper for Cascading that makes it simpler to define the typical use cases of jobs, tests, and describing data sources on a Hadoop Distributed File System (HDFS) or local disk
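The collections-style flavor of that DSL can be hinted at in plain Python: a word count shaped as flatMap, then group, then size. This is only an analogy, not Scalding's actual API.

```python
from collections import Counter

lines = ["hadoop makes mapreduce", "scalding makes mapreduce simple"]

# flatMap: split every line into words, flattening into one stream.
words = (w for line in lines for w in line.split())

# groupBy + size: count occurrences per word.
counts = Counter(words)

print(counts["mapreduce"])  # 2
```

In Scalding the same pipeline reads like operations on an ordinary Scala collection, while under the hood Cascading plans it as MapReduce over HDFS.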


Tez is an effort to develop a generic application framework that can be used to process arbitrarily complex data-processing tasks, along with a reusable set of data-processing primitives that can be used by other projects.