MapReduce Framework and Basics:
- MapReduce is a software framework for processing data sets in a distributed fashion over several machines.
- Prior to Hadoop 2.0, MapReduce was the only way to process data in Hadoop.
- A MapReduce job usually splits the input data set into independent chunks, which are processed by the map tasks in a completely parallel manner.
- The core idea behind MapReduce is mapping your data set into a collection of <key, value> pairs, and then reducing over all pairs with the same key.
- The framework sorts the outputs of the maps, which are then input to the reduce tasks.
- Both the input and the output of the job are stored in a file system.
- The framework takes care of scheduling tasks, monitoring them, and re-executing failed tasks.
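The flow described above (split, map in parallel, sort/group by key, reduce) can be sketched in plain Python. This is a minimal single-process simulation of the idea, not the Hadoop API; the word-count map and reduce functions are the standard illustrative example, and all names here are our own.

```python
from collections import defaultdict

def map_phase(chunk):
    # Map: emit a (word, 1) pair for each word in one input chunk.
    for word in chunk.split():
        yield (word, 1)

def reduce_phase(key, values):
    # Reduce: combine all values emitted under the same key.
    return key, sum(values)

def mapreduce(chunks):
    # Shuffle/sort: group map outputs by key, as the framework
    # does between the map and reduce phases.
    groups = defaultdict(list)
    for chunk in chunks:          # each chunk is processed independently
        for key, value in map_phase(chunk):
            groups[key].append(value)
    # Each reduce task then handles one group of same-key pairs.
    return dict(reduce_phase(k, v) for k, v in sorted(groups.items()))

result = mapreduce(["the cat", "the dog"])
# result == {"cat": 1, "dog": 1, "the": 2}
```

In the real framework the chunks would live on a distributed file system and the map and reduce calls would run on different machines; the data flow, however, is exactly this.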
The overall concept is simple:
1. Almost all data can be mapped into <key, value> pairs somehow, and
2. Your keys and values may be of any type: strings, integers, dummy types and, of course, pairs themselves.