MapReduce group by example

MapReduce Group By Example: Grouped Statistics of Airline On-time Performance Dataset

GitHub repo:

https://github.com/drweiwang/BigData/tree/master/grpstats

About the Dataset

The input data is the airline on-time performance dataset. We are only interested in the UniqueCarrier and ArrDelay columns of the dataset.

More information about the dataset can be found here: http://stat-computing.org/dataexpo/2009/

Find the maximum arrival delay grouped by airline
MapReduce strategy
Map Phase

The mapper parses the input line by line and extracts the UniqueCarrier and ArrDelay fields. For each record it writes a key-value pair, where the key is the UniqueCarrier and the value is the numeric delay as an IntWritable.
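The per-line parsing logic can be sketched in plain Java, without the Hadoop classes, so it runs on its own. This is only an illustration, not the code from the repository: the class and method names are hypothetical, the column positions (8 for UniqueCarrier, 14 for ArrDelay, 0-based) are assumed from the dataset's published column order, and the sample row is made up. In a real job the same logic would sit inside Mapper.map and emit a Text key with an IntWritable value.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for the map phase; not taken from the repository.
class MapPhaseSketch {
    // Assumed 0-based column positions in the comma-separated input.
    static final int UNIQUE_CARRIER = 8;
    static final int ARR_DELAY = 14;

    // Returns (carrier, delay) for a usable record, or empty for rows
    // that cannot be parsed (header line, short line, non-numeric delay).
    static Optional<Map.Entry<String, Integer>> map(String line) {
        String[] fields = line.split(",", -1);
        if (fields.length <= ARR_DELAY) {
            return Optional.empty();
        }
        String carrier = fields[UNIQUE_CARRIER].trim();
        String delay = fields[ARR_DELAY].trim();
        try {
            return Optional.of(new SimpleEntry<>(carrier, Integer.parseInt(delay)));
        } catch (NumberFormatException e) {
            // Header text or a missing-value marker such as "NA".
            return Optional.empty();
        }
    }

    public static void main(String[] args) {
        // Made-up row following the assumed column layout.
        String row = "2008,1,3,4,2003,1955,2211,2225,WN,335,N712SW,128,150,116,-14,8,IAD,TPA";
        System.out.println(map(row));
    }
}
```

Skipping unparseable rows, rather than failing, keeps the header line and records with a missing delay from aborting the whole job.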

Reduce Phase

The reducer iterates through the list of all delays associated with a key (one airline) and keeps track of the maximum value. Finally, the reducer writes that maximum out with the key (the airline code).
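The reduce step described above can be sketched the same way in plain Java. The class and method names are hypothetical and the delay values are invented for illustration. In a Hadoop job the values would arrive as an Iterable of IntWritable and the result would be written through the reducer's context.

```java
import java.util.Arrays;

// Hypothetical stand-in for the reduce phase; not taken from the repository.
class ReducePhaseSketch {
    // Scans every delay for one carrier and returns the largest.
    // The carrier key is unused here; a real reducer would write it out
    // together with the maximum.
    static int reduce(String carrier, Iterable<Integer> delays) {
        // Start from the smallest int, not 0, so that a carrier whose
        // flights all arrived early still reports its true maximum.
        int max = Integer.MIN_VALUE;
        for (int delay : delays) {
            max = Math.max(max, delay);
        }
        return max;
    }

    public static void main(String[] args) {
        // Invented delays (in minutes) for a single carrier.
        System.out.println("WN\t" + reduce("WN", Arrays.asList(-14, 34, 2)));
    }
}
```

Because taking a maximum is associative and commutative, the same logic could also serve as a combiner to cut the data shuffled between phases, assuming the job is configured to use one.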