MapReduce Group By Example: Grouped Statistics of Airline On-time Performance Dataset
GitHub repo:
https://github.com/drweiwang/BigData/tree/master/grpstats
About the Dataset
The input data is the airline on-time performance dataset. We are only interested in the UniqueCarrier and ArrDelay columns of the dataset.
More information about the dataset can be found here: http://stat-computing.org/dataexpo/2009/
Find the maximum arrival delay grouped by airline
MapReduce strategy
Map Phase
The mapper parses the input line by line and extracts the UniqueCarrier and ArrDelay fields. For each record it writes one key-value pair, where the key is the UniqueCarrier and the value is the numeric delay as an IntWritable.
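The per-line parsing step can be sketched in plain Java without any Hadoop dependency. This is a minimal illustration, not the code from the repository: the class name DelayLineParser is hypothetical, and the 0-based column positions (8 for UniqueCarrier, 14 for ArrDelay) are an assumption based on the published column order of the Data Expo 2009 CSV files. Skipping the header row and missing values such as "NA" is also an assumption about how unparseable delays should be handled.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.Optional;

// Plain-Java sketch of the map logic for a single input line (hypothetical helper).
final class DelayLineParser {
    // Assumed 0-based column positions in the Data Expo 2009 CSV layout.
    static final int UNIQUE_CARRIER_COL = 8;
    static final int ARR_DELAY_COL = 14;

    // Returns (carrier, delay) for a usable record, or empty for lines to skip.
    static Optional<Map.Entry<String, Integer>> parse(String line) {
        String[] fields = line.split(",", -1); // -1 keeps trailing empty fields
        if (fields.length <= ARR_DELAY_COL) {
            return Optional.empty(); // too few columns
        }
        String carrier = fields[UNIQUE_CARRIER_COL].trim();
        String delay = fields[ARR_DELAY_COL].trim();
        if (carrier.isEmpty()) {
            return Optional.empty();
        }
        try {
            return Optional.of(new SimpleEntry<>(carrier, Integer.parseInt(delay)));
        } catch (NumberFormatException e) {
            // Header row ("ArrDelay"), "NA", or an empty field: nothing to emit.
            return Optional.empty();
        }
    }
}
```

In an actual Hadoop Mapper, the returned pair would typically be wrapped as a Text key and an IntWritable value and passed to context.write inside the map method.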
Reduce Phase
The reducer iterates through the list of all delays associated with a key (one airline) and keeps track of the maximum value. Finally, the reducer writes the maximum value out together with the key (the airline code).
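The core of the reduce step is a running maximum over the delays of one airline. The following is a minimal plain-Java sketch under that reading; the class name MaxDelayReducer and the method maxDelay are hypothetical and stand in for the body of a Hadoop reduce method, where the values would arrive as an Iterable of IntWritable.

```java
// Plain-Java sketch of the reduce logic for one key (hypothetical helper).
final class MaxDelayReducer {
    // Returns the largest delay among all values grouped under one carrier.
    static int maxDelay(Iterable<Integer> delays) {
        // Start from the smallest int rather than 0: arrival delays can be
        // negative (early arrivals), so 0 would be a wrong initial maximum.
        int max = Integer.MIN_VALUE;
        for (int delay : delays) {
            if (delay > max) {
                max = delay;
            }
        }
        return max;
    }
}
```

Because the framework calls reduce only for keys that have at least one value, the Integer.MIN_VALUE starting point is always overwritten in practice.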