Hadoop MapReduce: string and math operations as output

I am building a MapReduce application that takes a .txt file of random numbers as input, and I want the output to look like this:

Max number: xx Arithmetic avg: xx Geometric avg: xx median: xx

My Code:

import java.io.IOException;
import java.util.StringTokenizer;
import java.util.*;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class NumCount {

  public static class TokenizerMapper
       extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text number = new Text();
    

    public void map(Object key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        number.set(itr.nextToken());
        context.write(number, one);
      }
    }
  }

  public static class IntSumReducer
       extends Reducer<Text,IntWritable,Text,IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values,
                       Context context
                       ) throws IOException, InterruptedException {

      List<Integer> numList = new ArrayList<Integer>();

      for (IntWritable val : values) {
         numList.add(val.get());
      }
    
      // Max number from file
      int maxNumber = Collections.max(numList,null);

      // Arithmetic average
      float sum = 0;
      for (int i : numList)
        sum += i;

      float arithmeticAverage = sum / numList.size();

      // Geometric average
      sum = 1;
      for (int i : numList)
        sum *= i;
      
      double geometricAverage = Math.pow(sum, (float)1/numList.size());

      // Median

      float median;
      
      if (numList.size() % 2 == 0)
         median = (float)(numList.get(numList.size()/2) + numList.get(numList.size()/2 - 1))/2;
      else
         median = numList.get(numList.size()/2);

      String summary = "Max number: " + maxNumber + "\nArithmetic avg: " + arithmeticAverage + "\nGeometric avg: " + geometricAverage + "\nMedian" + median;

      result.set(summary);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "number count");
    job.setJarByClass(NumCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The problem with my code is that I get an error saying I can't put a String into an IntWritable (which makes sense, but how can I write a string value as output?):

result.set(summary);

What is more, when I tried something like this:

result.set(median);

I didn't get the median. Instead, the output was the list of numbers from the input file, each with a "1" next to it.

I am completely new to Hadoop and have no idea how to do this properly. Any suggestions?

Answer:

Since summary is a String, use Text rather than IntWritable as the reducer's output value type. IntWritable holds a single int, so it cannot carry several values, some of which are not integers.

The logic also isn't correct. The mapper emits each number as the key, so all equal numbers end up in the same reduce call, and the values that call receives are only the 1s. "maxNumber" would therefore never be the overall max, for example, and you'd get one reducer output record per distinct input number. The solution is to use NullWritable as the mapper's output key (and so the reducer's input key) and emit the parsed number as the IntWritable value, which forces all numbers into one reduce call where they can be maxed, averaged, summed, and so on. You also don't need to loop over numList once per statistic, since the Iterable<IntWritable> can be iterated directly; one loop is enough for all the calculations except the median, for which you need to collect the numbers and sort them first.
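The calculation described above could look roughly like the sketch below. It is plain Java with no Hadoop dependency, so it only shows what the body of the single reduce call would compute; the class name NumberSummary, the Stats holder and the summarize method are made up for this illustration. It assumes the input is non-empty and contains only positive integers, and it computes the geometric mean from a sum of logarithms instead of a running product, which is a substitution to avoid overflow and not something taken from the question.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class NumberSummary {

    /** Hypothetical holder for the four statistics (not a Hadoop type). */
    public static final class Stats {
        public final int max;
        public final double arithmeticMean;
        public final double geometricMean;
        public final double median;

        Stats(int max, double arithmeticMean, double geometricMean, double median) {
            this.max = max;
            this.arithmeticMean = arithmeticMean;
            this.geometricMean = geometricMean;
            this.median = median;
        }

        /** Formats the statistics in the layout asked for in the question. */
        public String toText() {
            return "Max number: " + max
                + "\nArithmetic avg: " + arithmeticMean
                + "\nGeometric avg: " + geometricMean
                + "\nMedian: " + median;
        }
    }

    /** Assumes a non-empty sequence of positive integers. */
    public static Stats summarize(Iterable<Integer> values) {
        List<Integer> sorted = new ArrayList<>();
        int max = Integer.MIN_VALUE;
        long sum = 0;          // long, so a large input does not overflow the sum
        double logSum = 0.0;   // sum of ln(v), used for the geometric mean

        // One pass: max, sum and log-sum, plus a copy of the values for the median.
        for (int v : values) {
            sorted.add(v);
            max = Math.max(max, v);
            sum += v;
            logSum += Math.log(v);
        }
        if (sorted.isEmpty()) {
            throw new IllegalArgumentException("no values to summarize");
        }

        int n = sorted.size();
        double arithmeticMean = (double) sum / n;
        double geometricMean = Math.exp(logSum / n);

        // The median is the only statistic that needs the values in sorted order.
        Collections.sort(sorted);
        double median = (n % 2 == 0)
            ? (sorted.get(n / 2 - 1) + sorted.get(n / 2)) / 2.0
            : sorted.get(n / 2);

        return new Stats(max, arithmeticMean, geometricMean, median);
    }
}
```

In the actual reducer, the loop would read val.get() from each IntWritable (the copy is still needed for the median, because Hadoop's values Iterable can only be traversed once), and the result would be written with something like context.write(NullWritable.get(), new Text(stats.toText())). The driver would then also need job.setMapOutputKeyClass(NullWritable.class), job.setMapOutputValueClass(IntWritable.class) and job.setOutputValueClass(Text.class), and the job.setCombinerClass(...) line should go, since a combiner's output types must match the map output types.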

My personal suggestion would be to use Spark or Hive for statistical analysis rather than bare-bones MapReduce.