分布式lab3：词频统计

Posted on 2019-04-07 Edited on 2023-04-05 In distributed system Views:

使用hadoop写一次词频统计的demo。

具体的操作细节有大佬已经写好了wiki，有需要请移步，我这里只分析java代码细节。

全部代码

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  	private IntWritable result = new IntWritable();

  	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
  		int sum = 0;
  		for (IntWritable val : values) {
  			sum += val.get();
  		}
  		result.set(sum);
  		context.write(key, result);
  	}
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Map操作

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable>{

  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

自己定义一个类，这里取名为TokenizerMapper。这个类继承自Mapper类，继承时要指定4个泛型，分别表示：

输入键key的参数类型
输入值value的参数类型
输出键key的参数类型
输出值value的参数类型

比如这里，输入的是文本信息，输入的value是文本中的一行文字，类型为Text，而输入的key表示该行首字母相对于文本文件首地址的偏移量，类型为java最大的类Object。而输出的信息是词频信息，key表示一个单词，类型为Text，value表示其词频，类型为IntWritabe。

第5行定义了一个常量对象one，就表示数量1，后面出现一个单词就记录出现1次。

第6行开始实现了map方法，参数有输入信息的key和value，还有上下文context。

第7行new了一个StringTokenizer类的实例itr，在构造对象时，就把value的字符创按分隔符分成了一个个单词。

itr.hasMoreTokens()表示是否后面还有单词
itr.nextToken()表示下一个单词

context.write(word, one);表示将这个单词的词频记为1，写入context用以记录。

Reduce操作

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
	private IntWritable result = new IntWritable();

	public void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
		int sum = 0;
		for (IntWritable val : values) {
			sum += val.get();
		}
		result.set(sum);
		context.write(key, result);
	}
}

自己定义一个类，取名为IntSumReducer，继承自Reduce类，同样指定4个泛型，分别表示：

输入键key的参数类型
输入值value的参数类型
输出键key的参数类型
输出值value的参数类型

在这里输入是Map操作的输出，即词频信息，而输出是整合好的词频信息。

第4行实现reduce方法，参数key是单词，Iterable<IntWritable> values是一个可迭代的集合，表示这个单词所有的词频信息，context是上下文。

5到8行遍历统计词频，用sum来计数，最后在第10行写入到这个key的value。

main函数

public static void main(String[] args) throws Exception {
  Configuration conf = new Configuration();
  Job job = Job.getInstance(conf, "word count");
  job.setJarByClass(WordCount.class);
  job.setMapperClass(TokenizerMapper.class);
  job.setCombinerClass(IntSumReducer.class);
  job.setReducerClass(IntSumReducer.class);
  job.setOutputKeyClass(Text.class);
  job.setOutputValueClass(IntWritable.class);
  FileInputFormat.addInputPath(job, new Path(args[0]));
  FileOutputFormat.setOutputPath(job, new Path(args[1]));
  System.exit(job.waitForCompletion(true) ? 0 : 1);
}

Configuration conf = new Configuration();从Hadoop的配置文件里读取参数；
job.setJarByClass(WordCount.class);根据WordCount类的位置设置Jar文件；
job.setMapperClass(TokenizerMapper.class); 设置Mapper ；
job.setCombinerClass(IntSumReducer.class);这句代码要提一下，因为如果所有的slave都分别做map操作，然后把信息全部返回给master节点，导致master节点负载很大，也会加大网络通信量。所以这个combiner操作相当于slave节点上自己先做一次reduce操作，再把信息传给master节点reduce，有助于提高性能；
job.setReducerClass(IntSumReducer.class);设置Reduce；
job.setOutputKeyClass(Text.class);和job.setOutputValueClass(IntWritable.class); 分别设置输出key的类型和value的类型；
FileInputFormat.addInputPath(job, new Path(args[0]));设置输入文件，它是args第一个参数；
FileOutputFormat.setOutputPath(job, new Path(args[1]));设置输出文件，将输出结果写入这个文件里，它是args第二个参数 ;
System.exit(job.waitForCompletion(true) ? 0 : 1);等待执行结果，成功执行就退出码设置为0，否则为1。