MapReduce is a programming model for parallel operations on large data sets (typically larger than 1 TB). Its central ideas, "Map" and "Reduce", are borrowed from functional programming languages and from vector programming languages. The model makes it much easier for programmers to run their own code on a distributed system without writing distributed parallel programs themselves. In the usual implementation, a Map function transforms a set of key-value pairs into a new set of key-value pairs, and a concurrent Reduce function merges all intermediate values that share the same key.

1. MapReduce is a distributed computing model proposed by Google. It is used mainly in the search field to handle computation over massive data sets.
2. A MapReduce (MR) job has two phases, Map and Reduce. Users only need to implement the map() and reduce() functions to obtain distributed computation.

The Map phase:

1.1 Read the input file from HDFS. Each line is parsed into a <k, v> pair, and the map function is called once for each pair, e.g. <0, hello you>, <10, hello me>.
1.2 Override map(): receive the <k, v> produced in 1.1, process it, and emit new <k, v> pairs, e.g. <hello, 1>, <you, 1>, <hello, 1>, <me, 1>.
1.3 Partition the <k, v> output of 1.2. By default there is a single partition. See "Partitioner" for details.
1.4 Sort the data within each partition (by key) and group it. Grouping means collecting the values of the same key into one collection. After sorting: <hello, 1>, <hello, 1>, <me, 1>, <you, 1>. After grouping: <hello, {1, 1}>, <me, {1}>, <you, {1}>.
1.5 (Optional) Reduce the grouped data locally. See "Combiner" for details.

The Reduce phase:

2.1 The output of the map tasks is copied over the network to different reduce nodes according to its partition (the shuffle). See "shuffle process analysis" for details.
2.2 Merge and sort the outputs of the various maps. Override the reduce function: it receives the grouped data, applies the job's own business logic, and produces new <k, v> output, e.g. <hello, 2>, <me, 1>, <you, 1>.
2.3 Write the <k, v> output of reduce to HDFS.

Java code implementation (note: org.apache.hadoop.fs.FileUtil needs to be imported where it is used):

1. Create a hello file and upload it to HDFS (Figure 3).
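For step 1, the hello file can also be created directly on HDFS from Java instead of uploading a local file with hadoop fs -put. The sketch below is only an illustration and is not part of the original walkthrough: it assumes the same hdfs://chaoren:9000 address used by WordCountApp in step 2, and the class name UploadHello is made up here.

package mapreduce;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadHello {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Connect to the same HDFS instance that the WordCount job reads from (assumed address).
        FileSystem fs = FileSystem.get(new URI("hdfs://chaoren:9000/"), conf);
        // Write the two sample lines used in the walkthrough above into /hello.
        FSDataOutputStream out = fs.create(new Path("/hello"));
        out.write("hello you\nhello me\n".getBytes("UTF-8"));
        out.close();
        fs.close();
    }
}

Running this once produces the same input that uploading a local hello file would.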
2. Write the code that counts the number of occurrences of each word in the file (the commented-out lines in the code are optional settings; they may be kept or omitted).

package mapreduce;

import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

public class WordCountApp {
    static final String INPUT_PATH = "hdfs://chaoren:9000/hello";
    static final String OUT_PATH = "hdfs://chaoren:9000/out";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(new URI(INPUT_PATH), conf);
        Path outPath = new Path(OUT_PATH);
        // Delete the output directory if it already exists, otherwise the job fails.
        if (fileSystem.exists(outPath)) {
            fileSystem.delete(outPath, true);
        }

        Job job = new Job(conf, WordCountApp.class.getSimpleName());

        // 1.1 Specify where the input file is read from.
        FileInputFormat.setInputPaths(job, INPUT_PATH);
        // Specify how the input file is formatted, i.e. how each line is parsed into a key-value pair.
        //job.setInputFormatClass(TextInputFormat.class);

        // 1.2 Specify the custom map class.
        job.setMapperClass(MyMapper.class);
        // The <k2, v2> types of the map output; these can be omitted when they are the same as the <k3, v3> types.
        //job.setMapOutputKeyClass(Text.class);
        //job.setMapOutputValueClass(LongWritable.class);

        // 1.3 Partitioning.
        //job.setPartitionerClass(org.apache.hadoop.mapreduce.lib.partition.HashPartitioner.class);
        // There is one reduce task running.
        //job.setNumReduceTasks(1);

        // 1.4 Sorting and grouping.

        // 1.5 Local reduction (combiner).

        // 2.2 Specify the custom reduce class.
        job.setReducerClass(MyReducer.class);
        // Specify the output types of reduce.
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        // 2.3 Specify where the output is written.
        FileOutputFormat.setOutputPath(job, outPath);
        // Specify the formatting class of the output file.
        //job.setOutputFormatClass(TextOutputFormat.class);

        // Submit the job to the JobTracker and wait for it to finish.
        job.waitForCompletion(true);
    }

    /**
     * KEYIN,    i.e. k1, is the byte offset of the line.
     * VALUEIN,  i.e. v1, is the text content of the line.
     * KEYOUT,   i.e. k2, is a word appearing in the line.
     * VALUEOUT, i.e. v2, is the number of times the word appears in the line, fixed at 1.
     */
    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        protected void map(LongWritable k1, Text v1, Context context)
                throws java.io.IOException, InterruptedException {
            // Split the line into words (the sample hello file is space-separated).
            String[] splited = v1.toString().split(" ");
            for (String word : splited) {
                context.write(new Text(word), new LongWritable(1));
            }
        }
    }

    /**
     * KEYIN,    i.e. k2, is a word appearing in the line.
     * VALUEIN,  i.e. v2, is a count emitted for that word.
     * KEYOUT,   i.e. k3, is a distinct word.
     * VALUEOUT, i.e. v3, is the total number of times that word appears.
     */
    static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        protected void reduce(Text k2, Iterable<LongWritable> v2s, Context ctx)
                throws java.io.IOException, InterruptedException {
            long times = 0L;
            for (LongWritable count : v2s) {
                times += count.get();
            }
            ctx.write(k2, new LongWritable(times));
        }
    }
}

3. After the job runs successfully, the results can be viewed in Linux (Figure 4).

1) Data partitioning and computing task scheduling: the system automatically splits the large data set handled by a job into data blocks, each block corresponding to one computing task (Task), and automatically schedules compute nodes to process the corresponding blocks. The job and task scheduling functions are responsible for allocating and scheduling compute nodes (Map nodes or Reduce nodes), for monitoring the execution status of those nodes, and for the synchronization control of the Map nodes.
2) Data/code co-location: to reduce data communication, a basic principle is localized data processing, i.e. a compute node processes the data stored on its own local disk whenever possible, which amounts to migrating code to the data. When such localized processing is not possible, other available nodes are sought and the data is transferred over the network to that node (migrating data to the code), preferably choosing an available node in the same rack as the data to reduce communication delay.
3) System optimization: to reduce data communication overhead, intermediate results are merged before being passed to the Reduce nodes; the data processed by one Reduce node may come from multiple Map nodes.
To avoid data dependencies in the Reduce computation phase, the intermediate results output by the Map nodes must be partitioned with an appropriate strategy so that correlated data is sent to the same Reduce node. In addition, the system applies some performance optimizations, for example running multiple backup copies of the slowest tasks and taking the result of whichever copy finishes first.
4) Error detection and recovery: in a large-scale MapReduce cluster built from low-end commodity servers, node hardware failures (host, disk, memory, etc.) and software errors are the norm, so MapReduce must be able to detect and isolate faulty nodes and schedule new nodes to take over the computing tasks of failed ones. At the same time, the system maintains the reliability of data storage, using multi-backup redundant storage to improve reliability and detecting and recovering erroneous data in a timely manner.
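Putting steps 1.3 and 1.5 and the optimization described in point 3 into code: the WordCount job above could register a combiner (so that map output is merged locally before the shuffle) and an explicit partitioner (so that related keys always reach the same reduce node). The following is only a hedged sketch: WordCountApp and MyReducer are the classes from the code above, while WordCountTuning and FirstLetterPartitioner are hypothetical names introduced here for illustration.

package mapreduce;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Partitioner;

public class WordCountTuning {

    // Hypothetical partitioner: words beginning with 'a'..'m' go to partition 0,
    // everything else to partition 1, so each key is always handled by the same reduce node.
    public static class FirstLetterPartitioner extends Partitioner<Text, LongWritable> {
        @Override
        public int getPartition(Text key, LongWritable value, int numPartitions) {
            if (numPartitions <= 1 || key.toString().isEmpty()) {
                return 0;
            }
            char first = Character.toLowerCase(key.toString().charAt(0));
            return first <= 'm' ? 0 : 1;
        }
    }

    // Would be called in WordCountApp.main() after the Job object has been created.
    public static void configure(Job job) {
        // 1.5 / point 3: merge map output on the map side before it crosses the network.
        // For WordCount the reducer itself can serve as the combiner, because summing counts is associative.
        job.setCombinerClass(WordCountApp.MyReducer.class);
        // 1.3: route keys to reduce tasks with the custom rule instead of the default HashPartitioner.
        job.setPartitionerClass(FirstLetterPartitioner.class);
        job.setNumReduceTasks(2);
    }
}

With two reduce tasks, the output directory would contain two result files (part-r-00000 and part-r-00001) instead of one.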