Hive Join

Hive Join

Hive中的Join的用法

創建join示例所使用的表。

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
-- 創建table a
hive> create table a(
id int,
name string
)row format delimited fields terminated by '\t';
OK
-- 從本地文件載入資料
hive> load data local inpath '/home/hadoop/data/a_join.txt' into table a;
Loading data to table default.a
Table default.a stats: [numFiles=1, totalSize=26]
OK
Time taken: 0.293 seconds
hive> select * from a;
OK
1 zhangsa
2 lisi
3 wangwu
Time taken: 0.055 seconds, Fetched: 3 row(s)
-- 創建table b
hive> create table b(
id int,
age int
) row format delimited fields terminated by '\t';
OK
-- 從本地文件載入資料
hive> load data local inpath '/home/hadoop/data/b_join.txt' into table b;
Loading data to table default.b
Table default.b stats: [numFiles=1, totalSize=15]
OK
Time taken: 0.241 seconds
hive> select * from b;
OK
1 28
2 19
4 21
Time taken: 0.089 seconds, Fetched: 3 row(s)

Inner Join

  • 只返回能關聯的紀錄。
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    hive> select a.id, a.name, b.age from a join b on a.id=b.id;
    Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36
    Total jobs = 1
    17/09/16 17:37:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Execution log at: /opt/software/hive/log/hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36.log
    2017-09-16 05:37:53 Starting to launch local task to process map join; maximum memory = 518979584
    2017-09-16 05:37:55 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-37-49_384_3178169469256003866-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable
    2017-09-16 05:37:55 Uploaded 1 File to: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-37-49_384_3178169469256003866-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (320 bytes)
    2017-09-16 05:37:55 End of local task; Time Taken: 2.002 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since theres no reduce operator
    Job running in-process (local Hadoop)
    2017-09-16 17:37:56,968 Stage-3 map = 100%, reduce = 0%
    Ended Job = job_local1965457212_0036
    MapReduce Jobs Launched:
    Stage-Stage-3: HDFS Read: 55888 HDFS Write: 10506 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1 zhangsa 28
    2 lisi 19
    Time taken: 7.589 seconds, Fetched: 2 row(s)

Left Join

  • 使用左表作為主表,返回紀錄數與左表相同。
  • 關聯不上的使用NULL表示。
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    hive> select a.id, a.name, b.age from a left join b on a.id=b.id;
    Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36
    Total jobs = 1
    17/09/16 17:39:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Execution log at: /opt/software/hive/log/hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36.log
    2017-09-16 05:39:47 Starting to launch local task to process map join; maximum memory = 518979584
    2017-09-16 05:39:48 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-39-43_925_9210396379212280287-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable
    2017-09-16 05:39:48 Uploaded 1 File to: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-39-43_925_9210396379212280287-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (320 bytes)
    2017-09-16 05:39:48 End of local task; Time Taken: 0.912 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since theres no reduce operator
    Job running in-process (local Hadoop)
    2017-09-16 17:39:50,005 Stage-3 map = 100%, reduce = 0%
    Ended Job = job_local296218704_0037
    MapReduce Jobs Launched:
    Stage-Stage-3: HDFS Read: 55914 HDFS Write: 10506 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1 zhangsa 28
    2 lisi 19
    3 wangwu NULL
    Time taken: 6.09 seconds, Fetched: 3 row(s)

Right Join

  • 使用右表作為主表,返回紀錄數與右表相同。
  • 關聯不上的使用NULL表示。
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    hive> select a.id, a.name, b.age from a right join b on a.id=b.id;
    Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36
    Total jobs = 1
    17/09/16 17:40:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Execution log at: /opt/software/hive/log/hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36.log
    2017-09-16 05:40:41 Starting to launch local task to process map join; maximum memory = 518979584
    2017-09-16 05:40:42 Dump the side-table for tag: 0 with group count: 3 into file: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-40-37_662_8266497628801648044-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile30--.hashtable
    2017-09-16 05:40:42 Uploaded 1 File to: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-40-37_662_8266497628801648044-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile30--.hashtable (337 bytes)
    2017-09-16 05:40:42 End of local task; Time Taken: 1.125 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since theres no reduce operator
    Job running in-process (local Hadoop)
    2017-09-16 17:40:43,702 Stage-3 map = 100%, reduce = 0%
    Ended Job = job_local2139900301_0038
    MapReduce Jobs Launched:
    Stage-Stage-3: HDFS Read: 55929 HDFS Write: 10506 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1 zhangsa 28
    2 lisi 19
    NULL NULL 21
    Time taken: 6.042 seconds, Fetched: 3 row(s)
    -- 最後一列的id為null而不是為4,是因為select語句中,是取用table a的id,而不是table b的id。

Full Join

  • 以兩個表的紀錄為基準,返回兩個表的紀錄去重之和。
  • 關聯不上的使用NULL表示。
  • 在此情況下,Hive不使用Map Join來優化。
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    hive> select a.id, a.name, b.age from a full join b on a.id=b.id;
    Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36
    Total jobs = 1
    Launching Job 1 out of 1
    Number of reduce tasks not specified. Estimated from input data size: 1
    In order to change the average load for a reducer (in bytes):
    set hive.exec.reducers.bytes.per.reducer=<number>
    In order to limit the maximum number of reducers:
    set hive.exec.reducers.max=<number>
    In order to set a constant number of reducers:
    set mapreduce.job.reduces=<number>
    Job running in-process (local Hadoop)
    2017-09-16 17:42:15,465 Stage-1 map = 0%, reduce = 0%
    2017-09-16 17:42:16,467 Stage-1 map = 100%, reduce = 100%
    Ended Job = job_local1292867757_0039
    MapReduce Jobs Launched:
    Stage-Stage-1: HDFS Read: 167895 HDFS Write: 31518 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1 zhangsa 28
    2 lisi 19
    3 wangwu NULL
    NULL NULL 21
    Time taken: 2.285 seconds, Fetched: 4 row(s)

Cross Join

  • 返回兩個表的笛卡兒積(Cartesian product)結果
  • 不需指定關聯鍵,及不需使用on關鍵字。
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    hive> select a.id, a.name, b.age from a cross join b;
    Warning: Map Join MAPJOIN[7][bigTable=a] in task 'Stage-3:MAPRED' is a cross product
    Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36
    Total jobs = 1
    17/09/16 17:44:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    Execution log at: /opt/software/hive/log/hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36.log
    2017-09-16 05:44:23 Starting to launch local task to process map join; maximum memory = 518979584
    2017-09-16 05:44:23 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-44-19_720_6308228100696237254-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile51--.hashtable
    2017-09-16 05:44:24 Uploaded 1 File to: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-44-19_720_6308228100696237254-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile51--.hashtable (290 bytes)
    2017-09-16 05:44:24 End of local task; Time Taken: 0.961 sec.
    Execution completed successfully
    MapredLocal task succeeded
    Launching Job 1 out of 1
    Number of reduce tasks is set to 0 since theres no reduce operator
    Job running in-process (local Hadoop)
    2017-09-16 17:44:25,491 Stage-3 map = 100%, reduce = 0%
    Ended Job = job_local1992175647_0041
    MapReduce Jobs Launched:
    Stage-Stage-3: HDFS Read: 56022 HDFS Write: 10506 SUCCESS
    Total MapReduce CPU Time Spent: 0 msec
    OK
    1 zhangsa 28
    1 zhangsa 19
    1 zhangsa 21
    2 lisi 28
    2 lisi 19
    2 lisi 21
    3 wangwu 28
    3 wangwu 19
    3 wangwu 21
    Time taken: 5.781 seconds, Fetched: 9 row(s)

Hive join

  • Hive中的join,分成下列兩種:
    • Common join: 又稱Shuffle join、Reduce join
    • Map join: 又稱Broadcast join
  • hive-site.xml相關屬性:
Property Description Default Value
hive.auto.convert.join 是否轉換為Map Join false in 0.7.0 to 0.10.0; true in 0.11.0 and later
hive.mapjoin.smalltable.filesize 判斷為小表的上限依據 25MB
hive.auto.convert.join.noconditionaltask 是否將多個Map join合併為一個 true
hive.auto.convert.join.noconditionaltask.size 多個Map join轉換為1個時,所有小表文件大小總和的最大值 10MB

Common Join

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
-- 此屬性預設為true,若要啟動Common Join需先設定為false
hive> set hive.auto.convert.join=false;
hive> set hive.auto.convert.join;
hive.auto.convert.join=false
-- 顯示查詢語句的執行計畫
hive> explain select a.id, a.name, b.age from a join b on a.id=b.id;
OK
STAGE DEPENDENCIES:
Stage-1 is a root stage
Stage-0 depends on stages: Stage-1
STAGE PLANS:
Stage: Stage-1
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
value expressions: name (type: string)
TableScan
alias: b
Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
value expressions: age (type: int)
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col1, _col6
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col6 (type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.138 seconds, Fetched: 59 row(s)
執行計畫說明

上面為Common Join的執行計畫,下面將分段解釋所代表意思。

  • 共生成2個Stage,且需要先執行完Stage-1,才能執行Stage-0。
  • Stage-1區段中的Mapreduce關鍵字說明此為一個MapReduece Job,分別描述在Map Operator Tree與Reduce Operator Tree區段。
1
2
3
4
5
6
7
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
  • Map Operator Tree
    • TableScan: 進行讀表操作。
    • alias: 目標表的名稱。
    • Statistics: 讀入數據的相關資訊。
    • Filter Operator: 雖然SQL語句中不包含Where,但是join語句中隱藏一個過濾條件(on關鍵字)。
      • predicate: join條件。
1
2
3
4
5
6
Reduce Output Operator
key expressions: id (type: int)
sort order: +
Map-reduce partition columns: id (type: int)
Statistics: Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
value expressions: name (type: string)
  • Reduce Output Operator
    • 及為Mapper的輸出。
    • key expressions: 對應join語句中on的對象。
    • value expressions:value是一系列的字段,對應select語句。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
Reduce Operator Tree:
Join Operator
condition map:
Inner Join 0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col1, _col6
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col6 (type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
  • Reduce Operator
    • Inner Join 0 to 1: 此為一個inner join操作。
    • keys: join操作時,各表所使用的的key。
    • Select Operator: reduce結果輸出的相關訊息。
      • compressed: 是否使用壓縮。
      • Statistics: 輸出數據的相關統計訊息。
      • input/output format: 輸出入檔案的格式。
      • serde: 序列化物件。
小結

上面為Common Join的流程圖以及執行計畫說明,在MapReduce的執行common join過程中,需要Shuffle進行重新洗牌的操作,也就是將key進行Hash處理後相同的值,放置在一塊,作為Reduce的輸入。這個過程中如果傳輸的量很大會造成一定的網路資源開銷,且可能造成OOM的問題,導致計算任務失敗。
若Shuffle是計算瓶頸所在,而Shuffle是為Reduce的前置作業,如何去除掉Shuffle,又能達到Reduce的結果。故Map Join就是用來解決此問題。

Map Join

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
-- 將此屬性設為true,啟用Map Join
hive> set hive.auto.convert.join=true;
hive> set hive.auto.convert.join;
hive.auto.convert.join=true
hive> explain select a.id, a.name, b.age from a join b on a.id=b.id;
OK
STAGE DEPENDENCIES:
Stage-4 is a root stage
Stage-3 depends on stages: Stage-4
Stage-0 depends on stages: Stage-3
STAGE PLANS:
Stage: Stage-4
Map Reduce Local Work
Alias -> Map Local Tables:
b
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
b
TableScan
alias: b
Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 id (type: int)
1 id (type: int)
Stage: Stage-3
Map Reduce
Map Operator Tree:
TableScan
alias: a
Statistics: Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 26 Basic stats: COMPLETE Column stats: NONE
Map Join Operator
condition map:
Inner Join 0 to 1
keys:
0 id (type: int)
1 id (type: int)
outputColumnNames: _col0, _col1, _col6
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
Select Operator
expressions: _col0 (type: int), _col1 (type: string), _col6 (type: int)
outputColumnNames: _col0, _col1, _col2
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
File Output Operator
compressed: false
Statistics: Num rows: 1 Data size: 28 Basic stats: COMPLETE Column stats: NONE
table:
input format: org.apache.hadoop.mapred.TextInputFormat
output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Local Work:
Map Reduce Local Work
Stage: Stage-0
Fetch Operator
limit: -1
Processor Tree:
ListSink
Time taken: 0.16 seconds, Fetched: 62 row(s)
執行計畫說明

上面為Map Join的執行計畫,下面將分段解釋所代表意思。

  • 此執行計畫共分成3個Stage,依次為Stage-4、Stage-3、Stage-0,且由後往前為依賴關係,需要先執行完Stage-4,才能執行Stage-3,最後執行Stage-0。
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
Map Reduce Local Work
Alias -> Map Local Tables:
b
Fetch Operator
limit: -1
Alias -> Map Local Operator Tree:
b
TableScan
alias: b
Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
Filter Operator
predicate: id is not null (type: boolean)
Statistics: Num rows: 1 Data size: 15 Basic stats: COMPLETE Column stats: NONE
HashTable Sink Operator
keys:
0 id (type: int)
  • Map Reduce Local Work: 啟動一個MapReduce Local Task。對應圖中的MapReduce Local Task方框。
  • TableScan
    • alias: b: 根據MetaData中的統計訊息自動決定b表為小表,並且進行讀入數據。
    • HashTable Sink Operator: 輸出HashTable並存到HDFS上,對應圖中的HashTable File。
1
2
3
Stage: Stage-3
Map Reduce
Map Operator Tree:
  • Stage: Stage-3: 為一個MapReduce Job,但其只包含Map Task,並將計算結果輸出至HDFS上。
小結
  • 官方文檔 JoinOptimization
  • 上圖為Map Join的流程圖以及執行計畫,Map Join會將小表全部讀入記憶體中,並產生HashTable檔案,接著將HashTable加載到分布式緩存中。在map階段,直接拿HashTable與大表中的紀錄做匹配。
  • 由於在map中進行了join的操作,省去了shuffle與reduce的過程,執行效率提高很多,並且省去Shuffle階段要進行的大量資料傳輸,達成優化作業的作用。
  • 使用Map Join時要注意小表的大小,因為會將小表全部寫入分布式記憶體中,故當所需記憶體過大時,會發生OOM。而小表的上限判定值,可以根據屬性來設定。
  • 要使Map Join能夠順利進行,必須滿足: 除了一份表的資料分佈在不同的Map中之外,其他連接的表的資料必須在每個Map中有完整的複製。

Hive log

  • 查看執行Common join跟Map Join對應的Hive Log。
  • Hive Log的位置在$HIVE_HOME/conf/hive-log4j.properties中的以下參數設定:
    • hive.log.dir=${java.io.tmpdir}/${user.name}
    • hive.log.file=hive.log

Common Join Log

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
2017-09-23 18:01:03,204 INFO [main]: ql.Driver (Driver.java:compile(411)) - Compiling command(queryId=hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56): select a.id, a.name, b.age from a join b on a.id=b.id
2017-09-23 18:01:03,334 INFO [main]: ql.Driver (Driver.java:compile(463)) - Semantic Analysis Completed
2017-09-23 18:01:03,334 INFO [main]: ql.Driver (Driver.java:getSchema(245)) - Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a.id, type:int, comment:null), FieldSchema(name:a.name, type:string, comment:null), FieldSchema(name:b.age, type:int, comment:null)], properties:null)
2017-09-23 18:01:03,345 INFO [main]: ql.Driver (Driver.java:compile(541)) - Completed compiling command(queryId=hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56); Time taken: 0.141 seconds
2017-09-23 18:01:03,346 INFO [main]: ql.Driver (Driver.java:checkConcurrency(165)) - Concurrency mode is disabled, not creating a lock manager
2017-09-23 18:01:03,346 INFO [main]: ql.Driver (Driver.java:execute(1448)) - Executing command(queryId=hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56): select a.id, a.name, b.age from a join b on a.id=b.id
2017-09-23 18:01:03,346 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Query ID = hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56
2017-09-23 18:01:03,347 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Total jobs = 1
2017-09-23 18:01:03,347 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Launching Job 1 out of 1
2017-09-23 18:01:03,351 INFO [main]: ql.Driver (Driver.java:launchTask(1772)) - Starting task [Stage-1:MAPRED] in serial mode
2017-09-23 18:01:03,354 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Number of reduce tasks not specified. Estimated from input data size: 1
2017-09-23 18:01:03,354 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - In order to change the average load for a reducer (in bytes):
2017-09-23 18:01:03,355 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - set hive.exec.reducers.bytes.per.reducer=<number>
2017-09-23 18:01:03,355 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - In order to limit the maximum number of reducers:
2017-09-23 18:01:03,355 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - set hive.exec.reducers.max=<number>
2017-09-23 18:01:03,355 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - In order to set a constant number of reducers:
2017-09-23 18:01:03,356 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - set mapreduce.job.reduces=<number>
2017-09-23 18:01:03,357 INFO [main]: mr.ExecDriver (ExecDriver.java:execute(286)) - Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
2017-09-23 18:01:03,357 INFO [main]: mr.ExecDriver (ExecDriver.java:execute(308)) - adding libjars: file:///opt/software/hive/auxlib/com.hive.demo-1.0.jar
2017-09-23 18:01:03,382 ERROR [main]: mr.ExecDriver (ExecDriver.java:execute(398)) - local
2017-09-23 18:01:03,399 WARN [main]: mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(64)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-09-23 18:01:03,741 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Job running in-process (local Hadoop)
2017-09-23 18:01:04,757 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - 2017-09-23 18:01:04,757 Stage-1 map = 100%, reduce = 100%
2017-09-23 18:01:04,770 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Ended Job = job_local2008589424_0003
2017-09-23 18:01:04,776 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - MapReduce Jobs Launched:
2017-09-23 18:01:04,777 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Stage-Stage-1: HDFS Read: 264 HDFS Write: 0 SUCCESS
2017-09-23 18:01:04,777 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Total MapReduce CPU Time Spent: 0 msec
2017-09-23 18:01:04,777 INFO [main]: ql.Driver (Driver.java:execute(1696)) - Completed executing command(queryId=hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56); Time taken: 1.43 seconds
2017-09-23 18:01:04,777 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - OK
Common Join Log說明:
  • 2017-09-23 18:01:04,757 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - 2017-09-23 18:01:04,757 Stage-1 map = 100%, reduce = 100%
    • 需要進行Map與Reduce,也就是需要進行Shuffle操作。

Map Join Log

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
2017-09-23 18:00:04,948 INFO [main]: ql.Driver (Driver.java:compile(411)) - Compiling command(queryId=hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56): select a.id, a.name, b.age from a join b on a.id=b.id
2017-09-23 18:00:05,100 INFO [main]: ql.Driver (Driver.java:compile(463)) - Semantic Analysis Completed
2017-09-23 18:00:05,102 INFO [main]: ql.Driver (Driver.java:getSchema(245)) - Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:a.id, type:int, comment:null), FieldSchema(name:a.name, type:string, comment:null), FieldSchema(name:b.age, type:int, comment:null)], properties:null)
2017-09-23 18:00:05,112 INFO [main]: ql.Driver (Driver.java:compile(541)) - Completed compiling command(queryId=hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56); Time taken: 0.164 seconds
2017-09-23 18:00:05,112 INFO [main]: ql.Driver (Driver.java:checkConcurrency(165)) - Concurrency mode is disabled, not creating a lock manager
2017-09-23 18:00:05,112 INFO [main]: ql.Driver (Driver.java:execute(1448)) - Executing command(queryId=hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56): select a.id, a.name, b.age from a join b on a.id=b.id
2017-09-23 18:00:05,113 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Query ID = hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56
2017-09-23 18:00:05,120 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Total jobs = 1
2017-09-23 18:00:05,122 INFO [main]: ql.Driver (Driver.java:launchTask(1772)) - Starting task [Stage-4:MAPREDLOCAL] in serial mode
2017-09-23 18:00:05,123 INFO [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(159)) - Generating plan file file:/tmp/hadoop/6085132e-5e5c-465b-8e67-dd6b58d6518d/hive_2017-09-23_18-00-04_956_7187995778853783363-1/-local-10005/plan.xml
2017-09-23 18:00:05,278 INFO [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(288)) - Executing: /opt/software/hadoop/bin/hadoop jar /opt/software/hive-1.1.0-cdh5.7.0/lib/hive-exec-1.1.0-cdh5.7.0.jar org.apache.hadoop.hive.ql.exec.mr.ExecDriver -localtask -plan file:/tmp/hadoop/6085132e-5e5c-465b-8e67-dd6b58d6518d/hive_2017-09-23_18-00-04_956_7187995778853783363-1/-local-10005/plan.xml -jobconffile file:/tmp/hadoop/6085132e-5e5c-465b-8e67-dd6b58d6518d/hive_2017-09-23_18-00-04_956_7187995778853783363-1/-local-10006/jobconf.xml
2017-09-23 18:00:10,522 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Execution completed successfully
2017-09-23 18:00:10,522 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - MapredLocal task succeeded
2017-09-23 18:00:10,522 INFO [main]: mr.MapredLocalTask (MapredLocalTask.java:executeInChildVM(313)) - Execution completed successfully
2017-09-23 18:00:10,523 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Launching Job 1 out of 1
2017-09-23 18:00:10,526 INFO [main]: ql.Driver (Driver.java:launchTask(1772)) - Starting task [Stage-3:MAPRED] in serial mode
2017-09-23 18:00:10,526 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Number of reduce tasks is set to 0 since theres no reduce operator
2017-09-23 18:00:10,536 INFO [main]: mr.ExecDriver (ExecDriver.java:execute(286)) - Using org.apache.hadoop.hive.ql.io.CombineHiveInputFormat
2017-09-23 18:00:10,537 INFO [main]: mr.ExecDriver (ExecDriver.java:execute(308)) - adding libjars: file:///opt/software/hive/auxlib/com.hive.demo-1.0.jar
2017-09-23 18:00:10,545 ERROR [main]: mr.ExecDriver (ExecDriver.java:execute(398)) - local
2017-09-23 18:00:10,564 WARN [main]: mapreduce.JobResourceUploader (JobResourceUploader.java:uploadFiles(64)) - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2017-09-23 18:00:10,963 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Job running in-process (local Hadoop)
2017-09-23 18:00:11,969 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - 2017-09-23 18:00:11,969 Stage-3 map = 100%, reduce = 0%
2017-09-23 18:00:11,977 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Ended Job = job_local1147667645_0002
2017-09-23 18:00:11,981 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - MapReduce Jobs Launched:
2017-09-23 18:00:11,981 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Stage-Stage-3: HDFS Read: 52 HDFS Write: 0 SUCCESS
2017-09-23 18:00:11,981 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - Total MapReduce CPU Time Spent: 0 msec
2017-09-23 18:00:11,981 INFO [main]: ql.Driver (Driver.java:execute(1696)) - Completed executing command(queryId=hadoop_20170923175858_4f8377cd-a648-4c9e-8190-b0f7d4326f56); Time taken: 6.869 seconds
2017-09-23 18:00:11,982 INFO [main]: ql.Driver (SessionState.java:printInfo(927)) - OK
Map Join Log說明:
  • 2017-09-23 18:00:05,122 INFO [main]: ql.Driver (Driver.java:launchTask(1772)) - Starting task [Stage-4:MAPREDLOCAL] in serial mode

    • 此為一個MapReduce Local Task。對應Map Join圖中的MapReduce Local Task方框。
  • 2017-09-23 18:00:10,526 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Number of reduce tasks is set to 0 since theres no reduce operator

    • 此操作沒有Reduce Task