Hive Join
Hive中的Join的用法
創建join示例所使用的表。
|
|
Inner Join
- 只返回能關聯的紀錄。1234567891011121314151617181920212223hive> select a.id, a.name, b.age from a join b on a.id=b.id;Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36Total jobs = 117/09/16 17:37:52 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicableExecution log at: /opt/software/hive/log/hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36.log2017-09-16 05:37:53 Starting to launch local task to process map join; maximum memory = 5189795842017-09-16 05:37:55 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-37-49_384_3178169469256003866-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable2017-09-16 05:37:55 Uploaded 1 File to: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-37-49_384_3178169469256003866-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile11--.hashtable (320 bytes)2017-09-16 05:37:55 End of local task; Time Taken: 2.002 sec.Execution completed successfullyMapredLocal task succeededLaunching Job 1 out of 1Number of reduce tasks is set to 0 since theres no reduce operatorJob running in-process (local Hadoop)2017-09-16 17:37:56,968 Stage-3 map = 100%, reduce = 0%Ended Job = job_local1965457212_0036MapReduce Jobs Launched:Stage-Stage-3: HDFS Read: 55888 HDFS Write: 10506 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK1 zhangsa 282 lisi 19Time taken: 7.589 seconds, Fetched: 2 row(s)
Left Join
- 使用左表作為主表,返回紀錄數與左表相同。
- 關聯不上的使用NULL表示。123456789101112131415161718192021222324hive> select a.id, a.name, b.age from a left join b on a.id=b.id;Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36Total jobs = 117/09/16 17:39:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicableExecution log at: /opt/software/hive/log/hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36.log2017-09-16 05:39:47 Starting to launch local task to process map join; maximum memory = 5189795842017-09-16 05:39:48 Dump the side-table for tag: 1 with group count: 3 into file: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-39-43_925_9210396379212280287-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable2017-09-16 05:39:48 Uploaded 1 File to: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-39-43_925_9210396379212280287-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile21--.hashtable (320 bytes)2017-09-16 05:39:48 End of local task; Time Taken: 0.912 sec.Execution completed successfullyMapredLocal task succeededLaunching Job 1 out of 1Number of reduce tasks is set to 0 since theres no reduce operatorJob running in-process (local Hadoop)2017-09-16 17:39:50,005 Stage-3 map = 100%, reduce = 0%Ended Job = job_local296218704_0037MapReduce Jobs Launched:Stage-Stage-3: HDFS Read: 55914 HDFS Write: 10506 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK1 zhangsa 282 lisi 193 wangwu NULLTime taken: 6.09 seconds, Fetched: 3 row(s)
Right Join
- 使用右表作為主表,返回紀錄數與右表相同。
- 關聯不上的使用NULL表示。12345678910111213141516171819202122232425hive> select a.id, a.name, b.age from a right join b on a.id=b.id;Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36Total jobs = 117/09/16 17:40:40 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicableExecution log at: /opt/software/hive/log/hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36.log2017-09-16 05:40:41 Starting to launch local task to process map join; maximum memory = 5189795842017-09-16 05:40:42 Dump the side-table for tag: 0 with group count: 3 into file: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-40-37_662_8266497628801648044-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile30--.hashtable2017-09-16 05:40:42 Uploaded 1 File to: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-40-37_662_8266497628801648044-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile30--.hashtable (337 bytes)2017-09-16 05:40:42 End of local task; Time Taken: 1.125 sec.Execution completed successfullyMapredLocal task succeededLaunching Job 1 out of 1Number of reduce tasks is set to 0 since theres no reduce operatorJob running in-process (local Hadoop)2017-09-16 17:40:43,702 Stage-3 map = 100%, reduce = 0%Ended Job = job_local2139900301_0038MapReduce Jobs Launched:Stage-Stage-3: HDFS Read: 55929 HDFS Write: 10506 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK1 zhangsa 282 lisi 19NULL NULL 21Time taken: 6.042 seconds, Fetched: 3 row(s)-- 最後一列的id為null而不是為4,是因為select語句中,是取用table a的id,而不是table b的id。
Full Join
- 以兩個表的紀錄為基準,返回兩個表的紀錄去重之和。
- 關聯不上的使用NULL表示。
- 在此情況下,Hive不使用Map Join來優化。123456789101112131415161718192021222324hive> select a.id, a.name, b.age from a full join b on a.id=b.id;Query ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36Total jobs = 1Launching Job 1 out of 1Number of reduce tasks not specified. Estimated from input data size: 1In order to change the average load for a reducer (in bytes):set hive.exec.reducers.bytes.per.reducer=<number>In order to limit the maximum number of reducers:set hive.exec.reducers.max=<number>In order to set a constant number of reducers:set mapreduce.job.reduces=<number>Job running in-process (local Hadoop)2017-09-16 17:42:15,465 Stage-1 map = 0%, reduce = 0%2017-09-16 17:42:16,467 Stage-1 map = 100%, reduce = 100%Ended Job = job_local1292867757_0039MapReduce Jobs Launched:Stage-Stage-1: HDFS Read: 167895 HDFS Write: 31518 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK1 zhangsa 282 lisi 193 wangwu NULLNULL NULL 21Time taken: 2.285 seconds, Fetched: 4 row(s)
Cross Join
- 返回兩個表的笛卡兒積(Cartesian product)結果
- 不需指定關聯鍵,及不需使用on關鍵字。12345678910111213141516171819202122232425262728293031hive> select a.id, a.name, b.age from a cross join b;Warning: Map Join MAPJOIN[7][bigTable=a] in task 'Stage-3:MAPRED' is a cross productQuery ID = hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36Total jobs = 117/09/16 17:44:22 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicableExecution log at: /opt/software/hive/log/hadoop_20170914014141_a03e8794-0c51-414f-87d7-50236e8e1b36.log2017-09-16 05:44:23 Starting to launch local task to process map join; maximum memory = 5189795842017-09-16 05:44:23 Dump the side-table for tag: 1 with group count: 1 into file: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-44-19_720_6308228100696237254-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile51--.hashtable2017-09-16 05:44:24 Uploaded 1 File to: file:/tmp/hadoop/2d0fe1f1-954f-4cb7-bc14-1f05d5de3555/hive_2017-09-16_17-44-19_720_6308228100696237254-1/-local-10003/HashTable-Stage-3/MapJoin-mapfile51--.hashtable (290 bytes)2017-09-16 05:44:24 End of local task; Time Taken: 0.961 sec.Execution completed successfullyMapredLocal task succeededLaunching Job 1 out of 1Number of reduce tasks is set to 0 since theres no reduce operatorJob running in-process (local Hadoop)2017-09-16 17:44:25,491 Stage-3 map = 100%, reduce = 0%Ended Job = job_local1992175647_0041MapReduce Jobs Launched:Stage-Stage-3: HDFS Read: 56022 HDFS Write: 10506 SUCCESSTotal MapReduce CPU Time Spent: 0 msecOK1 zhangsa 281 zhangsa 191 zhangsa 212 lisi 282 lisi 192 lisi 213 wangwu 283 wangwu 193 wangwu 21Time taken: 5.781 seconds, Fetched: 9 row(s)
Hive join
- Hive中的join,分成下列兩種:
- Common join: 又稱Shuffle join、Reduce join
- Map join: 又稱Broadcast join
- hive-site.xml相關屬性:
Property | Description | Default Value |
---|---|---|
hive.auto.convert.join | 是否轉換為Map Join | false in 0.7.0 to 0.10.0; true in 0.11.0 and later |
hive.mapjoin.smalltable.filesize | 判斷為小表的上限依據 | 25MB |
hive.auto.convert.join.noconditionaltask | 是否將多個Map join合併為一個 | true |
hive.auto.convert.join.noconditionaltask.size | 多個Map join轉換為1個時,所有小表文件大小總和的最大值 | 10MB |
Common Join
|
|
執行計畫說明
上面為Common Join的執行計畫,下面將分段解釋所代表意思。
- 共生成2個Stage,且需要先執行完Stage-1,才能執行Stage-0。
- Stage-1區段中的Mapreduce關鍵字說明此為一個MapReduece Job,分別描述在Map Operator Tree與Reduce Operator Tree區段。
|
|
- Map Operator Tree
- TableScan: 進行讀表操作。
- alias: 目標表的名稱。
- Statistics: 讀入數據的相關資訊。
- Filter Operator: 雖然SQL語句中不包含Where,但是join語句中隱藏一個過濾條件(on關鍵字)。
- predicate: join條件。
|
|
- Reduce Output Operator
- 及為Mapper的輸出。
- key expressions: 對應join語句中on的對象。
- value expressions:value是一系列的字段,對應select語句。
|
|
- Reduce Operator
- Inner Join 0 to 1: 此為一個inner join操作。
- keys: join操作時,各表所使用的的key。
- Select Operator: reduce結果輸出的相關訊息。
- compressed: 是否使用壓縮。
- Statistics: 輸出數據的相關統計訊息。
- input/output format: 輸出入檔案的格式。
- serde: 序列化物件。
小結
上面為Common Join的流程圖以及執行計畫說明,在MapReduce的執行common join過程中,需要Shuffle進行重新洗牌的操作,也就是將key進行Hash處理後相同的值,放置在一塊,作為Reduce的輸入。這個過程中如果傳輸的量很大會造成一定的網路資源開銷,且可能造成OOM的問題,導致計算任務失敗。
若Shuffle是計算瓶頸所在,而Shuffle是為Reduce的前置作業,如何去除掉Shuffle,又能達到Reduce的結果。故Map Join就是用來解決此問題。
Map Join
|
|
執行計畫說明
上面為Map Join的執行計畫,下面將分段解釋所代表意思。
- 此執行計畫共分成3個Stage,依次為Stage-4、Stage-3、Stage-0,且由後往前為依賴關係,需要先執行完Stage-4,才能執行Stage-3,最後執行Stage-0。
|
|
- Map Reduce Local Work: 啟動一個MapReduce Local Task。對應圖中的MapReduce Local Task方框。
- TableScan
- alias: b: 根據MetaData中的統計訊息自動決定b表為小表,並且進行讀入數據。
- HashTable Sink Operator: 輸出HashTable並存到HDFS上,對應圖中的HashTable File。
|
|
- Stage: Stage-3: 為一個MapReduce Job,但其只包含Map Task,並將計算結果輸出至HDFS上。
小結
- 官方文檔 JoinOptimization
- 上圖為Map Join的流程圖以及執行計畫,Map Join會將小表全部讀入記憶體中,並產生HashTable檔案,接著將HashTable加載到分布式緩存中。在map階段,直接拿HashTable與大表中的紀錄做匹配。
- 由於在map中進行了join的操作,省去了shuffle與reduce的過程,執行效率提高很多,並且省去Shuffle階段要進行的大量資料傳輸,達成優化作業的作用。
- 使用Map Join時要注意小表的大小,因為會將小表全部寫入分布式記憶體中,故當所需記憶體過大時,會發生OOM。而小表的上限判定值,可以根據屬性來設定。
- 要使Map Join能夠順利進行,必須滿足: 除了一份表的資料分佈在不同的Map中之外,其他連接的表的資料必須在每個Map中有完整的複製。
Hive log
- 查看執行Common join跟Map Join對應的Hive Log。
- Hive Log的位置在$HIVE_HOME/conf/hive-log4j.properties中的以下參數設定:
- hive.log.dir=${java.io.tmpdir}/${user.name}
- hive.log.file=hive.log
Common Join Log
|
|
Common Join Log說明:
- 2017-09-23 18:01:04,757 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - 2017-09-23 18:01:04,757 Stage-1 map = 100%, reduce = 100%
- 需要進行Map與Reduce,也就是需要進行Shuffle操作。
Map Join Log
|
|
Map Join Log說明:
2017-09-23 18:00:05,122 INFO [main]: ql.Driver (Driver.java:launchTask(1772)) - Starting task [Stage-4:MAPREDLOCAL] in serial mode
- 此為一個MapReduce Local Task。對應Map Join圖中的MapReduce Local Task方框。
2017-09-23 18:00:10,526 INFO [main]: exec.Task (SessionState.java:printInfo(927)) - Number of reduce tasks is set to 0 since theres no reduce operator
- 此操作沒有Reduce Task