Spark SQL working with Hive

Spark SQL

Spark SQL Introduction

  • Spark SQL
  • Spark SQL is the module of Apache Spark for processing structured data.
  • It is not limited to running SQL; it can also read, transform, and write external data sources such as Hive and Parquet.
  • SQL statements and the DataFrame API can be used directly inside a Spark program.
  • It provides a uniform way to access different external data sources, such as Hive, Avro, Parquet, and ORC, and supports joins across data sources (see the sketch below).
  • Existing Hive deployments can be accessed through the Hive MetaStore.
  • Access over JDBC and ODBC is supported.
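
As a quick illustration of the points above, here is a minimal sketch only: the file paths and the column names ename/dname/deptno are made up for illustration. It loads two different source formats through the generic reader, registers them as temp views, and joins them with SQL.

import org.apache.spark.sql.SparkSession

object DataSourceJoinExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("datasource-join-example")
      .master("local[2]")
      .getOrCreate()

    // Generic load: the same reader API handles parquet, orc, json, csv, ...
    val dept = spark.read.parquet("/tmp/dept.parquet") // hypothetical path
    val emp  = spark.read.json("/tmp/emp.json")        // hypothetical path

    // SQL and the DataFrame API are interchangeable over the same data.
    dept.createOrReplaceTempView("dept")
    emp.createOrReplaceTempView("emp")
    spark.sql(
      "select e.ename, d.dname from emp e join dept d on e.deptno = d.deptno"
    ).show()

    spark.stop()
  }
}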

Spark SQL Working with Hive

  • Hive Tables
  • Spark SQL supports reading data from and writing data to Hive directly.
  • Spark SQL does not bundle the Hive dependencies, but as long as they can be found on the classpath (e.g. by setting the $HIVE_HOME environment variable), Spark SQL loads them automatically.
  • The Hive dependencies must be present on every Spark worker node, because Spark SQL relies on Hive's serialization and deserialization (SerDe) libraries to access data stored in Hive.
  • When running on YARN in production, configuring hive-site.xml is enough; by pointing to YARN's conf directory, core-site.xml and hdfs-site.xml are found automatically.
  • Spark SQL does not need a Hive installation in production; it only needs hive-site.xml with the MetaStore connection details so that Spark SQL can reach the MetaStore, as in the sketch below.
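
In application code the Hive integration is enabled on the SparkSession builder. A minimal sketch, assuming hive-site.xml has already been placed in $SPARK_HOME/conf as described in the next section:

import org.apache.spark.sql.SparkSession

// Hive support is opt-in: enableHiveSupport() makes the session use the Hive
// MetaStore (located via hive-site.xml) and Hive SerDes for table access.
val spark = SparkSession.builder()
  .appName("spark-sql-with-hive")
  .enableHiveSupport()
  .getOrCreate()

spark.sql("show databases").show()

In spark-shell and spark-sql the session (the spark variable) is created for you, so the builder call is only needed in a standalone application.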
    

Preparing for Working with Hive

  • Start HDFS
    • Hive data is stored on HDFS, so HDFS must be running.
  • Copy hive-site.xml into $SPARK_HOME/conf
    • This tells Spark SQL how to reach the MetaStore.
  • Make sure MySQL is running
    • Hive's metadata lives in the MetaStore, and the MetaStore is backed by MySQL.
  • Pass the MySQL driver to Spark SQL
    • Specify it with --jars.
## Start HDFS
[hadoop@testmain ~]$ start-dfs.sh
Starting namenodes on [testmain]
testmain: starting namenode, logging to /opt/software/hadoop-2.8.1/logs/hadoop-hadoop-namenode-testmain.out
localhost: starting datanode, logging to /opt/software/hadoop-2.8.1/logs/hadoop-hadoop-datanode-testmain.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/software/hadoop-2.8.1/logs/hadoop-hadoop-secondarynamenode-testmain.out
## Use jps to confirm the HDFS daemons are running (NameNode, DataNode)
[hadoop@testmain conf]$ jps
50201 SecondaryNameNode
49993 DataNode
50473 Jps
49882 NameNode
50442 SparkSubmit
## Copy hive-site.xml
[hadoop@testmain ~]$ cat $HIVE_HOME/conf/hive-site.xml
<configuration>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://localhost:3306/hive_database?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>com.mysql.jdbc.Driver</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>root</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>password</value>
</property>
</configuration>
[hadoop@testmain ~]$ cp $HIVE_HOME/conf/hive-site.xml $SPARK_HOME/conf/
## Confirm MySQL is running
[hadoop@testmain ~]$ service mysql status
MySQL running (3812) [ OK ]
  • Create Hive Table
[hadoop@testmain ~]$ cat dept.txt
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
[hadoop@testmain ~]$ hive
## ...
hive> create table dept(
> deptno int,
> dname string,
> loc string
> )
> row format delimited fields terminated by '\t';
OK
Time taken: 1.661 seconds
hive> show tables;
OK
dept
Time taken: 0.228 seconds, Fetched: 1 row(s)
hive> load data local inpath '/home/hadoop/dept.txt' into table dept;
Loading data to table default.dept
Table default.dept stats: [numFiles=1, totalSize=80]
OK
Time taken: 1.42 seconds
hive> select * from dept;
OK
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
Time taken: 0.435 seconds, Fetched: 4 row(s)
  • spark-shell
    • Pass the MySQL driver with --jars
    • Access the data written by Hive directly
[hadoop@testmain ~]$ spark-shell --master local[2] --jars /opt/software/hive/lib/mysql-connector-java-5.1.44-bin.jar
## ...
scala> spark.sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
| default|     dept|      false|
+--------+---------+-----------+
scala> spark.sql("select * from dept").show
+------+----------+--------+
|deptno|     dname|     loc|
+------+----------+--------+
|    10|ACCOUNTING|NEW YORK|
|    20|  RESEARCH|  DALLAS|
|    30|     SALES| CHICAGO|
|    40|OPERATIONS|  BOSTON|
+------+----------+--------+
## Access the table through the Spark API
scala> spark.table("dept").show
+------+----------+--------+
|deptno|     dname|     loc|
+------+----------+--------+
|    10|ACCOUNTING|NEW YORK|
|    20|  RESEARCH|  DALLAS|
|    30|     SALES| CHICAGO|
|    40|OPERATIONS|  BOSTON|
+------+----------+--------+
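Reading is only half of the story; since the same session has Hive support, a derived DataFrame can also be written back into the Hive warehouse. A minimal sketch (the table name dept_ny is hypothetical):

// Still inside the same spark-shell session.
val ny = spark.table("dept").where("loc = 'NEW YORK'") // derive a DataFrame
ny.write.mode("overwrite").saveAsTable("dept_ny")      // persist it as a Hive table
spark.sql("select * from dept_ny").show()              // visible to SQL (and to Hive)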
  • The corresponding job appears in the Web UI
    • The upper-right corner shows it was submitted by the Spark shell
  • spark-sql
    • Takes the same arguments as spark-shell
[hadoop@testmain ~]$ spark-sql --master local[2] --jars /opt/software/hive/lib/mysql-connector-java-5.1.44-bin.jar
## ...
spark-sql> show tables;
17/11/27 21:55:38 INFO execution.SparkSqlParser: Parsing command: show tables
17/11/27 21:55:40 INFO metastore.HiveMetaStore: 0: get_database: default
17/11/27 21:55:40 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: default
17/11/27 21:55:40 INFO metastore.HiveMetaStore: 0: get_database: default
17/11/27 21:55:40 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_database: default
17/11/27 21:55:40 INFO metastore.HiveMetaStore: 0: get_tables: db=default pat=*
17/11/27 21:55:40 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_tables: db=default pat=*
17/11/27 21:55:41 INFO codegen.CodeGenerator: Code generated in 500.008587 ms
default dept false
Time taken: 3.373 seconds, Fetched 1 row(s)
17/11/27 21:55:41 INFO CliDriver: Time taken: 3.373 seconds, Fetched 1 row(s)
spark-sql> select * from dept;
17/11/27 21:56:34 INFO execution.SparkSqlParser: Parsing command: select * from dept
17/11/27 21:56:34 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=dept
17/11/27 21:56:34 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=dept
17/11/27 21:56:35 INFO parser.CatalystSqlParser: Parsing command: int
17/11/27 21:56:35 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 21:56:35 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 21:56:35 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 282.0 KB, free 413.6 MB)
17/11/27 21:56:36 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.3 KB, free 413.6 MB)
17/11/27 21:56:36 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.128.91:35753 (size: 24.3 KB, free: 413.9 MB)
17/11/27 21:56:36 INFO spark.SparkContext: Created broadcast 0 from
17/11/27 21:56:36 INFO mapred.FileInputFormat: Total input paths to process : 1
17/11/27 21:56:36 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
17/11/27 21:56:36 INFO scheduler.DAGScheduler: Got job 0 (processCmd at CliDriver.java:376) with 1 output partitions
17/11/27 21:56:36 INFO scheduler.DAGScheduler: Final stage: ResultStage 0 (processCmd at CliDriver.java:376)
17/11/27 21:56:36 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/11/27 21:56:36 INFO scheduler.DAGScheduler: Missing parents: List()
17/11/27 21:56:36 INFO scheduler.DAGScheduler: Submitting ResultStage 0 (MapPartitionsRDD[4] at processCmd at CliDriver.java:376), which has no missing parents
17/11/27 21:56:37 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 8.2 KB, free 413.6 MB)
17/11/27 21:56:37 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 4.5 KB, free 413.6 MB)
17/11/27 21:56:37 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.128.91:35753 (size: 4.5 KB, free: 413.9 MB)
17/11/27 21:56:37 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/11/27 21:56:37 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0))
17/11/27 21:56:37 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/11/27 21:56:37 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 4879 bytes)
17/11/27 21:56:37 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/11/27 21:56:37 INFO executor.Executor: Fetching spark://192.168.128.91:60705/jars/mysql-connector-java-5.1.44-bin.jar with timestamp 1511790909399
17/11/27 21:56:37 INFO client.TransportClientFactory: Successfully created connection to /192.168.128.91:60705 after 74 ms (0 ms spent in bootstraps)
17/11/27 21:56:37 INFO util.Utils: Fetching spark://192.168.128.91:60705/jars/mysql-connector-java-5.1.44-bin.jar to /tmp/spark-cb2fad97-8950-4ceb-9648-99eb04fd3e86/userFiles-10f788de-0008-4937-8306-5d2476343fcc/fetchFileTemp1201349132026277310.tmp
17/11/27 21:56:39 INFO spark.ContextCleaner: Cleaned accumulator 0
17/11/27 21:56:39 INFO executor.Executor: Adding file:/tmp/spark-cb2fad97-8950-4ceb-9648-99eb04fd3e86/userFiles-10f788de-0008-4937-8306-5d2476343fcc/mysql-connector-java-5.1.44-bin.jar to class loader
17/11/27 21:56:39 INFO rdd.HadoopRDD: Input split: hdfs://192.168.128.91:9000/user/hive/warehouse/dept/dept.txt:0+80
17/11/27 21:56:39 INFO codegen.CodeGenerator: Code generated in 98.040246 ms
17/11/27 21:56:40 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 1323 bytes result sent to driver
17/11/27 21:56:40 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 3539 ms on localhost (executor driver) (1/1)
17/11/27 21:56:40 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/11/27 21:56:40 INFO scheduler.DAGScheduler: ResultStage 0 (processCmd at CliDriver.java:376) finished in 3.620 s
17/11/27 21:56:40 INFO scheduler.DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376, took 3.841103 s
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
Time taken: 6.141 seconds, Fetched 4 row(s)
17/11/27 21:56:40 INFO CliDriver: Time taken: 6.141 seconds, Fetched 4 row(s)
  • The corresponding job appears in the Web UI
    • The upper-right corner shows it was submitted by SparkSQL
  • cache table
    • Caching a table makes subsequent operations read directly from the cache, reducing the amount of data that has to be read.
spark-sql> cache table dept;
17/11/27 22:32:58 INFO execution.SparkSqlParser: Parsing command: cache table dept
17/11/27 22:33:00 INFO execution.SparkSqlParser: Parsing command: `dept`
17/11/27 22:33:00 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=dept
17/11/27 22:33:00 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=dept
17/11/27 22:33:00 INFO parser.CatalystSqlParser: Parsing command: int
17/11/27 22:33:01 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 22:33:01 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 22:33:01 INFO memory.MemoryStore: Block broadcast_0 stored as values in memory (estimated size 282.0 KB, free 413.6 MB)
17/11/27 22:33:01 INFO memory.MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 24.3 KB, free 413.6 MB)
17/11/27 22:33:01 INFO storage.BlockManagerInfo: Added broadcast_0_piece0 in memory on 192.168.128.91:41099 (size: 24.3 KB, free: 413.9 MB)
17/11/27 22:33:02 INFO spark.SparkContext: Created broadcast 0 from
17/11/27 22:33:02 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=dept
17/11/27 22:33:02 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=dept
17/11/27 22:33:02 INFO parser.CatalystSqlParser: Parsing command: int
17/11/27 22:33:02 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 22:33:02 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 22:33:02 INFO spark.ContextCleaner: Cleaned accumulator 2
17/11/27 22:33:03 INFO codegen.CodeGenerator: Code generated in 393.10822 ms
17/11/27 22:33:03 INFO codegen.CodeGenerator: Code generated in 41.632135 ms
17/11/27 22:33:03 INFO mapred.FileInputFormat: Total input paths to process : 1
17/11/27 22:33:03 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
17/11/27 22:33:03 INFO scheduler.DAGScheduler: Registering RDD 7 (processCmd at CliDriver.java:376)
17/11/27 22:33:03 INFO scheduler.DAGScheduler: Got job 0 (processCmd at CliDriver.java:376) with 1 output partitions
17/11/27 22:33:03 INFO scheduler.DAGScheduler: Final stage: ResultStage 1 (processCmd at CliDriver.java:376)
17/11/27 22:33:03 INFO scheduler.DAGScheduler: Parents of final stage: List(ShuffleMapStage 0)
17/11/27 22:33:03 INFO scheduler.DAGScheduler: Missing parents: List(ShuffleMapStage 0)
17/11/27 22:33:03 INFO scheduler.DAGScheduler: Submitting ShuffleMapStage 0 (MapPartitionsRDD[7] at processCmd at CliDriver.java:376), which has no missing parents
17/11/27 22:33:03 INFO memory.MemoryStore: Block broadcast_1 stored as values in memory (estimated size 15.9 KB, free 413.6 MB)
17/11/27 22:33:03 INFO memory.MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 7.9 KB, free 413.6 MB)
17/11/27 22:33:03 INFO storage.BlockManagerInfo: Added broadcast_1_piece0 in memory on 192.168.128.91:41099 (size: 7.9 KB, free: 413.9 MB)
17/11/27 22:33:03 INFO spark.SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:1006
17/11/27 22:33:03 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[7] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0))
17/11/27 22:33:03 INFO scheduler.TaskSchedulerImpl: Adding task set 0.0 with 1 tasks
17/11/27 22:33:03 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, ANY, 4868 bytes)
17/11/27 22:33:04 INFO executor.Executor: Running task 0.0 in stage 0.0 (TID 0)
17/11/27 22:33:04 INFO executor.Executor: Fetching spark://192.168.128.91:58196/jars/mysql-connector-java-5.1.44-bin.jar with timestamp 1511793019825
17/11/27 22:33:04 INFO client.TransportClientFactory: Successfully created connection to /192.168.128.91:58196 after 80 ms (0 ms spent in bootstraps)
17/11/27 22:33:04 INFO util.Utils: Fetching spark://192.168.128.91:58196/jars/mysql-connector-java-5.1.44-bin.jar to /tmp/spark-9fb04cb6-aa8c-47fb-b491-17e363c6520b/userFiles-547b2023-5f1d-40d1-a944-a8078cd83294/fetchFileTemp5407950605281319861.tmp
17/11/27 22:33:04 INFO executor.Executor: Adding file:/tmp/spark-9fb04cb6-aa8c-47fb-b491-17e363c6520b/userFiles-547b2023-5f1d-40d1-a944-a8078cd83294/mysql-connector-java-5.1.44-bin.jar to class loader
17/11/27 22:33:04 INFO rdd.HadoopRDD: Input split: hdfs://192.168.128.91:9000/user/hive/warehouse/dept/dept.txt:0+80
17/11/27 22:33:04 INFO codegen.CodeGenerator: Code generated in 40.722113 ms
17/11/27 22:33:05 INFO memory.MemoryStore: Block rdd_4_0 stored as values in memory (estimated size 728.0 B, free 413.6 MB)
17/11/27 22:33:05 INFO storage.BlockManagerInfo: Added rdd_4_0 in memory on 192.168.128.91:41099 (size: 728.0 B, free: 413.9 MB)
17/11/27 22:33:05 INFO codegen.CodeGenerator: Code generated in 14.432673 ms
17/11/27 22:33:05 INFO codegen.CodeGenerator: Code generated in 51.00975 ms
17/11/27 22:33:05 INFO executor.Executor: Finished task 0.0 in stage 0.0 (TID 0). 2462 bytes result sent to driver
17/11/27 22:33:05 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 1592 ms on localhost (executor driver) (1/1)
17/11/27 22:33:05 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
17/11/27 22:33:05 INFO scheduler.DAGScheduler: ShuffleMapStage 0 (processCmd at CliDriver.java:376) finished in 1.637 s
17/11/27 22:33:05 INFO scheduler.DAGScheduler: looking for newly runnable stages
17/11/27 22:33:05 INFO scheduler.DAGScheduler: running: Set()
17/11/27 22:33:05 INFO scheduler.DAGScheduler: waiting: Set(ResultStage 1)
17/11/27 22:33:05 INFO scheduler.DAGScheduler: failed: Set()
17/11/27 22:33:05 INFO scheduler.DAGScheduler: Submitting ResultStage 1 (MapPartitionsRDD[10] at processCmd at CliDriver.java:376), which has no missing parents
17/11/27 22:33:05 INFO memory.MemoryStore: Block broadcast_2 stored as values in memory (estimated size 7.0 KB, free 413.6 MB)
17/11/27 22:33:05 INFO memory.MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.7 KB, free 413.6 MB)
17/11/27 22:33:05 INFO storage.BlockManagerInfo: Added broadcast_2_piece0 in memory on 192.168.128.91:41099 (size: 3.7 KB, free: 413.9 MB)
17/11/27 22:33:05 INFO spark.SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:1006
17/11/27 22:33:05 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 1 (MapPartitionsRDD[10] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0))
17/11/27 22:33:05 INFO scheduler.TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
17/11/27 22:33:05 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, executor driver, partition 0, ANY, 4726 bytes)
17/11/27 22:33:05 INFO executor.Executor: Running task 0.0 in stage 1.0 (TID 1)
17/11/27 22:33:05 INFO storage.ShuffleBlockFetcherIterator: Getting 1 non-empty blocks out of 1 blocks
17/11/27 22:33:05 INFO storage.ShuffleBlockFetcherIterator: Started 0 remote fetches in 10 ms
17/11/27 22:33:05 INFO executor.Executor: Finished task 0.0 in stage 1.0 (TID 1). 1581 bytes result sent to driver
17/11/27 22:33:05 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 1.0 (TID 1) in 78 ms on localhost (executor driver) (1/1)
17/11/27 22:33:05 INFO scheduler.DAGScheduler: ResultStage 1 (processCmd at CliDriver.java:376) finished in 0.076 s
17/11/27 22:33:05 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
17/11/27 22:33:05 INFO scheduler.DAGScheduler: Job 0 finished: processCmd at CliDriver.java:376, took 2.039792 s
Time taken: 7.629 seconds
17/11/27 22:33:05 INFO CliDriver: Time taken: 7.629 seconds
spark-sql> 17/11/27 22:33:23 INFO storage.BlockManagerInfo: Removed broadcast_2_piece0 on 192.168.128.91:41099 in memory (size: 3.7 KB, free: 413.9 MB)
17/11/27 22:33:23 INFO spark.ContextCleaner: Cleaned accumulator 87
## Run the same query again
spark-sql> select * from dept;
17/11/27 22:37:50 INFO execution.SparkSqlParser: Parsing command: select * from dept
17/11/27 22:37:50 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=dept
17/11/27 22:37:50 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=dept
17/11/27 22:37:50 INFO parser.CatalystSqlParser: Parsing command: int
17/11/27 22:37:50 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 22:37:50 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 22:37:50 INFO spark.SparkContext: Starting job: processCmd at CliDriver.java:376
17/11/27 22:37:50 INFO scheduler.DAGScheduler: Got job 1 (processCmd at CliDriver.java:376) with 1 output partitions
17/11/27 22:37:50 INFO scheduler.DAGScheduler: Final stage: ResultStage 2 (processCmd at CliDriver.java:376)
17/11/27 22:37:50 INFO scheduler.DAGScheduler: Parents of final stage: List()
17/11/27 22:37:50 INFO scheduler.DAGScheduler: Missing parents: List()
17/11/27 22:37:50 INFO scheduler.DAGScheduler: Submitting ResultStage 2 (MapPartitionsRDD[12] at processCmd at CliDriver.java:376), which has no missing parents
17/11/27 22:37:50 INFO memory.MemoryStore: Block broadcast_3 stored as values in memory (estimated size 11.6 KB, free 413.6 MB)
17/11/27 22:37:50 INFO memory.MemoryStore: Block broadcast_3_piece0 stored as bytes in memory (estimated size 5.9 KB, free 413.6 MB)
17/11/27 22:37:50 INFO storage.BlockManagerInfo: Added broadcast_3_piece0 in memory on 192.168.128.91:41099 (size: 5.9 KB, free: 413.9 MB)
17/11/27 22:37:50 INFO spark.SparkContext: Created broadcast 3 from broadcast at DAGScheduler.scala:1006
17/11/27 22:37:50 INFO scheduler.DAGScheduler: Submitting 1 missing tasks from ResultStage 2 (MapPartitionsRDD[12] at processCmd at CliDriver.java:376) (first 15 tasks are for partitions Vector(0))
17/11/27 22:37:50 INFO scheduler.TaskSchedulerImpl: Adding task set 2.0 with 1 tasks
17/11/27 22:37:50 INFO scheduler.TaskSetManager: Starting task 0.0 in stage 2.0 (TID 2, localhost, executor driver, partition 0, PROCESS_LOCAL, 4879 bytes)
17/11/27 22:37:50 INFO executor.Executor: Running task 0.0 in stage 2.0 (TID 2)
17/11/27 22:37:50 INFO storage.BlockManager: Found block rdd_4_0 locally
17/11/27 22:37:50 INFO codegen.CodeGenerator: Code generated in 49.98898 ms
17/11/27 22:37:50 INFO executor.Executor: Finished task 0.0 in stage 2.0 (TID 2). 1385 bytes result sent to driver
17/11/27 22:37:50 INFO scheduler.TaskSetManager: Finished task 0.0 in stage 2.0 (TID 2) in 105 ms on localhost (executor driver) (1/1)
17/11/27 22:37:50 INFO scheduler.DAGScheduler: ResultStage 2 (processCmd at CliDriver.java:376) finished in 0.103 s
17/11/27 22:37:50 INFO scheduler.DAGScheduler: Job 1 finished: processCmd at CliDriver.java:376, took 0.131798 s
17/11/27 22:37:50 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
10 ACCOUNTING NEW YORK
20 RESEARCH DALLAS
30 SALES CHICAGO
40 OPERATIONS BOSTON
Time taken: 0.358 seconds, Fetched 4 row(s)
17/11/27 22:37:50 INFO CliDriver: Time taken: 0.358 seconds, Fetched 4 row(s)
spark-sql> 17/11/27 22:38:12 INFO storage.BlockManagerInfo: Removed broadcast_3_piece0 on 192.168.128.91:41099 in memory (size: 5.9 KB, free: 413.9 MB)
17/11/27 22:38:12 INFO spark.ContextCleaner: Cleaned accumulator 88
  • After the cache, the Storage page of the Web UI shows the corresponding entry
  • After the select, the corresponding job appears in the Web UI
    • Note that the input size is 728 B, which matches the cached size.
    • The cache operation in Spark SQL is eager, unlike the cache operation in Spark Core, which is lazy (see the sketch at the end of this section).
  • uncache table
    • After the uncache, the corresponding entry also disappears from the Storage page of the Web UI
spark-sql> uncache table dept;
17/11/27 22:39:44 INFO execution.SparkSqlParser: Parsing command: uncache table dept
17/11/27 22:39:44 INFO execution.SparkSqlParser: Parsing command: `dept`
17/11/27 22:39:44 INFO metastore.HiveMetaStore: 0: get_table : db=default tbl=dept
17/11/27 22:39:44 INFO HiveMetaStore.audit: ugi=hadoop ip=unknown-ip-addr cmd=get_table : db=default tbl=dept
17/11/27 22:39:44 INFO parser.CatalystSqlParser: Parsing command: int
17/11/27 22:39:44 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 22:39:44 INFO parser.CatalystSqlParser: Parsing command: string
17/11/27 22:39:44 INFO rdd.MapPartitionsRDD: Removing RDD 4 from persistence list
17/11/27 22:39:44 INFO storage.BlockManager: Removing RDD 4
Time taken: 0.158 seconds
17/11/27 22:39:44 INFO CliDriver: Time taken: 0.158 seconds
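
To make the eager-versus-lazy distinction mentioned above concrete, the following spark-shell sketch contrasts the two cache paths on the same dept table:

// Eager: CACHE TABLE scans and materializes the data immediately
// (CACHE LAZY TABLE dept would defer it).
spark.sql("cache table dept")
spark.sql("uncache table dept")

// Lazy: .cache() on a DataFrame only marks it for caching...
val df = spark.table("dept").cache()
// ...nothing appears on the Storage page until the first action runs.
df.count()
df.unpersist()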