spark_core_monitoring_and_instrumentation

Spark Monitoring and Instrumentation

  • Spark monitoring
  • Each SparkContext launches a Web UI when it starts, by default on port 4040; if port 4040 is already taken, the port number is incremented (4041, 4042, and so on)
  • This Web UI contains a great deal of information about the running application, including execution details useful for tuning
  • Once the SparkContext's job ends, including abnormal termination, this Web UI disappears with it
  • To keep this important information from being lost, specify the following parameters when the Spark job starts so that the data is persisted (a per-job sketch follows this list):
    • spark.eventLog.enabled true
    • spark.eventLog.dir [Spark history server log directory]
  • Once the information has been persisted, it can be browsed through the Spark history server
  • The Spark history server runs on port 18080 by default and lists both incomplete and completed applications and their attempts
  • Use the following parameter to tell the history server which directory holds the persisted logs; each application's data is written to its own subdirectory under that directory
    • spark.history.fs.logDirectory
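  • As an alternative to spark-defaults.conf, the event-log settings can also be passed per job with --conf. A minimal sketch, assuming the local directory /home/hadoop/spark_log already exists (adjust the path, or use an HDFS URI, to match your environment):
## Hypothetical per-job example: enable event logging for a single spark-shell session
[hadoop@testmain spark]$ bin/spark-shell --master local[2] \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///home/hadoop/spark_log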

History Server Environment Variables and Configuration Options

  • Environment variables
    • SPARK_DAEMON_MEMORY: when the application is a long-running service, such as the Spark thrift server, the logs grow very large and the default 1g heap cannot load them, so this needs to be raised
  • Configuration options
    • spark.history.fs.cleaner.enabled: whether the history server periodically cleans up old event logs
    • spark.history.fs.cleaner.interval: how often the cleaner checks for logs to delete
    • spark.history.fs.cleaner.maxAge: how old a log file must be before it is deleted
    • spark.history.retainedApplications: how many applications' UI data the history server keeps in its cache; if a user opens an application whose UI data is not cached, the history server re-reads it from disk (see the sketch after this list)
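  • A minimal sketch of where these knobs live; the values 2g, 1d, 7d and 50 are illustrative examples, not recommendations:
## conf/spark-env.sh -- raise the history server daemon heap for long-running applications
SPARK_DAEMON_MEMORY=2g
## conf/spark-defaults.conf -- periodic cleanup of old event logs, and size of the UI cache
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d
spark.history.retainedApplications 50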

History Server Demo

  • Configure spark-defaults.conf
[hadoop@testmain ~]$ cd $SPARK_HOME
[hadoop@testmain spark]$ vim conf/spark-defaults.conf
## Add the following two lines to enable event logging (so the history server has something to read) and set the directory the application logs are written to; an HDFS path also works
spark.eventLog.enabled true
spark.eventLog.dir file:///home/hadoop/spark_log
[hadoop@testmain spark]$ mkdir /home/hadoop/spark_log
  • Start the history server
[hadoop@testmain spark]$ ./sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/software/spark/logs/spark-hadoop-org.apache.spark.deploy.history.HistoryServer-1-testmain.out
  • Check the history server's log: it shows that startup failed, and why
[hadoop@testmain spark]$ vim /opt/software/spark/logs/spark-hadoop-org.apache.spark.deploy.history.HistoryServer-1-testmain.out
## ...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:278)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.io.FileNotFoundException: Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory?
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:214)
at org.apache.spark.deploy.history.FsHistoryProvider.initialize(FsHistoryProvider.scala:160)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:156)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:78)
... 6 more
Caused by: java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:204)
... 9 more
  • The message file:/tmp/spark-events does not exist points to the problem; looking this path up in the official documentation shows it is the default value of spark.history.fs.logDirectory, the parameter that tells the history server where to read the application logs from
  • Set the application log location for the history server in spark-env.sh
[hadoop@testmain spark]$ vi conf/spark-env.sh
## An HDFS path can also be used here
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:///home/hadoop/spark_log"
  • Start the history server again and check its log
[hadoop@testmain spark]$ ./sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/software/spark/logs/spark-hadoop-org.apache.spark.deploy.history.HistoryServer-1-testmain.out
[hadoop@testmain spark]$ vim /opt/software/spark/logs/spark-hadoop-org.apache.spark.deploy.history.HistoryServer-1-testmain.out
Spark Command: /usr/java/jdk1.8.0_144/jre/bin/java -cp /opt/software/spark/conf/:/opt/software/spark/jars/*:/opt/software/hadoop/etc/hadoop/ -Dspark.history.fs.logDirectory=file:///home/hadoop/spark_log -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
18/03/24 01:22:18 INFO history.HistoryServer: Started daemon with process name: 12398@testmain
18/03/24 01:22:18 INFO util.SignalUtils: Registered signal handler for TERM
18/03/24 01:22:18 INFO util.SignalUtils: Registered signal handler for HUP
18/03/24 01:22:18 INFO util.SignalUtils: Registered signal handler for INT
18/03/24 01:22:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/24 01:22:20 INFO spark.SecurityManager: Changing view acls to: hadoop
18/03/24 01:22:20 INFO spark.SecurityManager: Changing modify acls to: hadoop
18/03/24 01:22:20 INFO spark.SecurityManager: Changing view acls groups to:
18/03/24 01:22:20 INFO spark.SecurityManager: Changing modify acls groups to:
18/03/24 01:22:20 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/03/24 01:22:20 INFO history.FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
18/03/24 01:22:21 INFO util.log: Logging initialized @3377ms
18/03/24 01:22:21 INFO server.Server: jetty-9.3.z-SNAPSHOT
18/03/24 01:22:21 INFO server.Server: Started @3507ms
18/03/24 01:22:21 INFO server.AbstractConnector: Started ServerConnector@69ac76d8{HTTP/1.1,[http/1.1]}{0.0.0.0:18080}
18/03/24 01:22:21 INFO util.Utils: Successfully started service on port 18080.
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@736d6a5c{/,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@85e6769{/json,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@48a12036{/api,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4e70a728{/static,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2eced48b{/history,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO history.HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://192.168.128.91:18080
## Use jps to confirm that the history server is running
[hadoop@testmain spark]$ jps
12398 HistoryServer
12478 Jps
  • Once the history server is up, it can be accessed through its URL
    • The URL is shown on the last line of the log
  • Since there are no application logs yet, neither the Incomplete nor the Completed list has anything to display
  • Generate an application log with spark-shell
[hadoop@testmain spark]$ bin/spark-shell --master local[2] --jars /opt/software/hive/lib/mysql-connector-java-5.1.44-bin.jar
val lines = sc.textFile("file:///home/hadoop/testFile")
val words = lines.flatMap(_.split("\t"))
val pairs = words.map((_, 1))
val wordcount = pairs.reduceByKey(_+_, 5)
wordcount.collect
  • The application now appears under the incomplete applications, and clicking it opens the detailed view; this is exactly the same information as the running application's own Web UI (default port 4040)
  • After sc.stop, the application moves from the incomplete list to the completed list
  • In other words, if sc.stop is never called, the application is never moved to the completed list
sc.stop
  • The Web UI page of the completed application

REST API

  • Spark monitoring REST API
  • The REST API exposed by the history server can be used to build custom visualization or monitoring tools
  • It covers both running applications and applications served by the history server
  • The REST API endpoints are described in detail in the official documentation (see the sketch after this list)
  • Internally, both the REST API and the history server's Web UI work the same way: they read each application's log from spark.history.fs.logDirectory and extract the data from it
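  • For example, the history server's API can be queried with plain HTTP; a minimal sketch, where the host testmain, port 18080 and the application ID are illustrative placeholders:
## List all applications known to the history server
[hadoop@testmain spark]$ curl http://testmain:18080/api/v1/applications
## Jobs and stages of one application (substitute a real application ID from the listing above)
[hadoop@testmain spark]$ curl http://testmain:18080/api/v1/applications/local-1521825738513/jobs
[hadoop@testmain spark]$ curl http://testmain:18080/api/v1/applications/local-1521825738513/stages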