spark_core_monitoring_and_instrumentation

Spark Monitoring and Instrumentation

  • Spark monitoring
  • Each SparkContext launches a Web UI when it starts, by default on port 4040; if port 4040 is already taken, the port number is incremented (4041, 4042, and so on)
  • This Web UI contains a great deal of information about the running application, including execution details useful for tuning
  • Once the SparkContext's job ends, including abnormal termination, this Web UI disappears with it
  • To keep this important information from being lost, specify the following parameters when the Spark job starts so that the data is persisted (a per-job sketch follows this list):
    • spark.eventLog.enabled true
    • spark.eventLog.dir [Spark history server log directory]
  • Once the information has been persisted, it can be browsed through the Spark history server
  • The Spark history server runs on port 18080 by default and lists both incomplete and completed applications and their attempts
  • Use the following parameter to tell the history server which directory holds the persisted logs; each application's data is written to its own subdirectory under that directory
    • spark.history.fs.logDirectory
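  • As an alternative to spark-defaults.conf, the event-log settings can also be passed per job with --conf. A minimal sketch, assuming the local directory /home/hadoop/spark_log already exists (adjust the path, or use an HDFS URI, to match your environment):
## Hypothetical per-job example: enable event logging for a single spark-shell session
[hadoop@testmain spark]$ bin/spark-shell --master local[2] \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=file:///home/hadoop/spark_log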

History Server Environment Variables and Configuration Options

  • Environment variables
    • SPARK_DAEMON_MEMORY: when the application is a long-running service, such as the Spark thrift server, the logs grow very large and the default 1g heap cannot load them, so this needs to be raised
  • Configuration options
    • spark.history.fs.cleaner.enabled: whether the history server periodically cleans up old event logs
    • spark.history.fs.cleaner.interval: how often the cleaner checks for logs to delete
    • spark.history.fs.cleaner.maxAge: how old a log file must be before it is deleted
    • spark.history.retainedApplications: how many applications' UI data the history server keeps in its cache; if a user opens an application whose UI data is not cached, the history server re-reads it from disk (see the sketch after this list)
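  • A minimal sketch of where these knobs live; the values 2g, 1d, 7d and 50 are illustrative examples, not recommendations:
## conf/spark-env.sh -- raise the history server daemon heap for long-running applications
SPARK_DAEMON_MEMORY=2g
## conf/spark-defaults.conf -- periodic cleanup of old event logs, and size of the UI cache
spark.history.fs.cleaner.enabled true
spark.history.fs.cleaner.interval 1d
spark.history.fs.cleaner.maxAge 7d
spark.history.retainedApplications 50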

History Server Demo

  • Configure spark-defaults.conf
[hadoop@testmain ~]$ cd $SPARK_HOME
[hadoop@testmain spark]$ vim conf/spark-defaults.conf
## Add the following two lines to enable event logging (so the history server has something to read) and set the directory the application logs are written to; an HDFS path also works
spark.eventLog.enabled true
spark.eventLog.dir file:///home/hadoop/spark_log
[hadoop@testmain spark]$ mkdir /home/hadoop/spark_log
  • Start the history server
[hadoop@testmain spark]$ ./sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/software/spark/logs/spark-hadoop-org.apache.spark.deploy.history.HistoryServer-1-testmain.out
  • Check the history server's log: it shows that startup failed, and why
[hadoop@testmain spark]$ vim /opt/software/spark/logs/spark-hadoop-org.apache.spark.deploy.history.HistoryServer-1-testmain.out
## ...
Exception in thread "main" java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.deploy.history.HistoryServer$.main(HistoryServer.scala:278)
at org.apache.spark.deploy.history.HistoryServer.main(HistoryServer.scala)
Caused by: java.io.FileNotFoundException: Log directory specified does not exist: file:/tmp/spark-events Did you configure the correct one through spark.history.fs.logDirectory?
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:214)
at org.apache.spark.deploy.history.FsHistoryProvider.initialize(FsHistoryProvider.scala:160)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:156)
at org.apache.spark.deploy.history.FsHistoryProvider.<init>(FsHistoryProvider.scala:78)
... 6 more
Caused by: java.io.FileNotFoundException: File file:/tmp/spark-events does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
at org.apache.spark.deploy.history.FsHistoryProvider.org$apache$spark$deploy$history$FsHistoryProvider$$startPolling(FsHistoryProvider.scala:204)
... 9 more
  • The message file:/tmp/spark-events does not exist points to the problem; looking this path up in the official documentation shows it is the default value of spark.history.fs.logDirectory, the parameter that tells the history server where to read the application logs from
  • Set the application log location for the history server in spark-env.sh
[hadoop@testmain spark]$ vi conf/spark-env.sh
## An HDFS path can also be used here
SPARK_HISTORY_OPTS="-Dspark.history.fs.logDirectory=file:///home/hadoop/spark_log"
  • Start the history server again and check its log
[hadoop@testmain spark]$ ./sbin/start-history-server.sh
starting org.apache.spark.deploy.history.HistoryServer, logging to /opt/software/spark/logs/spark-hadoop-org.apache.spark.deploy.history.HistoryServer-1-testmain.out
[hadoop@testmain spark]$ vim /opt/software/spark/logs/spark-hadoop-org.apache.spark.deploy.history.HistoryServer-1-testmain.out
Spark Command: /usr/java/jdk1.8.0_144/jre/bin/java -cp /opt/software/spark/conf/:/opt/software/spark/jars/*:/opt/software/hadoop/etc/hadoop/ -Dspark.history.fs.logDirectory=file:///home/hadoop/spark_log -Xmx1g org.apache.spark.deploy.history.HistoryServer
========================================
18/03/24 01:22:18 INFO history.HistoryServer: Started daemon with process name: 12398@testmain
18/03/24 01:22:18 INFO util.SignalUtils: Registered signal handler for TERM
18/03/24 01:22:18 INFO util.SignalUtils: Registered signal handler for HUP
18/03/24 01:22:18 INFO util.SignalUtils: Registered signal handler for INT
18/03/24 01:22:19 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/03/24 01:22:20 INFO spark.SecurityManager: Changing view acls to: hadoop
18/03/24 01:22:20 INFO spark.SecurityManager: Changing modify acls to: hadoop
18/03/24 01:22:20 INFO spark.SecurityManager: Changing view acls groups to:
18/03/24 01:22:20 INFO spark.SecurityManager: Changing modify acls groups to:
18/03/24 01:22:20 INFO spark.SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(hadoop); groups with view permissions: Set(); users with modify permissions: Set(hadoop); groups with modify permissions: Set()
18/03/24 01:22:20 INFO history.FsHistoryProvider: History server ui acls disabled; users with admin permissions: ; groups with admin permissions
18/03/24 01:22:21 INFO util.log: Logging initialized @3377ms
18/03/24 01:22:21 INFO server.Server: jetty-9.3.z-SNAPSHOT
18/03/24 01:22:21 INFO server.Server: Started @3507ms
18/03/24 01:22:21 INFO server.AbstractConnector: Started ServerConnector@69ac76d8{HTTP/1.1,[http/1.1]}{0.0.0.0:18080}
18/03/24 01:22:21 INFO util.Utils: Successfully started service on port 18080.
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@736d6a5c{/,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@85e6769{/json,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@48a12036{/api,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@4e70a728{/static,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO handler.ContextHandler: Started o.s.j.s.ServletContextHandler@2eced48b{/history,null,AVAILABLE,@Spark}
18/03/24 01:22:21 INFO history.HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://192.168.128.91:18080
## Use jps to confirm that the history server is running
[hadoop@testmain spark]$ jps
12398 HistoryServer
12478 Jps
  • Once the history server is up, it can be accessed through its URL
    • The URL is shown on the last line of the log
  • Since there are no application logs yet, neither the Incomplete nor the Completed list has anything to display
  • Generate an application log with spark-shell
[hadoop@testmain spark]$ bin/spark-shell --master local[2] --jars /opt/software/hive/lib/mysql-connector-java-5.1.44-bin.jar
val lines = sc.textFile("file:///home/hadoop/testFile")
val words = lines.flatMap(_.split("\t"))
val pairs = words.map((_, 1))
val wordcount = pairs.reduceByKey(_+_, 5)
wordcount.collect
  • The application now appears under the incomplete applications, and clicking it opens the detailed view; this is exactly the same information as the running application's own Web UI (default port 4040)
  • After sc.stop, the application moves from the incomplete list to the completed list
  • In other words, if sc.stop is never called, the application is never moved to the completed list
sc.stop
  • The Web UI page of the completed application

REST API

  • Spark monitoring REST API
  • The REST API exposed by the history server can be used to build custom visualization or monitoring tools
  • It covers both running applications and applications served by the history server
  • The REST API endpoints are described in detail in the official documentation (see the sketch after this list)
  • Internally, both the REST API and the history server's Web UI work the same way: they read each application's log from spark.history.fs.logDirectory and extract the data from it
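  • For example, the history server's API can be queried with plain HTTP; a minimal sketch, where the host testmain, port 18080 and the application ID are illustrative placeholders:
## List all applications known to the history server
[hadoop@testmain spark]$ curl http://testmain:18080/api/v1/applications
## Jobs and stages of one application (substitute a real application ID from the listing above)
[hadoop@testmain spark]$ curl http://testmain:18080/api/v1/applications/local-1521825738513/jobs
[hadoop@testmain spark]$ curl http://testmain:18080/api/v1/applications/local-1521825738513/stages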