Compile Spark Source Code


Compilation Steps

Basic Information

  • OS: CentOS 6.5 64-bit / macOS Sierra
  • JDK: 8u144
  • Maven: 3.3.9 (bundled with the Spark source code)
  • Apache Spark download: on the Spark download page, choose the Source Code package, download it, and extract it.
[root@hadoop-01 sourcecode]# ls
spark-2.2.0.tgz ## the downloaded Spark 2.2.0 source code tarball
[root@hadoop-01 sourcecode]# tar -zxvf spark-2.2.0.tgz
[root@hadoop-01 sourcecode]# ls
spark-2.2.0
spark-2.2.0.tgz
[root@hadoop-01 sourcecode]# cd spark-2.2.0
  • Building Spark (official documentation)
  • The official documentation describes in detail how to compile Spark; the steps and commands required to build Spark 2.2.0 are listed below.

Building Spark using Maven requires:

  • Maven 3.3.9 or newer
  • Java 8+ (support for Java 7 was removed as of Spark 2.2.0)
  • Set JAVA_HOME
    [root@hadoop-01 spark-2.2.0]# cd build/ ## the Maven bundled inside the Spark source tree
    [root@hadoop-01 build]# mvn -version ## confirm the Maven and Java versions: Maven 3.3.9 and Java 1.8.0_144
    Apache Maven 3.3.9 (bb52d8502b132ec0a5a3f4c09453c07478323dc5; 2015-11-11T00:41:47+08:00)
    Maven home: /opt/software/apache-maven-3.3.9
    Java version: 1.8.0_144, vendor: Oracle Corporation
    Java home: /usr/java/jdk1.8.0_144/jre
    Default locale: en_US, platform encoding: UTF-8
    OS name: "linux", version: "2.6.32-431.el6.x86_64", arch: "amd64", family: "unix"
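The Java requirement above can also be checked in a script; a minimal sketch (the `java_major` helper is illustrative, not part of Spark's build tooling) that extracts the major Java version from a "java -version" style string:

```shell
# Sketch: extract the major Java version from a "java -version" style
# string, so a build script can refuse anything older than Java 8.
# (java_major is a hypothetical helper, not part of Spark's build.)
java_major() {
  ver="$1"                      # e.g. "1.8.0_144" or "9.0.1"
  case "$ver" in
    1.*) ver=${ver#1.} ;;       # legacy scheme: 1.8.0_144 -> 8.0_144
  esac
  printf '%s\n' "${ver%%.*}"    # keep only the leading major number
}

major=$(java_major "1.8.0_144")
if [ "$major" -lt 8 ]; then
  echo "Java 8+ is required to build Spark 2.2.0" >&2
fi
echo "$major"
```

In practice the version string would come from `"$JAVA_HOME/bin/java" -version 2>&1` rather than a literal.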

Setting up Maven’s Memory Usage

  • Use the MAVEN_OPTS environment variable to raise the memory limits available to Maven.
    [root@hadoop-01 spark-2.2.0]# export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"

build/mvn

  • Compile with the Maven bundled in the build directory of the Spark source tree.
  • This command automatically downloads the build dependencies (Maven, Scala, and Zinc) into the build directory and installs them.
  • The following command compiles Spark with the default settings:
    [root@hadoop-01 spark-2.2.0]# ./build/mvn -DskipTests clean package
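A small wrapper can prefer the bundled Maven and fall back to a system installation; a sketch that assumes it runs from the Spark source root (the `MVN` variable name is ours, not a Spark convention):

```shell
# Sketch: prefer the Maven wrapper bundled with the Spark source tree,
# falling back to a system-wide mvn if the wrapper is absent.
if [ -x ./build/mvn ]; then
  MVN=./build/mvn
else
  MVN=$(command -v mvn || echo mvn)
fi
echo "using: $MVN"
# "$MVN" -DskipTests clean package   # the default build from the text
```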

Building a Runnable Distribution

[root@hadoop-01 spark-2.2.0]# vim pom.xml ## first, inspect the parts of pom.xml that matter for this build
<properties>
  <hadoop.version>2.6.5</hadoop.version>
  <protobuf.version>2.5.0</protobuf.version>
  <yarn.version>${hadoop.version}</yarn.version>
</properties>
<!-- A series of build profiles where customizations for particular Hadoop releases can be made -->
<profiles>
  <profile>
    <id>hadoop-2.6</id>
    <!-- Default hadoop profile. Uses global properties. -->
  </profile>
  <profile>
    <id>hadoop-2.7</id>
    <properties>
      <hadoop.version>2.7.3</hadoop.version>
    </properties>
  </profile>
  <profile>
    <id>yarn</id>
    <modules>
      <module>resource-managers/yarn</module>
      <module>common/network-yarn</module>
    </modules>
  </profile>
  <profile>
    <id>hive-thriftserver</id>
    <modules>
      <module>sql/hive-thriftserver</module>
    </modules>
  </profile>
</profiles>
  • What pom.xml tells us:
    • properties section:
      • the build targets Hadoop 2.6.5 by default
      • the default YARN version tracks the Hadoop version
    • profiles section:
      • Hadoop 2.6 and 2.7 are supported; 2.6 is the default
  • The target platform runs Hadoop 2.8.1, so compile with the following command:
    • ./dev/make-distribution.sh --name spark-2.2.0 --tgz -Pyarn -Phadoop-2.8 -Phive -Phive-thriftserver -Dhadoop.version=2.8.1
  • Parameter notes:
    • --name: output name; make-distribution.sh substitutes it for $NAME in spark-$VERSION-bin-$NAME
    • --tgz: produce a tgz package
    • -Pyarn: enable YARN support; selects the matching profile id
    • -Phadoop-2.8: note that the official docs prescribe -Phadoop-2.7 for Hadoop 2.7 and later, and the pom.xml above defines no hadoop-2.8 profile, so this flag only triggers a Maven warning; -Phadoop-2.7 is the documented choice
    • -Phive: enable Hive support
    • -Phive-thriftserver: enable the Hive Thrift server; selects the matching profile id
    • -Dhadoop.version=2.8.1: sets the Hadoop version property to 2.8.1, overriding the 2.6.5 default in pom.xml
  • make-distribution.sh ultimately invokes ./build/mvn, sets MAVEN_OPTS automatically, and appends -DskipTests clean package to the build command
  • Adding -X to the build command produces verbose output, which is useful for diagnosing build failures
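The flags above can be assembled into one reusable command; a sketch that uses the documented -Phadoop-2.7 profile (the DEBUG switch is illustrative, not part of make-distribution.sh):

```shell
# Sketch: assemble the distribution command, with an optional DEBUG switch
# that appends -X for verbose Maven output. -Phadoop-2.7 follows the
# official docs for Hadoop 2.7+; adjust hadoop.version to your cluster.
DEBUG=${DEBUG:-no}
cmd="./dev/make-distribution.sh --name spark-2.2.0 --tgz -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -Dhadoop.version=2.8.1"
if [ "$DEBUG" = "yes" ]; then
  cmd="$cmd -X"
fi
echo "$cmd"   # print for review; run it from the Spark source root
```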
    [root@hadoop-01 spark-2.2.0]# ./dev/make-distribution.sh --name spark-2.2.0 --tgz -Pyarn -Phadoop-2.8 -Phive -Phive-thriftserver -Dhadoop.version=2.8.1
    main:
    [INFO] Executed tasks
    [INFO] ------------------------------------------------------------------------
    [INFO] Reactor Summary:
    [INFO]
    [INFO] Spark Project Parent POM ........................... SUCCESS [ 7.014 s]
    [INFO] Spark Project Tags ................................. SUCCESS [ 7.707 s]
    [INFO] Spark Project Sketch ............................... SUCCESS [ 6.783 s]
    [INFO] Spark Project Networking ........................... SUCCESS [ 20.351 s]
    [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [ 15.513 s]
    [INFO] Spark Project Unsafe ............................... SUCCESS [ 15.655 s]
    [INFO] Spark Project Launcher ............................. SUCCESS [ 19.657 s]
    [INFO] Spark Project Core ................................. SUCCESS [03:15 min]
    [INFO] Spark Project ML Local Library ..................... SUCCESS [ 15.572 s]
    [INFO] Spark Project GraphX ............................... SUCCESS [ 27.722 s]
    [INFO] Spark Project Streaming ............................ SUCCESS [01:01 min]
    [INFO] Spark Project Catalyst ............................. SUCCESS [01:50 min]
    [INFO] Spark Project SQL .................................. SUCCESS [02:45 min]
    [INFO] Spark Project ML Library ........................... SUCCESS [01:51 min]
    [INFO] Spark Project Tools ................................ SUCCESS [ 2.650 s]
    [INFO] Spark Project Hive ................................. SUCCESS [ 59.211 s]
    [INFO] Spark Project REPL ................................. SUCCESS [ 8.718 s]
    [INFO] Spark Project YARN Shuffle Service ................. SUCCESS [ 18.745 s]
    [INFO] Spark Project YARN ................................. SUCCESS [ 18.137 s]
    [INFO] Spark Project Hive Thrift Server ................... SUCCESS [ 38.423 s]
    [INFO] Spark Project Assembly ............................. SUCCESS [ 4.710 s]
    [INFO] Spark Project External Flume Sink .................. SUCCESS [ 16.745 s]
    [INFO] Spark Project External Flume ....................... SUCCESS [ 17.172 s]
    [INFO] Spark Project External Flume Assembly .............. SUCCESS [ 4.797 s]
    [INFO] Spark Integration for Kafka 0.8 .................... SUCCESS [ 15.995 s]
    [INFO] Kafka 0.10 Source for Structured Streaming ......... SUCCESS [ 13.452 s]
    [INFO] Spark Project Examples ............................. SUCCESS [ 30.937 s]
    [INFO] Spark Project External Kafka Assembly .............. SUCCESS [ 6.090 s]
    [INFO] Spark Integration for Kafka 0.10 ................... SUCCESS [ 14.209 s]
    [INFO] Spark Integration for Kafka 0.10 Assembly .......... SUCCESS [ 5.213 s]
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
    [INFO] Total time: 17:37 min
    [INFO] Finished at: 2017-09-02T00:22:22+08:00
    [INFO] Final Memory: 83M/352M
    + TARDIR_NAME=spark-2.2.0-bin-custom-spark
    + TARDIR=/opt/sourcecode/spark-2.2.0/spark-2.2.0-bin-custom-spark
    ## the TARDIR_NAME and TARDIR variables show the name and location of the compiled tgz package
    [root@hadoop-01 spark-2.2.0]# ll | grep spark-2.2.0-bin-custom-spark
    -rw-r--r--. 1 root root 194582530 Sep 2 00:22 spark-2.2.0-bin-custom-spark.tgz
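Before deploying, the resulting tarball can be sanity-checked; a minimal sketch (the file name is taken from the TARDIR_NAME shown in the build log above):

```shell
# Sketch: confirm the build output exists and peek at its contents.
# (The file name comes from the TARDIR_NAME variable in the build log.)
DIST=spark-2.2.0-bin-custom-spark.tgz
if [ -f "$DIST" ]; then
  tar -tzf "$DIST" | head -n 5   # list the first few archive entries
  tar -zxf "$DIST"               # extract alongside the source tree
fi
echo "$DIST"
```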