Spark SQL DataFrame Operations

Code

import org.apache.spark.sql.SparkSession
object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("DataFrameDemo")
      .getOrCreate()
    val df = spark.read.format("json").load("file/people.json")
    println("Displays the content of the DataFrame to stdout")
    df.show()
    println("Print the schema in a tree format")
    df.printSchema()
    println("Select only the name column")
    df.select("name").show()
    // This import is needed to use the $-notation
    import spark.implicits._
    println("Select everybody, but increment the age by 1")
    df.select($"name", $"age" + 1).show()
    println("Select people older than 21")
    df.filter($"age" > 21).show()
    spark.stop()
  }
}

Explanation

Source file contents

[hadoop@hadoop-01 file]# cat people.json 
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Corresponding output

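df.show()
- Displays the records currently held in the DataFrame
- According to the DataFrame source, show() called with no arguments prints only the first 20 records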

Displays the content of the DataFrame to stdout
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
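
The 20-record default of show() mentioned above can be overridden through its overloads. The following is a minimal sketch, assuming the same Spark SQL dependency as the demo; the object name ShowVariantsDemo and the in-memory sample data are hypothetical and only serve to produce more than 20 rows.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ShowVariantsDemo {
  // Hypothetical sample data: 25 rows, each with a string longer than 20 characters
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    (1 to 25).map(i => (i, s"a-fairly-long-text-value-$i")).toDF("id", "text")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("ShowVariantsDemo")
      .getOrCreate()
    val df = sample(spark)

    df.show()          // default: at most 20 rows, long strings truncated
    df.show(5)         // at most 5 rows
    df.show(25, false) // up to 25 rows, cell contents not truncated

    spark.stop()
  }
}
```

Passing false as the second argument is useful when the truncated cells hide the part of the value you want to inspect.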

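df.printSchema()
- Prints the DataFrame's schema as a tree
- The DataFrame infers the column types from the JSON records on its own; for example, age is inferred as long

df.select("name").show()
- Displays only the column with the given name
- The column name is passed as a String
- API source: select(col: String, cols: String*): DataFrame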

Print the schema in a tree format
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
Select only the name column
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
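
If the inferred types are not what you want (age is inferred as long above), a schema can be declared up front instead. This is a sketch under assumptions: the object name ExplicitSchemaDemo and the choice of IntegerType for age are illustrative, and the path is the one used in the demo.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object ExplicitSchemaDemo {
  // Declared schema: age as integer rather than the inferred long (illustrative choice)
  val peopleSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  // Reads the JSON file with the declared schema, skipping type inference
  def load(spark: SparkSession, path: String): DataFrame =
    spark.read.schema(peopleSchema).format("json").load(path)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("ExplicitSchemaDemo")
      .getOrCreate()
    val df = load(spark, "file/people.json")
    df.printSchema() // age should now be reported as integer
    spark.stop()
  }
}
```

Declaring the schema also avoids the extra pass over the data that inference needs, which may matter for large inputs.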

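import spark.implicits._
- Brings the implicit conversions into scope so that the $"..." notation produces Column values

df.select($"name", $"age" + 1).show()
- The arguments are of type Column
- API source: select(cols: Column*): DataFrame

df.filter($"age" > 21).show()
- Filters rows using a Column expression as the condition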

Select everybody, but increment the age by 1
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+
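
The computed column above gets the generated header (age + 1). A small sketch of naming it explicitly, and of building Column values with functions.col instead of the $-notation, follows. The object name ColumnSelectDemo, the helper withNextAge, the alias age_plus_one, and the in-memory rows (which mirror people.json) are all assumptions made for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnSelectDemo {
  // In-memory stand-in for people.json (hypothetical helper)
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, Option[Long])](
      ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
    ).toDF("name", "age")
  }

  // Column-based select; alias replaces the generated "(age + 1)" header
  def withNextAge(df: DataFrame): DataFrame =
    df.select(col("name"), (col("age") + 1).alias("age_plus_one"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("ColumnSelectDemo")
      .getOrCreate()
    val df = people(spark)
    df.select("name", "age").show() // String overload: plain column names only
    withNextAge(df).show()          // Column overload: expressions are allowed
    spark.stop()
  }
}
```

The String overload cannot express $"age" + 1, which is why the Column overload is the one used for computed values.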
Select people older than 21
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
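
Note that Michael, whose age is null, does not appear in the filtered output: comparing null with 21 does not evaluate to true, so the row is dropped. The sketch below makes that behavior explicit with isNull; the object name NullFilterDemo and the in-memory rows (mirroring people.json) are hypothetical.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NullFilterDemo {
  // In-memory stand-in for people.json (hypothetical helper)
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, Option[Long])](
      ("Michael", None), ("Andy", Some(30L)), ("Justin", Some(19L))
    ).toDF("name", "age")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("NullFilterDemo")
      .getOrCreate()
    import spark.implicits._
    val df = people(spark)

    df.filter($"age" > 21).show()                   // Andy only
    df.filter($"age".isNull).show()                 // Michael only
    df.filter($"age".isNull || $"age" > 21).show()  // Michael and Andy

    spark.stop()
  }
}
```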