Spark SQL DataFrame Operations

Code

import org.apache.spark.sql.SparkSession

object DataFrameDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("DataFrameDemo")
      .getOrCreate()

    val df = spark.read.format("json").load("file/people.json")

    println("Displays the content of the DataFrame to stdout")
    df.show()

    println("Print the schema in a tree format")
    df.printSchema()

    println("Select only the name column")
    df.select("name").show()

    // This import is needed to use the $-notation
    import spark.implicits._

    println("Select everybody, but increment the age by 1")
    df.select($"name", $"age" + 1).show()

    println("Select people older than 21")
    df.filter($"age" > 21).show()

    spark.stop()
  }
}

Explanation

Source file contents

[hadoop@hadoop-01 file]# cat people.json
{"name":"Michael"}
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}

Corresponding output

  • df.show()
    • Displays the records in the current DataFrame
    • According to the DataFrame source code, show() displays only the first 20 records by default
Displays the content of the DataFrame to stdout
+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+
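
The 20-record default of show() can be overridden through its overloads. The following is a minimal sketch, assuming a hypothetical 25-row DataFrame built in memory; the names ShowDemo and sample and the generated data are illustrative and not part of the original example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object ShowDemo {
  // Hypothetical helper: 25 rows, so the default 20-row limit becomes visible
  def sample(spark: SparkSession): DataFrame = {
    import spark.implicits._
    (1 to 25).map(i => (i, s"person-with-a-rather-long-name-$i")).toDF("id", "name")
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("ShowDemo")
      .getOrCreate()

    val df = sample(spark)

    df.show()          // first 20 rows; strings longer than 20 characters are truncated
    df.show(5)         // first 5 rows only
    df.show(25, false) // up to 25 rows, with truncation disabled

    spark.stop()
  }
}
```

The show(numRows, truncate) form is handy when a long column value would otherwise be cut off in the console output.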
  • df.printSchema()
    • Prints the DataFrame's schema as a tree
    • Spark infers the column types from the JSON records on its own; for example, age is inferred as long
  • df.select("name").show()
    • Displays the records for the specified column name
    • The column name is passed as a String
    • API source: select(col: String, cols: String*): DataFrame
Print the schema in a tree format
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
Select only the name column
+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+
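
When the inferred types are not what is wanted, a schema can be supplied explicitly instead of relying on inference. The sketch below is one possible way to do this and assumes the same three records, held in memory as JSON strings rather than read from file/people.json. The names SchemaDemo and peopleSchema are hypothetical, and declaring age as an integer is only an illustrative choice.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

object SchemaDemo {
  // Hypothetical explicit schema: age as integer instead of the inferred long
  val peopleSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("age", IntegerType, nullable = true)
  ))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("SchemaDemo")
      .getOrCreate()
    import spark.implicits._

    // Same records as people.json, kept in memory so the sketch is self-contained
    val jsonLines = Seq(
      """{"name":"Michael"}""",
      """{"name":"Andy", "age":30}""",
      """{"name":"Justin", "age":19}"""
    ).toDS()

    // Schema inference: age comes back as long
    spark.read.json(jsonLines).printSchema()

    // Explicit schema: the declared types and column order are used instead
    val typed = spark.read.schema(peopleSchema).json(jsonLines)
    typed.printSchema()
    typed.select("name").show()

    spark.stop()
  }
}
```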
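  • import spark.implicits._
    • Imports the implicit conversions so that the $-notation can be used to build Column values
  • df.select($"name", $"age" + 1).show()
    • The arguments are of type Column
    • API source: select(cols: Column*): DataFrame
  • df.filter($"age" > 21).show()
    • The filter condition is expressed through a Column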
Select everybody, but increment the age by 1
+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+
Select people older than 21
+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+
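
Michael is missing from the filtered output because his age is null: comparing null with 21 does not evaluate to true, so the row is dropped. The following sketch shows the same Column expressions written with col() instead of the $-notation, plus one way to keep rows whose age is unknown. The names ColumnDemo, people, and olderThan and the alias age_plus_one are hypothetical and used only for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object ColumnDemo {
  // Hypothetical helper: the three records from people.json, built in memory
  def people(spark: SparkSession): DataFrame = {
    import spark.implicits._
    Seq[(String, Option[Long])](
      ("Michael", None),
      ("Andy", Some(30L)),
      ("Justin", Some(19L))
    ).toDF("name", "age")
  }

  // Hypothetical helper: same condition as df.filter($"age" > 21), written with col()
  def olderThan(df: DataFrame, minAge: Int): DataFrame =
    df.filter(col("age") > minAge)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder().master("local[2]")
      .appName("ColumnDemo")
      .getOrCreate()

    val df = people(spark)

    // alias() gives the computed column a readable name instead of "(age + 1)"
    df.select(col("name"), (col("age") + 1).alias("age_plus_one")).show()

    // Rows with a null age are dropped, because null > 21 is not true
    olderThan(df, 21).show()

    // Keep rows with an unknown age as well by testing for null explicitly
    df.filter(col("age").isNull || col("age") > 21).show()

    spark.stop()
  }
}
```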