Compare Alluxio with HDFS


Purpose

  1. On Hadoop, run the same job against the file in Alluxio and the file in HDFS, and compare the amount of data read and the execution time.
  2. On Spark, run the same job against the file in Alluxio and the file in HDFS, and compare the amount of data read and the execution time.

Prepare for Testing File

# The test file is approximately 18.2 MB
[hadoop@testmain ~]# ll page_views.dat
-rwxr-xr-x 1 root root 19014993 Nov 5 00:21 page_views.dat
# Upload the file to Alluxio
[hadoop@testmain ~]# alluxio fs mkdir /wordcount/input/
Successfully created directory /wordcount/input/
[root@testmain ~]# alluxio fs copyFromLocal page_views.dat /wordcount/input/
# Upload the file to HDFS
[hadoop@testmain ~]$ hdfs dfs -put page_views.dat /wordcount/input/
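
Before running the jobs it is worth confirming that both stores see the same file. The snippet below is a minimal sketch (not part of the original test) using the Hadoop FileSystem API from Scala; it assumes the alluxio-1.6.0-hadoop-client.jar is on the classpath, and the fs.alluxio.impl property may need to be set explicitly if the alluxio:// scheme is not already registered in core-site.xml.

    // Sketch only, not from the original post: verify the upload from code via the
    // Hadoop FileSystem API. Assumes the alluxio-1.6.0-hadoop-client.jar is on the
    // classpath; the fs.alluxio.impl property may already be set in core-site.xml.
    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    conf.set("fs.alluxio.impl", "alluxio.hadoop.FileSystem") // assumption: not yet registered

    val file = "/wordcount/input/page_views.dat"
    for (base <- Seq("alluxio://192.168.128.91:19998", "hdfs://192.168.128.91:9000")) {
      val fs = FileSystem.get(URI.create(base), conf)
      val len = fs.getFileStatus(new Path(base + file)).getLen
      println(s"$base$file -> $len bytes") // both should report 19014993 bytes
    }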

Hadoop Testing

  • Run the wordcount job from the bundled Hadoop examples jar, reading the input file from Alluxio and then from HDFS.
    # Test against the file in Alluxio
    [hadoop@testmain ~]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount -libjars $ALLUXIO_HOME/client/hadoop/alluxio-1.6.0-hadoop-client.jar alluxio://localhost:19998/wordcount/input/page_views.dat alluxio://localhost:19998/wordcount/output
    # ...
    17/11/08 01:24:17 INFO mapreduce.Job: map 0% reduce 0%
    17/11/08 01:24:19 INFO mapred.LocalJobRunner:
    17/11/08 01:24:19 INFO mapred.MapTask: Starting flush of map output
    17/11/08 01:24:19 INFO mapred.MapTask: Spilling map output
    17/11/08 01:24:19 INFO mapred.MapTask: bufstart = 0; bufend = 22090342; bufvoid = 104857600
    17/11/08 01:24:19 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 23122784(92491136); length = 3091613/6553600
    17/11/08 01:24:21 INFO mapred.MapTask: Finished spill 0
    17/11/08 01:24:21 INFO mapred.Task: Task:attempt_local1066655942_0001_m_000000_0 is done. And is in the process of committing
    17/11/08 01:24:21 INFO mapred.LocalJobRunner: map
    17/11/08 01:24:21 INFO mapred.Task: Task 'attempt_local1066655942_0001_m_000000_0' done.
    17/11/08 01:24:21 INFO mapred.LocalJobRunner: Finishing task: attempt_local1066655942_0001_m_000000_0
    17/11/08 01:24:21 INFO mapred.LocalJobRunner: map task executor complete.
    17/11/08 01:24:21 INFO mapred.LocalJobRunner: Waiting for reduce tasks
    17/11/08 01:24:21 INFO mapred.LocalJobRunner: Starting task: attempt_local1066655942_0001_r_000000_0
    17/11/08 01:24:21 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/11/08 01:24:21 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    17/11/08 01:24:21 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
    17/11/08 01:24:21 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@7df3efe2
    17/11/08 01:24:21 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10
    17/11/08 01:24:21 INFO reduce.EventFetcher: attempt_local1066655942_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
    17/11/08 01:24:21 INFO mapreduce.Job: map 100% reduce 0%
    17/11/08 01:24:21 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local1066655942_0001_m_000000_0 decomp: 7915818 len: 7915822 to MEMORY
    17/11/08 01:24:21 INFO reduce.InMemoryMapOutput: Read 7915818 bytes from map-output for attempt_local1066655942_0001_m_000000_0
    17/11/08 01:24:21 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 7915818, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->7915818
    17/11/08 01:24:21 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
    17/11/08 01:24:21 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/08 01:24:21 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
    17/11/08 01:24:21 INFO mapred.Merger: Merging 1 sorted segments
    17/11/08 01:24:21 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7915768 bytes
    17/11/08 01:24:22 INFO reduce.MergeManagerImpl: Merged 1 segments, 7915818 bytes to disk to satisfy reduce memory limit
    17/11/08 01:24:22 INFO reduce.MergeManagerImpl: Merging 1 files, 7915822 bytes from disk
    17/11/08 01:24:22 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
    17/11/08 01:24:22 INFO mapred.Merger: Merging 1 sorted segments
    17/11/08 01:24:22 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7915768 bytes
    17/11/08 01:24:22 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/08 01:24:22 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
    17/11/08 01:24:23 INFO mapred.Task: Task:attempt_local1066655942_0001_r_000000_0 is done. And is in the process of committing
    17/11/08 01:24:23 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/08 01:24:23 INFO mapred.Task: Task attempt_local1066655942_0001_r_000000_0 is allowed to commit now
    17/11/08 01:24:23 INFO output.FileOutputCommitter: Saved output of task 'attempt_local1066655942_0001_r_000000_0' to alluxio://localhost:19998/wordcount/output/_temporary/0/task_local1066655942_0001_r_000000
    17/11/08 01:24:23 INFO mapred.LocalJobRunner: reduce > reduce
    17/11/08 01:24:23 INFO mapred.Task: Task 'attempt_local1066655942_0001_r_000000_0' done.
    17/11/08 01:24:23 INFO mapred.LocalJobRunner: Finishing task: attempt_local1066655942_0001_r_000000_0
    17/11/08 01:24:23 INFO mapred.LocalJobRunner: reduce task executor complete.
    17/11/08 01:24:24 INFO mapreduce.Job: map 100% reduce 100%
    17/11/08 01:24:24 INFO mapreduce.Job: Job job_local1066655942_0001 completed successfully
    17/11/08 01:24:24 INFO mapreduce.Job: Counters: 40
    File System Counters
    ALLUXIO: Number of bytes read=38029986
    ALLUXIO: Number of bytes written=7281357
    ALLUXIO: Number of read operations=13
    ALLUXIO: Number of large read operations=0
    ALLUXIO: Number of write operations=4
    FILE: Number of bytes read=52010784
    FILE: Number of bytes written=60859072
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=0
    HDFS: Number of bytes written=0
    HDFS: Number of read operations=0
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=0
    Map-Reduce Framework
    Map input records=100000
    Map output records=772904
    Map output bytes=22090342
    Map output materialized bytes=7915822
    Input split bytes=121
    Combine input records=772904
    Combine output records=157240
    Reduce input groups=157240
    Reduce shuffle bytes=7915822
    Reduce input records=157240
    Reduce output records=157240
    Spilled Records=314480
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=85
    Total committed heap usage (bytes)=331489280
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=19014993
    File Output Format Counters
    Bytes Written=7281357
    17/11/08 01:24:24 INFO connection.NettyChannelPool: Channel closed
    # Test against the file in HDFS
    [hadoop@testmain ~]$ hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar wordcount hdfs://192.168.128.91:9000/wordcount/input/page_views.dat hdfs://192.168.128.91:9000/wordcount/output
    # ...
    17/11/08 01:25:08 INFO mapreduce.Job: map 0% reduce 0%
    17/11/08 01:25:09 INFO mapred.LocalJobRunner:
    17/11/08 01:25:09 INFO mapred.MapTask: Starting flush of map output
    17/11/08 01:25:09 INFO mapred.MapTask: Spilling map output
    17/11/08 01:25:09 INFO mapred.MapTask: bufstart = 0; bufend = 22090342; bufvoid = 104857600
    17/11/08 01:25:09 INFO mapred.MapTask: kvstart = 26214396(104857584); kvend = 23122784(92491136); length = 3091613/6553600
    17/11/08 01:25:11 INFO mapred.MapTask: Finished spill 0
    17/11/08 01:25:11 INFO mapred.Task: Task:attempt_local556577714_0001_m_000000_0 is done. And is in the process of committing
    17/11/08 01:25:11 INFO mapred.LocalJobRunner: map
    17/11/08 01:25:11 INFO mapred.Task: Task 'attempt_local556577714_0001_m_000000_0' done.
    17/11/08 01:25:11 INFO mapred.LocalJobRunner: Finishing task: attempt_local556577714_0001_m_000000_0
    17/11/08 01:25:11 INFO mapred.LocalJobRunner: map task executor complete.
    17/11/08 01:25:11 INFO mapred.LocalJobRunner: Waiting for reduce tasks
    17/11/08 01:25:11 INFO mapred.LocalJobRunner: Starting task: attempt_local556577714_0001_r_000000_0
    17/11/08 01:25:11 INFO output.FileOutputCommitter: File Output Committer Algorithm version is 1
    17/11/08 01:25:11 INFO output.FileOutputCommitter: FileOutputCommitter skip cleanup _temporary folders under output directory:false, ignore cleanup failures: false
    17/11/08 01:25:11 INFO mapred.Task: Using ResourceCalculatorProcessTree : [ ]
    17/11/08 01:25:11 INFO mapred.ReduceTask: Using ShuffleConsumerPlugin: org.apache.hadoop.mapreduce.task.reduce.Shuffle@38c32694
    17/11/08 01:25:11 INFO reduce.MergeManagerImpl: MergerManager: memoryLimit=363285696, maxSingleShuffleLimit=90821424, mergeThreshold=239768576, ioSortFactor=10, memToMemMergeOutputsThreshold=10
    17/11/08 01:25:11 INFO reduce.EventFetcher: attempt_local556577714_0001_r_000000_0 Thread started: EventFetcher for fetching Map Completion Events
    17/11/08 01:25:11 INFO reduce.LocalFetcher: localfetcher#1 about to shuffle output of map attempt_local556577714_0001_m_000000_0 decomp: 7915818 len: 7915822 to MEMORY
    17/11/08 01:25:11 INFO reduce.InMemoryMapOutput: Read 7915818 bytes from map-output for attempt_local556577714_0001_m_000000_0
    17/11/08 01:25:11 INFO reduce.MergeManagerImpl: closeInMemoryFile -> map-output of size: 7915818, inMemoryMapOutputs.size() -> 1, commitMemory -> 0, usedMemory ->7915818
    17/11/08 01:25:11 INFO reduce.EventFetcher: EventFetcher is interrupted.. Returning
    17/11/08 01:25:11 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/08 01:25:11 INFO reduce.MergeManagerImpl: finalMerge called with 1 in-memory map-outputs and 0 on-disk map-outputs
    17/11/08 01:25:11 INFO mapred.Merger: Merging 1 sorted segments
    17/11/08 01:25:11 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7915768 bytes
    17/11/08 01:25:11 INFO reduce.MergeManagerImpl: Merged 1 segments, 7915818 bytes to disk to satisfy reduce memory limit
    17/11/08 01:25:11 INFO reduce.MergeManagerImpl: Merging 1 files, 7915822 bytes from disk
    17/11/08 01:25:11 INFO reduce.MergeManagerImpl: Merging 0 segments, 0 bytes from memory into reduce
    17/11/08 01:25:11 INFO mapred.Merger: Merging 1 sorted segments
    17/11/08 01:25:11 INFO mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 7915768 bytes
    17/11/08 01:25:11 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/08 01:25:11 INFO Configuration.deprecation: mapred.skip.on is deprecated. Instead, use mapreduce.job.skiprecords
    17/11/08 01:25:11 INFO mapreduce.Job: map 100% reduce 0%
    17/11/08 01:25:12 INFO mapred.Task: Task:attempt_local556577714_0001_r_000000_0 is done. And is in the process of committing
    17/11/08 01:25:12 INFO mapred.LocalJobRunner: 1 / 1 copied.
    17/11/08 01:25:12 INFO mapred.Task: Task attempt_local556577714_0001_r_000000_0 is allowed to commit now
    17/11/08 01:25:12 INFO output.FileOutputCommitter: Saved output of task 'attempt_local556577714_0001_r_000000_0' to hdfs://192.168.128.91:9000/wordcount/output/_temporary/0/task_local556577714_0001_r_000000
    17/11/08 01:25:12 INFO mapred.LocalJobRunner: reduce > reduce
    17/11/08 01:25:12 INFO mapred.Task: Task 'attempt_local556577714_0001_r_000000_0' done.
    17/11/08 01:25:12 INFO mapred.LocalJobRunner: Finishing task: attempt_local556577714_0001_r_000000_0
    17/11/08 01:25:12 INFO mapred.LocalJobRunner: reduce task executor complete.
    17/11/08 01:25:12 INFO mapreduce.Job: map 100% reduce 100%
    17/11/08 01:25:12 INFO mapreduce.Job: Job job_local556577714_0001 completed successfully
    17/11/08 01:25:12 INFO mapreduce.Job: Counters: 35
    File System Counters
    FILE: Number of bytes read=16435224
    FILE: Number of bytes written=24996682
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=38029986
    HDFS: Number of bytes written=7281357
    HDFS: Number of read operations=13
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=4
    Map-Reduce Framework
    Map input records=100000
    Map output records=772904
    Map output bytes=22090342
    Map output materialized bytes=7915822
    Input split bytes=122
    Combine input records=772904
    Combine output records=157240
    Reduce input groups=157240
    Reduce shuffle bytes=7915822
    Reduce input records=157240
    Reduce output records=157240
    Spilled Records=314480
    Shuffled Maps =1
    Failed Shuffles=0
    Merged Map outputs=1
    GC time elapsed (ms)=69
    Total committed heap usage (bytes)=331489280
    Shuffle Errors
    BAD_ID=0
    CONNECTION=0
    IO_ERROR=0
    WRONG_LENGTH=0
    WRONG_MAP=0
    WRONG_REDUCE=0
    File Input Format Counters
    Bytes Read=19014993
    File Output Format Counters
    Bytes Written=7281357
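
The counters above already contain the comparison that matters here: both runs read the same 19014993 input bytes and 38029986 filesystem bytes, reported under the ALLUXIO counters in the first run and under the HDFS counters in the second. If each hadoop jar command's output (typically stderr) is captured to its own log file, the relevant lines can be diffed mechanically; the following Scala sketch is not part of the original post and the log file names are hypothetical.

    // Sketch only: pull the byte counters out of two captured runs for a side-by-side view.
    // The log file names are hypothetical; redirect each run's output to a file beforehand.
    import scala.io.Source

    // Matches counter lines such as "ALLUXIO: Number of bytes read=38029986".
    val Counter = """(ALLUXIO|HDFS|FILE): Number of bytes (read|written)=(\d+)""".r

    def extract(path: String): Map[String, Long] =
      Source.fromFile(path).getLines().flatMap { line =>
        Counter.findFirstMatchIn(line).map(m => s"${m.group(1)} bytes ${m.group(2)}" -> m.group(3).toLong)
      }.toMap

    val alluxioRun = extract("wordcount_alluxio.log") // hypothetical capture of the first run
    val hdfsRun    = extract("wordcount_hdfs.log")    // hypothetical capture of the second run
    (alluxioRun.keySet ++ hdfsRun.keySet).toSeq.sorted.foreach { key =>
      println(f"$key%-22s alluxio-run=${alluxioRun.getOrElse(key, 0L)}%12d hdfs-run=${hdfsRun.getOrElse(key, 0L)}%12d")
    }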

Spark Testing

  • Run wordcount from the Spark shell, reading the file from Alluxio and then from HDFS (a timed variant of the same job is sketched after the results below).

    # Remove any previous output directories in Alluxio and HDFS so that this step does not fail
    [hadoop@testmain ~]$ alluxio fs rm -R /wordcount/output
    /wordcount/output has been removed
    [hadoop@testmain ~]$ hdfs dfs -rm -r /wordcount/output
    Deleted /wordcount/output
    # Start the Spark shell
    [hadoop@testmain ~]$ spark-shell --master local[2]
    Spark context Web UI available at http://192.168.128.91:4040
    # ...
    # Test against the file in Alluxio
    scala> val textFile = sc.textFile("alluxio://192.168.128.91:19998/wordcount/input/page_views.dat")
    textFile: org.apache.spark.rdd.RDD[String] = alluxio://192.168.128.91:19998/wordcount/input/page_views.dat MapPartitionsRDD[1] at textFile at <console>:24
    scala> val counts = textFile.flatMap(line => line.split("\t")).map(word => (word, 1)).reduceByKey(_ + _)
    counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[4] at reduceByKey at <console>:26
    scala> counts.saveAsTextFile("alluxio://192.168.128.91:19998/wordcount/output/spark_alluxio.output")
    # Test against the file in HDFS
    scala> val textFile = sc.textFile("hdfs://192.168.128.91:9000/wordcount/input/page_views.dat")
    textFile: org.apache.spark.rdd.RDD[String] = hdfs://192.168.128.91:9000/wordcount/input/page_views.dat MapPartitionsRDD[7] at textFile at <console>:24
    scala> val counts = textFile.flatMap(line => line.split("\t")).map(word => (word, 1)).reduceByKey(_ + _)
    counts: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[10] at reduceByKey at <console>:26
    scala> counts.saveAsTextFile("hdfs://192.168.128.91:9000/wordcount/output/spark_alluxio.output")
  • Alluxio result

  • HDFS result
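
The durations reported in the summary below can also be checked from the shell itself by wrapping each action in a small timer. The sketch below is not from the original post; the timed helper is hypothetical, and count() is used only to force the full pipeline to run.

    // Sketch only: a tiny timing wrapper for the spark-shell; `timed` is a hypothetical helper.
    def timed[T](label: String)(body: => T): T = {
      val start = System.nanoTime()
      val result = body
      println(f"$label finished in ${(System.nanoTime() - start) / 1e9}%.1f s")
      result
    }

    // count() forces the whole pipeline so the read from each store is actually measured.
    timed("wordcount via alluxio://") {
      sc.textFile("alluxio://192.168.128.91:19998/wordcount/input/page_views.dat")
        .flatMap(_.split("\t")).map(word => (word, 1)).reduceByKey(_ + _).count()
    }
    timed("wordcount via hdfs://") {
      sc.textFile("hdfs://192.168.128.91:9000/wordcount/input/page_views.dat")
        .flatMap(_.split("\t")).map(word => (word, 1)).reduceByKey(_ + _).count()
    }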

Summary

  • For this small-file job, Alluxio and HDFS read the same amount of data, while HDFS came out ahead in execution time.

Platform   Input (Alluxio)   Input (HDFS)   Duration (Alluxio)   Duration (HDFS)
Hadoop     18.2 MB           18.2 MB        7 s                  4 s
Spark      18.2 MB           18.2 MB        12 s                 5 s