Hadoop - IO (Hadoop Tutorial)

Input to the Mapper, as files are read from HDFS.
Output from the Mapper, which is spilled to local disk.
Network I/O between the Reducers and Mappers, as the Reducers retrieve files from the Mapper nodes.
Merge to local disk on the Reducer node, as the partitions received from the Mapper nodes are merged into fully sorted order.
Reading back from local disk, as records are made available to the reduce method on the Reducer instance.
Output from the Reducer, which is written back to HDFS.
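
The merge step on the Reducer node can be pictured as a k-way merge of segments that are each already sorted. The following is a minimal sketch in plain Java, not Hadoop's actual merger: the class name SegmentMerge and the use of in-memory string lists in place of on-disk segments are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

public class SegmentMerge {
    // One cursor per sorted segment: the current head key plus the rest of the segment.
    private static final class Cursor {
        String head;
        final Iterator<String> rest;

        Cursor(Iterator<String> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /** Merges segments that are each already sorted into one sorted list. */
    public static List<String> merge(List<List<String>> segments) {
        // The heap always exposes the cursor whose head key is smallest.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.compareTo(b.head));
        for (List<String> segment : segments) {
            if (!segment.isEmpty()) {
                heap.add(new Cursor(segment.iterator()));
            }
        }
        List<String> out = new ArrayList<>();
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            out.add(c.head);
            if (c.rest.hasNext()) {
                c.head = c.rest.next();
                heap.add(c); // re-insert with its new head key
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical sorted segments, standing in for map outputs fetched by one Reducer.
        List<List<String>> segments = Arrays.asList(
            Arrays.asList("apple", "pear"),
            Arrays.asList("banana", "zebra"),
            Arrays.asList("cherry"));
        System.out.println(merge(segments)); // [apple, banana, cherry, pear, zebra]
    }
}
```

Because every segment arrives sorted, the merge only compares the head of each segment, which is why this stage is dominated by disk and network I/O rather than by sorting work.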

Serialization

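Compression reduces disk space usage and the volume of data sent over the network.
Criteria for choosing a codec: compressed size, speed, and whether the output is splittable.
Common codecs: gzip, bzip2, LZO, LZ4, Snappy.
Compare the compression ratio and performance of the candidate algorithms before choosing one.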
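
A small measurement harness makes the size-versus-speed comparison concrete. The sketch below is an assumption-laden stand-in: it uses the JDK's java.util.zip.Deflater at two compression levels instead of the Hadoop codecs listed above (LZO, LZ4 and Snappy are not part of the JDK), and the class name CompressionTradeoff and the sample data are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.Deflater;

public class CompressionTradeoff {
    /** Compresses data with DEFLATE at the given level and returns the compressed bytes. */
    public static byte[] deflate(byte[] data, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end(); // release native resources
        return out.toByteArray();
    }

    /** Builds repetitive, log-like sample text (hypothetical test data). */
    public static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("2024-01-01 INFO request id=").append(i).append(" status=200\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] data = sampleData(100_000);
        for (int level : new int[] {Deflater.BEST_SPEED, Deflater.BEST_COMPRESSION}) {
            long start = System.nanoTime();
            byte[] packed = deflate(data, level);
            long micros = (System.nanoTime() - start) / 1_000;
            // Ratio below 1.0 means the output is smaller than the input.
            System.out.printf("level=%d ratio=%.3f time=%dus%n",
                level, (double) packed.length / data.length, micros);
        }
    }
}
```

Timings vary by machine and input, so the numbers are only meaningful when measured on data that resembles the real workload.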
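Key point: compression and splitting generally conflict. The blocks of a compressed file usually cannot be processed independently, because a split point often lands in the middle of the compressed stream, where a Map task has no way to start decoding. For such formats (gzip, for example), Hadoop typically uses a single Map task to process the entire file.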
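
The splitting problem can be reproduced outside Hadoop. The sketch below uses only the JDK's gzip classes and tries to start decoding a gzip stream from its midpoint, which is roughly the position a second Map task would be handed. The class name GzipSplitDemo and the sample data are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

public class GzipSplitDemo {
    /** Compresses data into a single gzip stream. */
    public static byte[] gzip(byte[] data) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    /** Returns true if a gzip reader that starts at the given offset can decode to the end. */
    public static boolean canDecodeFrom(byte[] gzBytes, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(gzBytes, offset, gzBytes.length - offset))) {
            byte[] buf = new byte[4096];
            while (in.read(buf) != -1) {
                // Drain the stream; only success or failure matters here.
            }
            return true;
        } catch (IOException e) {
            // Missing gzip header or corrupt deflate data: decoding cannot start here.
            return false;
        }
    }

    /** Builds repetitive sample text (hypothetical test data). */
    public static byte[] sampleData(int lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines; i++) {
            sb.append("record ").append(i).append(" some payload text\n");
        }
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] gz = gzip(sampleData(10_000));
        System.out.println("from start:    " + canDecodeFrom(gz, 0));
        System.out.println("from midpoint: " + canDecodeFrom(gz, gz.length / 2));
    }
}
```

Only the read from offset 0 is expected to succeed: a reader dropped into the middle finds neither the gzip header nor the earlier data that the compressed bytes refer back to.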
Map output can also be compressed, which reduces the amount of data transferred from the Map tasks to the Reducers and speeds up the transfer.
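
Map output compression is switched on through job configuration. The fragment below is a sketch rather than a complete job. It assumes Hadoop 2 or later on the classpath (where the property names are mapreduce.map.output.compress and mapreduce.map.output.compress.codec) and a cluster on which the Snappy codec is usable. The class name and job name are hypothetical.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;

public class MapOutputCompressionConfig {
    /** Creates a Job whose intermediate (map) output is compressed. */
    public static Job newJob() throws IOException {
        Configuration conf = new Configuration();
        // Compress map output before it is spilled and shuffled to the Reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        // Codec choice is an assumption; any installed CompressionCodec can be used.
        conf.setClass("mapreduce.map.output.compress.codec",
            SnappyCodec.class, CompressionCodec.class);
        return Job.getInstance(conf, "map-output-compression-example");
    }
}
```

Map output is consumed only inside the job, so splittability does not matter here and a fast codec is usually a reasonable choice.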

Next section: Hadoop Distributed File System (HDFS)