Code and Data
The book’s example code is available from GitHub at http://github.com/tomwhite/hadoop-book/.
The code for the third edition is at https://github.com/tomwhite/hadoop-book/tree/3e.
A sample of the NCDC weather dataset that is used throughout the book can be found at https://github.com/tomwhite/hadoop-book/tree/master/input/ncdc/all.
The full dataset is stored on Amazon S3 in the hadoopbook bucket. If you have an AWS account, you can copy it to an EC2-based Hadoop cluster using Hadoop’s distcp command (run from a machine in the cluster):
hadoop distcp \
  -Dfs.s3n.awsAccessKeyId='...' \
  -Dfs.s3n.awsSecretAccessKey='...' \
  s3n://hadoopbook/ncdc/all input/ncdc/all
It may be convenient to use Apache Whirr to start a Hadoop cluster on EC2 for this purpose.
Note that the Hadoop cluster has to be running in the US East (Northern Virginia) EC2 Region since access to this S3 bucket is restricted to this region to avoid data transfer fees. (Of course, you are free to copy the data from your EC2 cluster to another cluster in another EC2 region, or outside EC2 entirely, although that will incur standard AWS transfer fees.)
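The distcp invocation above takes the AWS keys as literal arguments. As a minimal sketch, the same command can be wrapped in a shell function that reads the keys from the environment instead. The function name copy_ncdc, the HADOOP_CMD override, and the use of the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY variables are assumptions made for illustration; they are not part of Hadoop or of the book's code.

```shell
# Hypothetical helper: wraps the distcp command shown above.
# Assumes the keys are held in AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY.
copy_ncdc() {
  # Refuse to run unless both credentials are set and non-empty.
  if [ -z "${AWS_ACCESS_KEY_ID:-}" ] || [ -z "${AWS_SECRET_ACCESS_KEY:-}" ]; then
    echo "AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY must be set" >&2
    return 1
  fi
  # HADOOP_CMD (an assumed override) defaults to "hadoop";
  # setting it to "echo" prints the arguments instead of running the copy.
  "${HADOOP_CMD:-hadoop}" distcp \
    -Dfs.s3n.awsAccessKeyId="$AWS_ACCESS_KEY_ID" \
    -Dfs.s3n.awsSecretAccessKey="$AWS_SECRET_ACCESS_KEY" \
    s3n://hadoopbook/ncdc/all input/ncdc/all
}
```

With both variables set, running copy_ncdc on a machine in the cluster issues the same copy as the command above. Running it with HADOOP_CMD=echo gives a dry run that prints the arguments, keys included, so use it with care on shared terminals.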