Hadoop comes with a distributed filesystem called HDFS, which stands for Hadoop Distributed File System. Although Hadoop supports many other filesystems (e.g., Amazon S3), HDFS is the most popular choice and will be used throughout this bootcamp. Therefore, in this section, you will learn how to move data between your local filesystem and HDFS.
Hadoop provides a command-line utility, `hdfs`, to interact with HDFS. Basic operations are placed under the `hdfs dfs` subcommand. Let's play with some basic operations.
When you use HDFS for the first time, it's likely that your home directory in HDFS has not been created yet. By default, your home directory in HDFS is `/user/<username>/`. In the environment that we provide, there is a special user, `hdfs`, who is an HDFS administrator and has permission to create new home directories.
You will first need to switch to the `hdfs` user.
Then, you can create the directory and change the ownership of the newly created folder to your own user.
Please remember to replace `<username>` with your actual Linux user name (e.g., user2). Finally, switch back to your own user.
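The steps above can be sketched as a single shell function. This is only a sketch under assumptions: it uses `sudo -u hdfs` to run each command as the `hdfs` administrator instead of switching users interactively, which presumes your account has sudo rights, and `create_hdfs_home` is a hypothetical helper name. The `-p` flag is added so that the command does not fail if the directory already exists.

```shell
# Hypothetical helper: create an HDFS home directory for a Linux user.
# Assumption: the current account may run commands as the `hdfs` user via sudo.
create_hdfs_home() {
  # $1: your Linux user name, for example user2
  sudo -u hdfs hdfs dfs -mkdir -p "/user/$1" &&
    sudo -u hdfs hdfs dfs -chown "$1" "/user/$1"
}

# Usage (nothing runs until the function is called):
#   create_hdfs_home user2
```

Chaining the two commands with `&&` means the ownership change is attempted only if the directory was created successfully.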
Similar to creating a local directory with the Linux command `mkdir`, you can create a folder named `input` in HDFS with `hdfs dfs -mkdir input`.
Here, `hdfs` is the HDFS utility program, `dfs` is the subcommand that handles basic HDFS operations, `-mkdir` means you want to create a directory, and the directory name is specified as `input`. This command creates the `input` directory inside your home directory in HDFS. Of course, you can create it in another place by giving an absolute or relative path.
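To make the path rule concrete, the following sketch mimics how a path passed to `hdfs dfs` is interpreted: absolute paths are used as given, while relative paths are resolved against your HDFS home directory. `hdfs_abs_path` is a hypothetical helper written for illustration only; it does not talk to HDFS, and it assumes the default home directory `/user/<username>/`.

```shell
# Hypothetical helper: show where a path given to `hdfs dfs` ends up.
# $1: user name, $2: path as you would type it after -mkdir, -ls, etc.
hdfs_abs_path() {
  case "$2" in
    /*) printf '%s\n' "$2" ;;                 # absolute path: used unchanged
    *)  printf '/user/%s/%s\n' "$1" "$2" ;;   # relative path: under the home directory
  esac
}

hdfs_abs_path user2 input        # prints /user/user2/input
hdfs_abs_path user2 /tmp/input   # prints /tmp/input
```

So for user2, `hdfs dfs -mkdir input` and `hdfs dfs -mkdir /user/user2/input` refer to the same directory.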
Suppose you followed the previous instructions and created a directory named `input`. You can then copy data from the local filesystem to HDFS using `-put`, for example to upload the two sample files `case.csv` and `control.csv`.
You can find a detailed description of these two files in sample data.
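As a sketch, uploading both sample files could look like the function below. `upload_samples` is a hypothetical helper, and the local directory that holds `case.csv` and `control.csv` is passed as an argument because its location depends on your environment.

```shell
# Hypothetical helper: copy the two sample files into the HDFS input directory.
# $1: local directory that contains case.csv and control.csv (an assumption)
upload_samples() {
  hdfs dfs -put "$1/case.csv" input &&
    hdfs dfs -put "$1/control.csv" input
}

# Usage (nothing runs until the function is called):
#   upload_samples /path/to/your/sample/data
```

Note that `-put` refuses to overwrite a file that already exists in HDFS; adding the `-f` option forces the overwrite.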
The `-get` operation copies data out of HDFS to the local filesystem. For example, `hdfs dfs -get input/case.csv local_case.csv` will copy the `input/case.csv` file out of HDFS into the current working directory under the new name `local_case.csv`. If you don't specify `local_case.csv`, the original name `case.csv` will be kept. You can verify your copy with the `-cat` operation described below.
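The naming rule can be sketched with a small wrapper. `fetch_from_hdfs` is a hypothetical helper: with two arguments it copies the file under the given local name, and with one argument it lets `-get` keep the original file name.

```shell
# Hypothetical helper around `hdfs dfs -get`.
# $1: source path in HDFS, $2: optional local file name
fetch_from_hdfs() {
  if [ "$#" -ge 2 ]; then
    hdfs dfs -get "$1" "$2"   # local copy is named "$2"
  else
    hdfs dfs -get "$1"        # local copy keeps the original name
  fi
}

# Usage (nothing runs until the function is called):
#   fetch_from_hdfs input/case.csv local_case.csv   # creates ./local_case.csv
#   fetch_from_hdfs input/case.csv                  # creates ./case.csv
```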
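Just like the Linux `ls` command, `-ls` is the operation that lists files and folders in HDFS. For example, running `hdfs dfs -ls` without a path lists the items in your home directory in HDFS (i.e., `/user/<username>/`).
You should see that the newly created `input` directory is listed. You can also list the files inside a particular directory by passing its path, for example `hdfs dfs -ls input`.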
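Both listings can be sketched together as follows. `list_home_and_input` is a hypothetical helper name, and the sketch assumes the `input` directory from the earlier steps exists.

```shell
# Hypothetical helper: list the HDFS home directory, then the input directory.
list_home_and_input() {
  # No path: lists your HDFS home directory, /user/<username>/
  hdfs dfs -ls &&
    # Relative path: lists the files inside /user/<username>/input
    hdfs dfs -ls input
}

# Usage (nothing runs until the function is called):
#   list_home_and_input
```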
Actually, you don't need to copy files to the local filesystem in order to see their content; you can use `-cat` directly to print the content of files in HDFS. For example, `hdfs dfs -cat` followed by the path of one of the files you just put into HDFS prints the content of that file.
You will find the wildcard character very useful, since the output of MapReduce and other Hadoop-based tools tends to be a directory. For example, to print the content of all csv files (`case.csv` and `control.csv`) in the `input` HDFS folder, you can pass a pattern such as `input/*.csv` to `-cat`.
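A minimal sketch of the wildcard usage is shown below. `print_input_csv` is a hypothetical helper name. Quoting the pattern is a defensive choice: it prevents your local shell from expanding `*` against local files, so that the pattern reaches HDFS unchanged.

```shell
# Hypothetical helper: print every csv file in the HDFS input directory.
print_input_csv() {
  # Single quotes pass the pattern to hdfs literally; HDFS then matches it
  # against the files stored in the input directory.
  hdfs dfs -cat 'input/*.csv'
}

# Usage (nothing runs until the function is called):
#   print_input_csv
```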
For more detailed usage of the different commands and parameters, you can consult the built-in help of `hdfs dfs`.
If you omit the `-r` option when removing a directory, you will get an error. `-r` tells HDFS to remove recursively, similar to the Linux command `rm -r`.
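A minimal sketch of the removal step, assuming the directory being removed is the `input` directory created earlier (the text does not name it explicitly). `remove_input_dir` is a hypothetical helper name.

```shell
# Hypothetical helper: delete the HDFS input directory and everything in it.
remove_input_dir() {
  # Without -r, -rm refuses to delete a directory.
  hdfs dfs -rm -r input
}

# Usage (nothing runs until the function is called):
#   remove_input_dir
```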