There are multiple ways of getting data out of HDFS onto a local machine that does not belong to the HDFS cluster. The right method depends on the needs of the transfer.
The simplest way of getting data out of HDFS onto a non-cluster machine is to use the commands built into the hadoop script, for example -copyToLocal. This copies files to the local file system one at a time, in sequence. It’s easy to use and fine for a small number of files, but moving many files this way takes time.
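As a rough sketch, a single-threaded copy could look like the following. The NameNode address, port, and both paths are placeholders I made up for illustration. The script only prints the command when bin/hadoop is not found in the current directory.

```shell
#!/bin/sh
# Placeholder values (assumptions): substitute your own NameNode URI and paths.
NAMENODE="hdfs://namenode.domain:9000"
SRC_PATH="/user/me/logs"
DEST_DIR="/tmp/logs"

# Full source URI: NameNode address followed by the absolute HDFS path.
SRC_URI="${NAMENODE}${SRC_PATH}"

# Run from the Hadoop install directory; otherwise just show the command.
if [ -x bin/hadoop ]; then
    bin/hadoop fs -copyToLocal "$SRC_URI" "$DEST_DIR"
else
    echo "would run: bin/hadoop fs -copyToLocal $SRC_URI $DEST_DIR"
fi
```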
A second method is to use distcp with a local jobtracker. This adds some complexity but allows more rsync-like operations than -copyToLocal alone. As with -copyToLocal, running distcp this way gives you a single thread, so there is no parallelism to the copy even though distcp is capable of it, and files are still moved over one by one. One thing to note is that the local file URL has three forward slashes: two that belong to the URL scheme and one for the root of the filesystem. All file URLs must be absolute paths starting from / (slash).
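A minimal sketch of this, assuming a Hadoop version where distcp accepts the generic -D option: setting mapred.job.tracker to local on the command line is one way to get the local job runner. The helper local_url, the NameNode address, and the paths are hypothetical. The helper just shows where the three slashes come from.

```shell
#!/bin/sh
# Turn an absolute local path into a file URL.
# "file://" plus a path that starts with "/" yields three slashes.
local_url() {
    case "$1" in
        /*) printf 'file://%s\n' "$1" ;;
        *)  echo "path must be absolute: $1" >&2; return 1 ;;
    esac
}

# Placeholder values (assumptions).
SRC_URI="hdfs://namenode.domain:9000/user/me/logs"
DEST_URL="$(local_url /tmp/logs)"

# -D mapred.job.tracker=local runs the copy with the local job runner.
# -update is the rsync-like part: it skips files that already exist at
# the destination with the same size.
if [ -x bin/hadoop ]; then
    bin/hadoop distcp -D mapred.job.tracker=local -update "$SRC_URI" "$DEST_URL"
else
    echo "would run: bin/hadoop distcp -D mapred.job.tracker=local -update $SRC_URI $DEST_URL"
fi
```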
The third and final method I’ll go over is to set up a pseudo-distributed system that uses the cluster’s HDFS for distcp. This method is a lot more involved but gives you all the features of distcp (rsync-like operations and parallelism) when copying files.
The first step is to copy the conf/hadoop-site.xml file from the HDFS cluster. This file should contain all the information needed to connect to the HDFS. You will have to add or change the value of mapred.job.tracker to point to the local machine, and add or change the values for the maximum number of map and reduce tasks. These are usually set to the number of cores on the local machine. All other values should stay the same.
<property>
<name>mapred.job.tracker</name>
<value>localmachine.domain:7277</value>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>8</value>
</property>
<property>
<name>mapred.tasktracker.reduce.tasks.maximum</name>
<value>8</value>
</property>
Now make sure the conf/masters file contains just “localhost” and the conf/slaves file contains the domain name of the local machine. Once that is done, start up a local jobtracker and tasktracker. If everything went well, you should be able to see the jobtracker UI on that machine at http://domainname:50030.
bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemon.sh start tasktracker
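For reference, the masters and slaves files described above need to be in place before the daemons are started. One way to write them is sketched here. CONF_DIR and LOCAL_HOST are names I picked for illustration, and localmachine.domain is the same placeholder host used in the config snippet.

```shell
#!/bin/sh
# Assumptions: CONF_DIR is the conf/ directory of the local Hadoop install,
# and LOCAL_HOST is this machine's domain name (placeholder by default).
CONF_DIR="${CONF_DIR:-conf}"
LOCAL_HOST="${LOCAL_HOST:-localmachine.domain}"

mkdir -p "$CONF_DIR"

# masters holds just "localhost"; slaves holds the local machine's name.
echo "localhost"   > "$CONF_DIR/masters"
echo "$LOCAL_HOST" > "$CONF_DIR/slaves"

# Show what was written.
cat "$CONF_DIR/masters" "$CONF_DIR/slaves"
```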
Now you should be able to run distcp in a distributed fashion, which should give you a big improvement when moving large amounts of data out of HDFS.
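A parallel copy through the local jobtracker might then look like this sketch. The URIs are placeholders, and NUM_MAPS is a name I introduced. It is set to 8 only to match the task maximums in the config above.

```shell
#!/bin/sh
# Placeholder values (assumptions).
SRC_URI="hdfs://namenode.domain:9000/user/me/logs"
DEST_URL="file:///data/logs"

# Keep this in line with mapred.tasktracker.map.tasks.maximum.
NUM_MAPS=8

# -m caps the number of simultaneous copies; the job is submitted to the
# jobtracker named in conf/hadoop-site.xml, which is now the local machine.
if [ -x bin/hadoop ]; then
    bin/hadoop distcp -m "$NUM_MAPS" -update "$SRC_URI" "$DEST_URL"
else
    echo "would run: bin/hadoop distcp -m $NUM_MAPS -update $SRC_URI $DEST_URL"
fi
```

Since the only tasktracker is the local machine, every map task writes to the same local filesystem, which is why a file:/// destination works here.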

3 Comments
You could also use HDFS Proxy, http over hftp, in contrib to get data off the grid.
Vankatesh,
That is true, those are other great ways of getting data from HDFS, and one should use the right tool for the job. In my opinion, something like the HDFS Proxy is perfect for when you want to move a couple of files around. Using a pseudo-distributed cluster and distcp gives you parallelism, which is great when you want to move large amounts of data.
Thanks for the solution!
Even though I don’t have to move around massive amounts of data, this solved my problem of connecting my development machine to the cluster.