Multiple ways of copying data out of HDFS

There are multiple ways of getting data out of HDFS onto a local machine that does not belong to the HDFS cluster. Which method to use depends on the needs of the data transfer, mainly how many files and how much data have to be moved.

The simplest way of getting data out of HDFS onto a non-cluster machine is to use the commands built into the hadoop script. One example is the -copyToLocal option of hadoop fs. It copies files to the local file system one by one, in sequence. It’s easy to use and fine for a small number of files, but copying many files this way will take time.

hadoop fs -copyToLocal hdfs://name-node/path/to/dir /path/to/local/dir
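
When several directories have to come out this way, the same command can be wrapped in a loop. The following is only a sketch: the function name copy_dirs and the HADOOP_CMD variable are made up for illustration (HADOOP_CMD exists so the loop can be dry-run with echo), and it assumes each local destination does not exist yet.

```shell
#!/bin/sh
# HADOOP_CMD is a hypothetical override hook; by default the real hadoop script is used.
HADOOP_CMD="${HADOOP_CMD:-hadoop}"

# copy_dirs NAMENODE_URL LOCAL_ROOT HDFS_DIR...
# Copies each HDFS directory to LOCAL_ROOT/HDFS_DIR, one after another.
copy_dirs() {
    namenode="$1"
    local_root="$2"
    shift 2
    for dir in "$@"; do
        # Make sure the local parent directory exists before copying into it.
        mkdir -p "$(dirname "${local_root}${dir}")" || return 1
        # Each call blocks until the directory is copied, so there is no parallelism.
        $HADOOP_CMD fs -copyToLocal "${namenode}${dir}" "${local_root}${dir}" || return 1
    done
}
```

For example, copy_dirs hdfs://name-node /path/to/local/dir /path/to/dir1 /path/to/dir2 would run two copies back to back. Setting HADOOP_CMD=echo first prints the commands instead of running them.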

A second method is to use distcp with a local jobtracker. This adds some complexity but lets you do more rsync-type operations than -copyToLocal alone. As with -copyToLocal, running distcp this way gives you a single thread, so files are still copied one by one even though distcp is capable of parallelism. One thing to note is that the file URL has 3 forward slashes: 2 belong to the URL scheme separator (file://) and 1 is the root of the filesystem. All file URLs must be absolute paths, starting from / (slash).

hadoop distcp -jt local hdfs://name-node/path/to/dir file:///path/to/local/dir
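
The three-slash rule is easy to get wrong when the local path comes from a variable. As a small sketch (the helper name to_file_url is hypothetical, not part of Hadoop), a shell function can build the URL and refuse relative paths:

```shell
#!/bin/sh
# to_file_url PATH
# Prints a file:// URL for an absolute local path and fails on a relative one.
to_file_url() {
    case "$1" in
        /*)
            # "file://" followed by "/path" yields the three slashes.
            printf 'file://%s\n' "$1"
            ;;
        *)
            echo "to_file_url: path must be absolute: $1" >&2
            return 1
            ;;
    esac
}
```

It could then be used as hadoop distcp -jt local hdfs://name-node/path/to/dir "$(to_file_url /path/to/local/dir)".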

The third and final method I’ll go over is to set up a pseudo-distributed system that uses the cluster’s HDFS for distcp. This method is a lot more involved but gives you all the features of distcp (rsync-like operations and parallelism) when copying files.

The first step is to copy the conf/hadoop-site.xml file from the HDFS cluster. This file should contain all the information needed to connect to the HDFS. You will have to add or change the value of mapred.job.tracker to point to the local machine, and add or change the values for the maximum number of map and reduce tasks. Usually these equal the number of cores on the local machine. All other values should stay the same.

<property>
  <name>mapred.job.tracker</name>
  <value>localmachine.domain:7277</value>
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>8</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>8</value>
</property>
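
If the same setup is repeated on machines with different core counts, the three properties can be generated instead of typed by hand. This is a sketch under assumptions: the function names mapred_property and local_mapred_properties are invented here, port 7277 is simply the one from the example above, and the output still has to be pasted into the copied hadoop-site.xml manually.

```shell
#!/bin/sh
# mapred_property NAME VALUE
# Prints one <property> element in the layout used by hadoop-site.xml.
mapred_property() {
    printf '<property>\n  <name>%s</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
}

# local_mapred_properties HOST SLOTS
# Prints the three properties for a local jobtracker on HOST with SLOTS
# map slots and SLOTS reduce slots.
local_mapred_properties() {
    host="$1"
    slots="$2"
    mapred_property mapred.job.tracker "${host}:7277"
    mapred_property mapred.tasktracker.map.tasks.maximum "${slots}"
    mapred_property mapred.tasktracker.reduce.tasks.maximum "${slots}"
}
```

On Linux, something like local_mapred_properties localmachine.domain "$(getconf _NPROCESSORS_ONLN)" should fill in the core count, though how the count is obtained may differ on other systems.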

Now make sure the conf/masters file contains just “localhost” and the conf/slaves file contains the domain name of the local machine. Once that is done, start up a local jobtracker and tasktracker. If everything went well, you should be able to see the jobtracker UI by going to http://domainname:50030, where domainname is the domain name of the local machine.

bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemon.sh start tasktracker
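
The masters and slaves edits that have to happen before the daemons are started can also be scripted. A minimal sketch, assuming the conf directory already exists; write_node_lists is a hypothetical helper name:

```shell
#!/bin/sh
# write_node_lists CONF_DIR HOSTNAME
# Overwrites CONF_DIR/masters with "localhost" and CONF_DIR/slaves with HOSTNAME.
write_node_lists() {
    conf_dir="$1"
    host="$2"
    if [ ! -d "$conf_dir" ]; then
        echo "write_node_lists: no such directory: $conf_dir" >&2
        return 1
    fi
    echo "localhost" > "${conf_dir}/masters"
    echo "${host}" > "${conf_dir}/slaves"
}
```

Running write_node_lists conf localmachine.domain from the Hadoop directory, before the two hadoop-daemon.sh commands, would leave both files in the state described above. Note that it replaces whatever the files contained before.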

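Now you should be able to run distcp in a distributed fashion, with the copy spread across the local map tasks, which should be a big improvement when moving large amounts of data out of HDFS. Note that the -jt local option is no longer passed, so distcp uses the jobtracker from the configuration.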

hadoop distcp hdfs://name-node/path/to/dir file:///path/to/local/dir

3 Comments

  1. Venkatesh
    Posted June 26, 2009 at 11:58 pm

    You could also use HDFS Proxy, http over hftp, in contrib to get data off the grid.

  2. Posted June 27, 2009 at 4:06 pm

    Venkatesh,

    That is true, those are other great ways of getting data from HDFS, and one should use the right tool for the job. In my opinion, something like the HDFS Proxy is perfect when you want to move a couple of files around. Using a pseudo-distributed cluster and distcp gives you parallelism, which is great when you want to move large amounts of data.

  3. Posted August 26, 2009 at 1:51 am

    Thanks for the solution!

    Even though I don’t have to move around massive amounts of data, this solved my problem of connecting my development machine to the cluster.

