Multiple ways of copying data out of HDFS

There are multiple ways of getting data out of HDFS on to a local machine that does not belong to the HDFS cluster. The method used really depends on the needs of the data-transfer.

The simplest way of getting data out of HDFS on to a non-cluster machine is to use the functions built into the hadoop script.  One example is the -copyToLocal flag.  This will move files one-by-one to the local file system in sequence.  It’s easy to use and fine for moving small number of files.  Moving many files this way will take time.

hadoop fs -copyToLocal hdfs://name-node/path/to/dir /path/to/local/dir

A second method is to use distcp with a local jobtracker.  This adds some complexity but brings in the ability to do more rsync-type operations than just using -copyToLocal. As with -copyToLocal running distcp this way gives you a single thread.  This means there is no parallelism to the copy even though distcp is capable of this so files will be moved over one-by-one.  One thing to note is the file URL does have 3 forward slashes, 2 to define the URL and 1 to define the root of the filesystem.  All file URLS must be from / (slash).

hadoop distcp -jt local hdfs://name-node/path/to/dir file:///path/to/local/dir

The third and final method I’ll go over is to set-up a pseudo-distributed system that uses the cluster’s HDFS for distcp.  This method is a lot more involved but gives you all the features of distcp (rsync-like and parallelism) when copying files.

The first step is to copy the conf/hadoop-site.xml file from the HDFS cluster.  This file should contain all information on how to connect to the HDFS.  You will have to add/change the value for mapred.job.tracker to point to the local machine and add/change value for the number of mappers and reducers.  Usually this equals the number of cores on the local machine.  All other values should stay the same.

mapred.job.tracker
localmachine.domain:7277

mapred.tasktracker.map.tasks.maximum
8

mapred.tasktracker.reduce.tasks.maximum
8

Now make sure the conf/masters file has just “localhost” and the conf/slaves file has the domainname of the local machine.  Once that is done start up a local jobtracker and tasktracker and if everything went well you should be able to see see the jobtracker UI on that machine if you go to the webpage on http://domainname:50030.

bin/hadoop-daemon.sh start jobtracker
bin/hadoop-daemon.sh start tasktracker

Now you should be able to run distcp in a distributed fashion which should give you great improvement when moving large amounts of data out from HDFS.

hadoop distcp hdfs://name-node/path/to/dir file:///path/to/local/dir

  • Facebook
  • HackerNews
  • Reddit
  • Twitter
  • del.icio.us
  • Digg
  • Slashdot
  • StumbleUpon
This entry was posted in Hadoop, HDFS. Bookmark the permalink. Post a comment or leave a trackback: Trackback URL.

4 Comments

  1. Venkatesh
    Posted June 26, 2009 at 11:58 pm | Permalink

    You could also use HDFS Proxy, http over hftp, in contrib to get data off the grid.

  2. Posted June 27, 2009 at 4:06 pm | Permalink

    Vankatesh,

    That is true, those are other great ways of getting data from HDFS and one should use the right tool for the job. In my opinion , something like the HDFS Proxy is perfect for when you want to move a couple of files around. Using a pseudo-distributed cluster and distcp gives you parallelism which is great when you want to move large amounts of data.

  3. Posted August 26, 2009 at 1:51 am | Permalink

    Thanks for the solution!

    Even though I don’t have to move around massive amounts of data, this solved my problem of connecting my development machine to the cluster.

  4. Ashok
    Posted September 12, 2011 at 4:23 am | Permalink

    what do you mean by rscync-like can you please elaborate it..

One Trackback

  1. [...] One is on different ways of copying data out of a Hadoop cluster’s HDFS.  Being that there’s no good way of mounting HDFS on a local machine getting data out of the cluster is important.  Getting a couple of files isn’t very hard but getting lots of data out (more than 1TB lets say) can be challenging.  The post shows some tricks I’ve been using. [...]

Post a Comment

Your email is never published nor shared. Required fields are marked *

*
*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

  • Rapleaf Is Hiring!

    We are looking for engineers who want to solve challenging problems.

    We have great people, do great work, and have great perks.

    Know someone who might be interested? Refer a friend and get $5,000 for successful hires.

    See our current openings at
    www.rapleaf.com/careers