Backing up Hadoop’s HDFS

Very little information can be found on methods of backing up data on a Hadoop cluster. To be clear, by Hadoop cluster I mean HDFS. Any solution has to take a few things into account. The first is how to keep your data in more than one place for safety. Some people describe running multiple clusters and simply copying the data between them. Also, before there’s any controversy: replication is not a form of backup. Yes, HDFS has a replication factor, and that value determines how many copies of each file’s blocks are kept. This helps keep data safe when a node fails, but deleting a file still deletes every replica, and once that happens there’s no turning back.

The next issue is the sheer volume of data that can be stored in HDFS. Backing up HDFS to tape can seem daunting at first, yet it is a very enticing idea for off-site and disaster-recovery scenarios. There is no way to stage all the data in HDFS on some RAID or NAS device before going down to tape, at least not without spending a ton of money on a custom solution from one of the storage vendors.

Our Solution

Here is one approach to backing up HDFS using the time-stamp of the files on HDFS to create a pseudo-incremental backup. We wrote a program that takes four parameters:

  • Parameter #1 is a date in the past
  • Parameter #2 is a maximum amount of data to be downloaded
  • Parameter #3 is the path on HDFS to copy files from
  • Parameter #4 is the path to the staging area to copy files to

The first parameter tells the program to back up only files whose time-stamp is equal to or later than the given date. This does mean traversing the entire HDFS directory structure to collect files and their time-stamps, and then sorting them by date. In the big scheme of things, the time this takes is acceptable compared to moving all the files down to the staging area.
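The selection step can be sketched in a few lines of Python. This is only an illustration, not the program described above: `FileEntry` and `select_candidates` are hypothetical names, and the sketch assumes the HDFS listing has already been collected into plain records (how you walk HDFS is left out).

```python
from typing import Iterable, List, NamedTuple


class FileEntry(NamedTuple):
    """One file from an HDFS listing (hypothetical record, for illustration)."""
    path: str
    mtime: int  # modification time, seconds since the epoch
    size: int   # size in bytes


def select_candidates(entries: Iterable[FileEntry], since: int) -> List[FileEntry]:
    """Return files modified at or after `since`, oldest first.

    Ties on the time-stamp are broken by path so the order is stable
    from one run to the next.
    """
    recent = (e for e in entries if e.mtime >= since)
    return sorted(recent, key=lambda e: (e.mtime, e.path))


listing = [
    FileEntry("/data/c.log", 300, 10),
    FileEntry("/data/a.log", 100, 10),
    FileEntry("/data/b.log", 200, 10),
]
candidates = select_candidates(listing, since=200)
```

Sorting oldest first matters for the later steps: it means the time-stamp of the last file downloaded marks a clean point to resume from.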

The second parameter stops the program from filling the staging area, which would make the backup crash, by limiting how much data the program downloads from HDFS in one run. There is a caveat. If your data in HDFS grows by more per backup cycle than the staging area can hold, this approach will never work, because the backup will never catch up. Also, if this type of backup is started when HDFS already holds data, it will take some time to catch up to the data created daily. Depending on how much data is already in HDFS and how much is added daily, the staging area may need to be very large to get all the data down to tape fast enough to catch up.
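To make the size limit and the catch-up caveat concrete, here is a rough Python sketch. The names `take_within_budget` and `cycles_to_catch_up` are hypothetical, and the catch-up estimate assumes a deliberately simple model: every cycle adds a fixed amount of new data and drains at most one full staging area.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class FileEntry(NamedTuple):
    """One file from an HDFS listing (hypothetical record, for illustration)."""
    path: str
    mtime: int  # modification time, seconds since the epoch
    size: int   # size in bytes


def take_within_budget(
    candidates: Iterable[FileEntry], max_bytes: int
) -> Tuple[List[FileEntry], Optional[int]]:
    """Pick files, in the order given, until the next one would exceed max_bytes.

    `candidates` must already be sorted oldest first. Returns the chosen
    files and the time-stamp of the last one (None if nothing fits).
    """
    chosen: List[FileEntry] = []
    used = 0
    for entry in candidates:
        if used + entry.size > max_bytes:
            break  # stop here; do not skip ahead to smaller, newer files
        chosen.append(entry)
        used += entry.size
    last_mtime = chosen[-1].mtime if chosen else None
    return chosen, last_mtime


def cycles_to_catch_up(
    backlog_bytes: int, growth_per_cycle: int, staging_bytes: int
) -> Optional[int]:
    """Estimate backup cycles needed to clear the backlog, or None if never."""
    drained_per_cycle = staging_bytes - growth_per_cycle
    if drained_per_cycle <= 0:
        return None  # data grows at least as fast as one cycle can drain it
    return -(-backlog_bytes // drained_per_cycle)  # ceiling division
```

The loop stops at the first file that does not fit instead of skipping it. Skipping would push the last time-stamp past a file that was never downloaded, and the next run would miss it. One consequence, under these assumptions, is that a single file larger than the limit would stall the backup until the limit is raised.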

The last two parameters determine which directories on HDFS are backed up and where they land in the staging area.

Our tape backup solution can run a script before and after the data is moved to tape. Before the data is dumped to tape, the download program described above is run with an initial time-stamp. It prints the time-stamp of the last file it downloaded. The staging area is then dumped to tape; the tape backup software is set up to always do incremental backups of the staging area. Once that is done, the printed time-stamp is saved and used as the input for the download program’s next run. The tape backup software thinks it has been doing incremental backups of the same data set, even though it has really only been seeing the incremental downloads the download program created.

Even with these caveats, this approach has proven to work quite well: we haven’t needed a large staging area to back up massive amounts of information, and we can recover from file deletion and other catastrophes.
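One way to carry the time-stamp from one run to the next is a small pair of state files, sketched below in Python. This is an assumption about how the hand-off could be done, not a description of our scripts: the function names and the idea of separate "pending" and "checkpoint" files are hypothetical.

```python
import os


def read_checkpoint(checkpoint_path: str, initial: int) -> int:
    """Return the time-stamp to start from, or `initial` on the very first run."""
    try:
        with open(checkpoint_path) as f:
            return int(f.read().strip())
    except FileNotFoundError:
        return initial


def save_pending(pending_path: str, last_mtime: int) -> None:
    """Pre-backup step: record the time-stamp of the last file downloaded."""
    with open(pending_path, "w") as f:
        f.write(f"{last_mtime}\n")


def commit_checkpoint(pending_path: str, checkpoint_path: str) -> bool:
    """Post-backup step: promote the pending time-stamp once tape has the data.

    Returns False if there is nothing pending to promote.
    """
    if not os.path.exists(pending_path):
        return False
    os.replace(pending_path, checkpoint_path)
    return True
```

Keeping the new time-stamp in a pending file until the post-backup script runs means a failed tape job leaves the old checkpoint in place, so the same files are simply downloaded again on the next attempt.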

Here is an example implementation of the incremental HDFS backup.


One Comment

  1. Posted August 9, 2009 at 9:05 pm | Permalink

    I wonder why you folks are using HDFS. I get that you are using a lot of data, but I am not sure why you want to distribute it on, I imagine, your own commodity hardware? As we are very much in the back up and recovery business, I am just wondering about this choice that you have made.

    Thanks in advance.
