Friday, July 05, 2013

Hadoop HDFS Trash facility - A short introduction

We all know that Hadoop's filesystem has a trash facility: when files are deleted, they are not actually removed right away; instead, they are moved to a trash folder. The trash folder keeps each deleted file for a minimum period before it is permanently deleted by the system. This "minimum period" is defined, in minutes, by fs.trash.interval in the core-site.xml file.

As in most operating systems, the trash facility is a user-level feature: only files that are deleted using the filesystem shell are put in the trash. Files deleted programmatically are deleted immediately. It is still possible to use the trash programmatically: you construct a Trash instance and then call its moveToTrash() method with the path of the file you intend to delete. If moveToTrash() returns false, it means either that trash is not enabled or that the file is already in the trash.
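Here is a minimal Java sketch of what that looks like (the class is org.apache.hadoop.fs.Trash; the path below is only for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.Trash;

public class MoveToTrashExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Example path only - use whatever file you actually want to delete
    Path target = new Path("/user/tony/sample.txt");

    Trash trash = new Trash(fs, conf);
    if (trash.moveToTrash(target)) {
      System.out.println("Moved to trash: " + target);
    } else {
      // false means trash is disabled or the file is already in trash
      System.out.println("Could not move " + target + " to trash");
    }
  }
}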


When trash is enabled, each user has their own trash directory, called ".Trash", in the user's home directory. File recovery is simple: you look for the file in a subdirectory of ".Trash" and move it out of the trash subtree (a quick example of this follows the configuration snippet below). HDFS automatically deletes files in trash folders on a periodic basis. As I mentioned earlier, the interval is defined by fs.trash.interval in core-site.xml, and the unit is minutes. Here is an example:

<property>
  <name>fs.trash.interval</name>
  <value>0</value>
  <description>Number of minutes after which the checkpoint
  gets deleted.
  If zero, the trash feature is disabled.
  </description>
</property>
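To recover a file, you simply move it back out of the trash subtree, for example (the paths here just match the sample.txt walkthrough later in this post):

$ hadoop fs -mv /user/tony/.Trash/Current/user/tony/sample.txt /user/tony/sample.txt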

You can expunge the trash, which will delete files that have been in the trash longer than their minimum period, using the filesystem shell:
% hadoop fs -expunge
The Trash class exposes an expunge() method that has the same effect.
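If you want to do the same thing from code, a short sketch along these lines should work (again using org.apache.hadoop.fs.Trash):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Trash;

public class ExpungeExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Deletes trash checkpoints older than fs.trash.interval,
    // just like "hadoop fs -expunge" on the command line
    new Trash(fs, conf).expunge();
  }
}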

The "-skipTrash" option:
You can use "hadoop fs -rmr -skipTrash /user/xx/.Trash" command to delete the whole ".Trash" folder. Please pay special attention to this option, it will DELETE everything in .Trash include the .Trash folder. Here is an example of how to use it:

$ hadoop fs -ls /user/tony/sample.txt
Found 1 items
-rw-r--r--   3 tony supergroup        530 2013-07-05 10:57 /user/tony/sample.txt

$ hadoop fs -rm /user/tony/sample.txt
Moved: 'hdfs://nameservice1/user/tony/sample.txt' to trash at: hdfs://nameservice1/user/tony/.Trash/Current

$ hadoop fs -ls /user/tony/.Trash/Current/user/tony/sample.txt
Found 1 items
-rw-r--r--   3 tony supergroup        530 2013-07-05 10:57 /user/tony/.Trash/Current/user/tony/sample.txt

$ hadoop fs -rmr -skipTrash /user/tony/.Trash
Deleted /user/tony/.Trash

# You can see .Trash is gone
$ hadoop fs -ls /user/tony/
Found 3 items
drwx------   - tony supergroup          0 2013-06-27 13:41 /user/tony/.staging
drwxr-xr-x   - tony supergroup          0 2013-06-05 14:50 /user/tony/input
drwxr-xr-x   - tony supergroup          0 2013-06-04 10:26 /user/tony/output

# Now we are going to recreate .Trash
$ hadoop fs -copyFromLocal sample.txt /user/tony/

$ hadoop fs -rm /user/tony/sample.txt
Moved: 'hdfs://nameservice1/user/tony/sample.txt' to trash at: hdfs://nameservice1/user/tony/.Trash/Current

# .Trash is back
$ hadoop fs -ls /user/tony/
drwx------   - tony supergroup          0 2013-07-05 11:10 /user/tony/.Trash
drwx------   - tony supergroup          0 2013-06-27 13:41 /user/tony/.staging
drwxr-xr-x   - tony supergroup          0 2013-06-05 14:50 /user/tony/input
drwxr-xr-x   - tony supergroup          0 2013-06-04 10:26 /user/tony/output
