find: Deleting Files
10.1 Deleting Files
===================
One of the most common tasks that 'find' is used for is locating files
that can be deleted. This might include:
* Files last modified more than 3 years ago which haven't been
accessed for at least 2 years
* Files belonging to a certain user
* Temporary files which are no longer required
This example concentrates on the actual deletion task rather than on
sophisticated ways of locating the files that need to be deleted. We'll
assume that the files we want to delete are old files underneath
'/var/tmp/stuff'.
10.1.1 The Traditional Way
--------------------------
The traditional way to delete files in '/var/tmp/stuff' that have not
been modified in over 90 days would have been:
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \;
The above command uses '-exec' to run the '/bin/rm' command to remove
each file. This approach works and in fact would have worked in Version
7 Unix in 1979. However, there are a number of problems with this
approach.
The most obvious problem with the approach above is that it causes
'find' to fork every time it finds a file that needs to be deleted, and
the child process then has to use the 'exec' system call to launch
'/bin/rm'. All this is quite inefficient. If we are going to use
'/bin/rm' to do this job, it is better to make it delete more than one
file at a time.
The most obvious way of doing this is to use the shell's command
substitution feature:
/bin/rm `find /var/tmp/stuff -mtime +90 -print`
or you could use the more modern form
/bin/rm $(find /var/tmp/stuff -mtime +90 -print)
The commands above are much more efficient than the first attempt.
However, there is a problem with them. The operating system limits the
total length of the argument list that can be passed to a command (the
actual limit varies between systems). This means that while the command
substitution technique will usually work, it will suddenly fail when
there are lots of files to delete. Since the task is to delete unwanted
files, this is precisely the time we don't want things to go wrong.
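As a rough illustration, the limit on the current system can be queried
with 'getconf'. This is only a sketch: 'ARG_MAX' is the POSIX name for
the limit on the combined size of the arguments and environment, and the
variable name 'arg_max' is our own choice.

```shell
# Ask the system for its limit on the combined size of the arguments
# and environment passed to a new program. The value varies by system.
arg_max=$(getconf ARG_MAX)
echo "ARG_MAX is $arg_max bytes"
```

A command substitution that expands to more than this will make the
shell fail to start '/bin/rm' at all.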
10.1.2 Making Use of 'xargs'
----------------------------
So, is there a way to be more efficient in the use of 'fork()' and
'exec()' without running up against this limit? Yes, we can be almost
optimally efficient by making use of the 'xargs' command. The 'xargs'
command reads arguments from its standard input and builds them into
command lines. We can use it like this:
find /var/tmp/stuff -mtime +90 -print | xargs /bin/rm
For example if the files found by 'find' are '/var/tmp/stuff/A',
'/var/tmp/stuff/B' and '/var/tmp/stuff/C' then 'xargs' might issue the
commands
/bin/rm /var/tmp/stuff/A /var/tmp/stuff/B
/bin/rm /var/tmp/stuff/C
The above assumes that 'xargs' has a very small maximum command line
length. The real limit is much larger but the idea is that 'xargs' will
run '/bin/rm' as many times as necessary to get the job done, given the
limits on command line length.
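The batching can be observed safely by imitating a very small limit.
In the sketch below, 'echo' stands in for '/bin/rm' so that nothing is
deleted, and '-n 2' (our assumption, chosen only for the demonstration)
caps each command line at two arguments.

```shell
# 'echo' prints the command line that xargs would have run.
# '-n 2' allows at most two arguments per command line.
batches=$(printf '%s\n' /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/C |
    xargs -n 2 echo /bin/rm)
printf '%s\n' "$batches"
```

This prints the two '/bin/rm' command lines shown above.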
This usage of 'xargs' is pretty efficient, and the 'xargs' command is
widely implemented (all modern versions of Unix offer it). So far then,
the news is all good. However, there is bad news too.
10.1.3 Unusual characters in filenames
--------------------------------------
Unix-like systems allow any characters to appear in file names with the
exception of the ASCII NUL character and the slash. Slashes can occur
in path names (as the directory separator) but not in the names of
actual directory entries. This means that the list of files that
'xargs' reads could in fact contain white space characters - spaces,
tabs and newline characters. Since by default, 'xargs' assumes that the
list of files it is reading uses white space as an argument separator,
it cannot correctly handle the case where a filename actually includes
white space. This makes the default behaviour of 'xargs' almost useless
for handling arbitrary data.
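The problem is easy to reproduce in a scratch directory. The sketch
below assumes that 'mktemp -d' is available (it is widespread but not
required by POSIX); the file name 'old report' and the variable names
are made up for the demonstration, and 'echo' again stands in for
'/bin/rm'.

```shell
# Scratch directory containing one file whose name includes a space.
tmpdir=$(mktemp -d)
touch "$tmpdir/old report"

# '-n 1' makes xargs run echo once per argument it recognised, so the
# number of output lines is the number of arguments. The single file
# name is split at the space into two arguments.
split_count=$(find "$tmpdir" -type f -print | xargs -n 1 echo | wc -l)
echo "xargs saw $split_count arguments for 1 file"

rm -rf "$tmpdir"
```

Had this been '/bin/rm', it would have been asked to remove two names,
neither of which is the file we meant.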
To solve this problem, GNU findutils introduced the '-print0' action
for 'find'. This uses the ASCII NUL character to separate the entries
in the file list that it produces. This is the ideal choice of
separator since it is the only character that cannot appear within a
path name. The '-0' option to 'xargs' makes it assume that arguments
are separated with ASCII NUL instead of white space. It also turns off
another misfeature in the default behaviour of 'xargs', which is that it
pays attention to quote characters in its input. Some versions of
'xargs' also terminate when they see a lone '_' in the input, but GNU
'xargs' no longer does that (since it has become an optional behaviour
in the Unix standard).
So, putting 'find -print0' together with 'xargs -0' we get this
command:
find /var/tmp/stuff -mtime +90 -print0 | xargs -0 /bin/rm
The result is an efficient way of proceeding that correctly handles
all the possible characters that could appear in the list of files to
delete. This is good news. However, there is, as I'm sure you're
expecting, also more bad news. The problem is that this is not a
portable construct; although other versions of Unix (notably BSD-derived
ones) support '-print0', it's not universal. So, is there a more
universal mechanism?
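Where both options are available, the NUL-separated pipeline can be
checked against awkward names before trusting it with '/bin/rm'. This
is a sketch under the same assumptions as before ('mktemp -d' exists,
the file names are invented); instead of deleting anything, a small
'sh -c' command reports how many arguments it received.

```shell
# Scratch directory with two awkward names: one containing a space,
# one containing a newline.
tmpdir=$(mktemp -d)
touch "$tmpdir/old report" "$tmpdir/two
lines"

# Each NUL-terminated name arrives as exactly one argument, so the
# inner shell reports 2 for our 2 files.
nul_count=$(find "$tmpdir" -type f -print0 | xargs -0 sh -c 'echo "$#"' sh)
echo "xargs -0 saw $nul_count arguments for 2 files"

rm -rf "$tmpdir"
```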
10.1.4 Going back to '-exec'
----------------------------
There is indeed a more universal mechanism, which is a slight
modification to the '-exec' action. The normal '-exec' action assumes
that the command to run is terminated with a semicolon (the semicolon
normally has to be quoted in order to protect it from interpretation as
the shell command separator). The SVR4 edition of Unix introduced a
slight variation, which involves terminating the command with '+'
instead:
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
The above use of '-exec' causes 'find' to build up a long command
line and then issue it. This can be less efficient than some uses of
'xargs'; for example 'xargs' allows building up new command lines while
the previous command is still executing, and allows specifying a number
of commands to run in parallel. However, the 'find ... -exec ... +'
construct has the advantage of wide portability. GNU findutils did not
support '-exec ... +' until version 4.2.12; one of the reasons for this
is that it already had the '-print0' action in any case.
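The difference between the two terminators can be seen by counting how
often the command is run. In this sketch the scratch directory, the
file names and the variable names are ours, and a harmless 'sh -c'
command that prints one line per invocation takes the place of
'/bin/rm'.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/A" "$tmpdir/B" "$tmpdir/C"

# With ';' the command is run once per file: three invocations.
per_file=$(find "$tmpdir" -type f -exec sh -c 'echo run' sh {} \; | wc -l)

# With '+' find collects the names and runs the command once.
batched=$(find "$tmpdir" -type f -exec sh -c 'echo run' sh {} + | wc -l)

echo "';' ran the command $per_file times, '+' ran it $batched time(s)"
rm -rf "$tmpdir"
```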
10.1.5 A more secure version of '-exec'
---------------------------------------
The command above seems to be efficient and portable. However, within
it lurks a security problem. The problem is shared with all the
commands we've tried in this worked example so far, too. The security
problem is a race condition; that is, if it is possible for somebody to
manipulate the filesystem that you are searching while you are searching
it, it is possible for them to persuade your 'find' command to cause the
deletion of a file that you can delete but they normally cannot.
The problem occurs because the '-exec' action is defined by the POSIX
standard to invoke its command with the same working directory as 'find'
had when it was started. This means that the arguments which replace
the {} include a relative path from 'find''s starting point down to the
file that needs to be deleted. For example,
find /var/tmp/stuff -mtime +90 -exec /bin/rm {} \+
might actually issue the command:
/bin/rm /var/tmp/stuff/A /var/tmp/stuff/B /var/tmp/stuff/passwd
Notice the file '/var/tmp/stuff/passwd'. Likewise, the command:
cd /var/tmp && find stuff -mtime +90 -exec /bin/rm {} \+
might actually issue the command:
/bin/rm stuff/A stuff/B stuff/passwd
If an attacker can rename 'stuff' to something else (making use of
their write permissions in '/var/tmp') they can replace it with a
symbolic link to '/etc'. That means that the '/bin/rm' command will be
invoked on '/etc/passwd'. If you are running your 'find' command as
root, the attacker has just managed to delete a vital file. All they
needed to do to achieve this was replace a subdirectory with a symbolic
link at the vital moment.
There is, however, a simple solution to the problem. This is an
action which works a lot like '-exec' but doesn't need to traverse a
chain of directories to reach the file that it needs to work on. This
is the '-execdir' action, which was introduced by the BSD family of
operating systems. The command,
find /var/tmp/stuff -mtime +90 -execdir /bin/rm {} \+
might delete a set of files by performing these actions:
1. Change directory to /var/tmp/stuff/foo
2. Invoke '/bin/rm ./file1 ./file2 ./file3'
3. Change directory to /var/tmp/stuff/bar
4. Invoke '/bin/rm ./file99 ./file100 ./file101'
This is a much more secure method. We are no longer exposed to a
race condition. For many typical uses of 'find', this is the best
strategy. It's reasonably efficient, but the length of the command line
is limited not just by the operating system limits, but also by how many
files we actually need to delete from each directory.
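The difference in what the command receives can be inspected with
'echo' in place of '/bin/rm'. This is a sketch only: the directory
'foo' and file 'file1' are invented, '-execdir' is assumed to be
available, and GNU 'find' will refuse to run '-execdir' if 'PATH'
contains a relative or empty entry. Some implementations pass the bare
name rather than prefixing it with './'.

```shell
tmpdir=$(mktemp -d)
mkdir "$tmpdir/foo"
touch "$tmpdir/foo/file1"

# '-exec' passes the whole path from the starting point downwards.
via_exec=$(find "$tmpdir" -type f -exec echo {} \;)

# '-execdir' runs the command in the file's own directory and passes
# only the final name, so no chain of directories is traversed.
via_execdir=$(find "$tmpdir" -type f -execdir echo {} \;)

echo "-exec saw:    $via_exec"
echo "-execdir saw: $via_execdir"
rm -rf "$tmpdir"
```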
Is it possible to do any better? In the case of general file
processing, no. However, in the specific case of deleting files it is
indeed possible to do better.
10.1.6 Using the '-delete' action
---------------------------------
The most efficient and secure method of solving this problem is to use
the '-delete' action:
find /var/tmp/stuff -mtime +90 -delete
This alternative is more efficient than any of the '-exec' or
'-execdir' actions, since it entirely avoids the overhead of forking a
new process and using 'exec' to run '/bin/rm'. It is also normally more
efficient than 'xargs' for the same reason. The file deletion is
performed from the directory containing the entry to be deleted, so the
'-delete' action has the same security advantages as the '-execdir'
action has.
The '-delete' action was introduced by the BSD family of operating
systems.
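Because '-delete' acts immediately, it can be worth running the same
expression with '-print' first and only then swapping in '-delete'.
The sketch below does this in a scratch directory; the file names 'old'
and 'new' and the variable names are ours, and 'touch -t' is used to
give one file an old modification time. Note that 'find' evaluates its
expression from left to right, so '-delete' belongs after the tests.

```shell
tmpdir=$(mktemp -d)
touch -t 200001010000 "$tmpdir/old"   # modification time: 1 Jan 2000
touch "$tmpdir/new"                   # modification time: now

# Dry run: list what the expression matches without removing anything.
listed=$(find "$tmpdir" -type f -mtime +90 -print)
echo "would delete: $listed"

# Real run: the same tests, with '-delete' in place of '-print'.
find "$tmpdir" -type f -mtime +90 -delete

remaining=$(ls "$tmpdir")
echo "left behind: $remaining"
rm -rf "$tmpdir"
```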
10.1.7 Improving things still further
-------------------------------------
Is it possible to improve things still further? Not without either
modifying the system library of the operating system or having more
specific knowledge of the layout of the filesystem and disk I/O
subsystem, or both.
The 'find' command traverses the filesystem, reading directories. It
then issues a separate system call for each file to be deleted. If we
could modify the operating system, there are potential gains that could
be made:
* We could have a system call to which we pass more than one filename
for deletion
* Alternatively, we could pass in a list of inode numbers (on
GNU/Linux systems, 'readdir()' also returns the inode number of
each directory entry) to be deleted.
The above possibilities sound interesting, but from the kernel's
point of view it is difficult to enforce standard Unix access controls
for such processing by inode number. Such a facility would probably
need to be restricted to the superuser.
Another way of improving performance would be to increase the
parallelism of the process. For example if the directory hierarchy we
are searching is actually spread across a number of disks, we might
somehow be able to arrange for 'find' to process each disk in parallel.
In practice GNU 'find' doesn't have such an intimate understanding of
the system's filesystem layout and disk I/O subsystem.
However, since the system administrator can have such an
understanding they can take advantage of it like so:
find /var/tmp/stuff1 -mtime +90 -delete &
find /var/tmp/stuff2 -mtime +90 -delete &
find /var/tmp/stuff3 -mtime +90 -delete &
find /var/tmp/stuff4 -mtime +90 -delete &
wait
In the example above, four separate instances of 'find' are used to
search four subdirectories in parallel. The 'wait' command simply waits
for all of these to complete. Whether this approach is more or less
efficient than a single instance of 'find' depends on a number of
things:
* Are the directories being searched in parallel actually on separate
disks? If not, this parallel search might just result in a lot of
disk head movement and so the speed might even be slower.
* Other activity - are other programs also doing things on those
disks?
10.1.8 Conclusion
-----------------
The fastest and most secure way to delete files with the help of 'find'
is to use '-delete'. Using 'xargs -0 -P N' can also make effective use
of the disk, but it is not as secure.
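For reference, a parallel 'xargs' invocation might look like the sketch
below. It assumes an 'xargs' that supports '-0' and '-P' (both are
extensions offered by the GNU and BSD versions); the scratch directory,
the batch size of 2 and the process count of 2 are arbitrary choices
for the demonstration.

```shell
tmpdir=$(mktemp -d)
touch "$tmpdir/a" "$tmpdir/b" "$tmpdir/c" "$tmpdir/d"

# Run up to two 'rm' processes at once, each given at most two names.
find "$tmpdir" -type f -print0 | xargs -0 -n 2 -P 2 rm

# Count what is left; all four files should be gone.
left=$(find "$tmpdir" -type f | wc -l)
echo "files remaining: $left"
rm -rf "$tmpdir"
```

The path names are still passed from the starting point downwards, so
this pipeline shares the race condition described for '-exec'.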
In the case where we're doing things other than deleting files, the
most secure alternative is '-execdir ... +', but this is not as portable
as the insecure action '-exec ... +'.
The '-delete' action is not completely portable, but the only other
possibility which is as secure ('-execdir') is no more portable. The
most efficient portable alternative is '-exec ... +', but this is
insecure and isn't supported by versions of GNU findutils prior to
4.2.12.