I've just spent a good few hours trying to find any clues to a problem I was getting. The du command would fail with a mysterious "fts_read" error, and there didn't seem to be any good answers on the net explaining why. I figured that someday this post will be found and might save someone a lot of time. It's a lengthy post and, I believe, the first one on this blog to be truly "advanced" in a technical sense.
My scenario and symptoms of the du fts_read problem
Before going into any further details, I'd like to briefly explain what I've been trying to do: my task was to automate scanning of large directories with the du command, so that I could later parse this data and build space usage trends for the most important directories.
Nothing fancy, just a du command in its simplest form:
solaris$ du -k /bigdir
What puzzled me was that du would fail with the fts_read error at seemingly random parts of the /bigdir I was scanning. The fact that I was running it from a cronjob didn't help at all, and the whole problem seemed much more mysterious because the same command line seemed to work when run manually, but failed in cronjobs.
The error would appear to be fatal, so du scan would stop and return me something like this:
solaris$ du -k /bigdir
...
15372 /bigdir/storage1
887612 /bigdir/storage2
du: fts_read failed: No such file or directory
The strangest thing was that the files and directories appeared to be all there, nothing was missing and so the error message seemed bogus.
I also think it's the cron involvement that distracted me from fts_read, but as soon as I got a first failure when running the command manually, I had fully turned my attention to fts_read.
What is fts_read?
This was the first question I asked myself after a few consecutive failures of my original command. After looking fts_read up, I found this in its man page:
The fts functions are provided for traversing file hierarchies. A simple overview is that the fts_open() function returns a "handle" on a file hierarchy, which is then supplied to the other fts functions. The function fts_read() returns a pointer to a structure describing one of the files in the file hierarchy.

On its own, this didn't help much, and so I ended up looking into the source code of the du command I've used (it was a du from the latest GNU coreutils package):
static bool
du_files (char **files, int bit_flags)
{
  bool ok = true;

  if (*files)
    {
      FTS *fts = xfts_open (files, bit_flags, NULL);

      while (1)
        {
          FTSENT *ent;

          ent = fts_read (fts);
          if (ent == NULL)
            {
              if (errno != 0)
                {
                  /* FIXME: try to give a better message */
                  error (0, errno, _("fts_read failed"));
                  ok = false;
                }
              break;
            }
          FTS_CROSS_CHECK (fts);

          ok &= process_file (fts, ent);
        }

      /* Ignore failure, since the only way it can do so is in failing to
         return to the original directory, and since we're about to exit,
         that doesn't matter.  */
      fts_close (fts);
    }

  if (print_grand_total)
    print_size (&tot_dui, _("total"));

  return ok;
}
The main thing I gained from this code is that the error cannot be worked around inside du, otherwise a solution would have been implemented in this function. If an fts_read problem occurs, it's a big deal: du reports the error, breaks out of the scanning loop and terminates right there on the spot.
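Since the function above clears ok when fts_read fails, du ends up exiting with a non-zero status, and a cron job can at least notice that. Here is a minimal sketch of such a wrapper, assuming a POSIX shell; the function name run_du_logged and its file arguments are hypothetical and not part of du itself.

```shell
# Hypothetical helper (not part of du): run du on one directory, keep
# whatever output it managed to print, and record a failed scan.
# Usage: run_du_logged DIR OUTFILE ERRFILE
run_du_logged() {
    dir=$1
    out=$2
    err=$3
    # Usage lines go to OUTFILE, error messages (such as "fts_read failed")
    # go to ERRFILE, so the partial report can still be parsed.
    du -k "$dir" >"$out" 2>"$err"
    status=$?
    if [ "$status" -ne 0 ]; then
        echo "du exited with status $status for $dir" >>"$err"
    fi
    return "$status"
}
```

A cron entry could then call something like run_du_logged /bigdir /tmp/du.out /tmp/du.err (the file paths are only examples) and treat a non-zero return as an incomplete scan.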
Further googling also confirmed that the same error was seen across various operating systems, which also suggested it's not an issue specific to my OS (Solaris 10u4) or a du implementation.
The reason fts_read happens in du
Since I couldn't find much more in the code, I had to think about my scenario of using du. Eventually I realized what was happening, and I bet it's the same reason so many others have seen this issue before.
The error message wasn't bogus, and was telling me the exact reason for the du command failing:
du: fts_read failed: No such file or directory
I started double-checking the files and directories and realized that, due to the huge size of my /bigdir, it was taking 3h+ to scan it in full. This meant that by the time the du command had drilled down into all of its subdirectories, some files and even whole subdirs that existed when the scan started were not there anymore.
This surely did upset the du command. Not only does it seem suspicious to have a whole directory missing from where you expected it to be, but it's also a problem for calculating and reporting the disk usage stats, and so there's really no other option but to abort the du mission.
This also explained the seemingly random nature of the failure: the rate of change of the underlying data in /bigdir isn't distributed evenly, which means that some directories are only changed or removed once a week, while others can be created, processed and removed within a few minutes. It's just that in some cases I was lucky to run du at a relatively quiet time, when most of the files and directories in /bigdir weren't moving around.
Is there a workaround for the fts_read du issue?
The fts_read function itself is only a messenger in this case. The real problem occurs at a deeper level, inside the fts_build function, which does the actual scan of directories and files in a specified mountpoint.
Not being a developer, I can't really confirm a workaround possibility, but I have a theory, and it's explained below.
Work-around 1: Start at a lower level of subdirectories tree
The first thing to try is, obviously, to drop the idea of starting your scan this high up in the directory structure. Instead of doing
solaris$ du -k /bigdir
start doing
solaris$ du -k /bigdir/storage1/dataset1
solaris$ du -k /bigdir/storage2/dataset2
The idea here is that du will be building and scanning a subdirectory tree that starts much further down the directory structure. Each command will take less time to run, which means there's less of a chance that some data underneath it will be changed (removed) during the scan.
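This approach can be sketched as a small loop that runs a separate du for each immediate subdirectory, so that one failed scan doesn't take the others down with it. This is only a sketch, assuming a POSIX shell; the function name scan_subdirs is hypothetical, and it uses du -sk to print one total per subdirectory, which is my choice for trend data rather than something the commands above require.

```shell
# Hypothetical helper: run an independent du for every immediate
# subdirectory of PARENT and keep going if one of them fails.
# Usage: scan_subdirs PARENT
scan_subdirs() {
    parent=$1
    failed=0
    for sub in "$parent"/*/; do
        # Skip the unexpanded pattern (no subdirectories at all) and any
        # directory that vanished after the pattern was expanded.
        [ -d "$sub" ] || continue
        # -s prints a single total for this subdirectory, -k reports kilobytes.
        if ! du -sk "$sub"; then
            echo "du failed for $sub" >&2
            failed=1
        fi
    done
    # Non-zero if at least one subdirectory could not be scanned.
    return "$failed"
}
```

For the layout used in this post, scan_subdirs /bigdir/storage1 would produce one line per dataset directory, and a dataset that disappears mid-scan only costs that one line.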
Work-around 2: Use max-depth to limit how far du will go
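Another option that I think might help is the --max-depth command line parameter of the du command (a GNU du option), which stops du from reporting on directories below a given depth. This means that only the higher-level subdirectories will have their disk usage stats reported. Keep in mind that du still has to walk the deeper levels in order to add up those totals, so this limits the output rather than the traversal itself; treat it as something to try, not a guaranteed fix.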
Depending on the nature of your data, larger (higher level) subdirectories are less likely to disappear right in the middle of your du scan, and hence the likelihood of fts_read (and, ultimately, fts_build underneath it) not finding something there is much lower.
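As a rough illustration of the option, here is a minimal sketch assuming GNU du (the native Solaris du may not accept --max-depth); the function name du_top_levels is hypothetical.

```shell
# Hypothetical helper: report usage only for directories at most DEPTH
# levels below DIR. Assumes a du that supports the GNU --max-depth option.
# Note: du still descends the whole tree to compute the totals; the option
# only limits which directories get a line in the output.
# Usage: du_top_levels DIR DEPTH
du_top_levels() {
    du -k --max-depth="$2" "$1"
}
```

With the directories from this post, du_top_levels /bigdir 1 would print one line each for /bigdir/storage1, /bigdir/storage2 and /bigdir itself.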
Now, the fts_build function is smart enough to check and double-check the subdirectories when scanning a directory tree, but I believe it can get really upset if it travels down a deep enough subdirectory tree and then finds itself in the middle of nowhere: not only without an immediate subdirectory to scan, but also without a few levels of parent directories.
Here's an example of what I mean:
If we have an unlimited depth for the du command (default behavior), then scanning /bigdir might lead fts_build into a directory like this:
/bigdir/storage1/dataset1/subset1/dir1
Now, if during the scan of files in this directory someone decides to remove /bigdir/storage1/dataset1 altogether, fts_build will lose the files it's working on, will attempt to go back one level (try to chdir back to dir1), fail, will attempt to go one level further up (subset1), and may still fail and eventually abort.
We're going to need the help of a seasoned developer here to confirm this theory of mine, but that's only needed to prove the workaround. The reason for the failure still stays the same: the larger your directory structure is, the more likely it is to be dynamically changing, and the more likely it is that some files and directories won't be there at the time of a scan.
Matt Simmons says
Great detective work.
I've got this problem as well. One of my SAN arrays houses a 600GB directory with 1.2mil files (at last count). It takes a while to run du 😉
I think that the solution to this problem is administrative rather than technical. As you mentioned, it's possible to run du on subdirectories rather than the big directory. Beyond that, it's actually preferable.
By running du on the subdirs you can monitor their growth discretely rather than in a lump. This allows you to work on specific causes.
I've gotten to the point that I only run du on a couple of giant sub directories. I then take their sum and compare it to the results of 'df' and only look into the problem if there's a large disparity.
If you use this method, remember that du and df will return different numbers due to df checking free inodes and du counting the space used by the files it finds. The difference being that if a file is deleted while it is still open, the file disappears from the directory listing but the inodes don't get freed until the process releases them.
Great post. I'll link to it as soon as I'm off the train.
–Matt
Gleb Reys says
Thanks for your comment!
Yes, for the data sets I personally manage, I lean towards identifying the really specific subdirs rather than looking at the du of a parent directory.
Great point about df, it's even more twisted in my case because I'll be looking at the df/inode counts returned by the filer behind the data, and not the client-side df.
Matt Simmons says
Are you using data dedup? I've always wondered how it handled inode mapping…
Gleb Reys says
We don't use much of it, but if I understand the technology correctly (at least in NetApps), inodes are still used for each of the files. When you're removing duplicated data, you get rid of inode and get rid of references to the original data blocks, that's all
vovets says
ZFS, now the default filesystem on Solaris, is said to have a "snapshots" feature. So you can make a snapshot and then run du on the snapshot rather than on the original dataset. That will guarantee that the file structure won't change while du is running.
Andrew Gascoyne says
Hi
I simply ran a quick ls of the directory I had the error on, then ran the du without error.
I think it's similar to not being able to stat a directory as the filesystem doesn't recognise it as being properly mounted.