How is the order in which tar works on files determined?

Solution 1

As @samiam has stated, the list is returned to you in a semi-random order via readdir(). I'll just add the following.

The list is returned in what I would call directory order. On older filesystems this is often the order in which the file entries were added to the directory's table, that is, creation order. There is a caveat: when a directory entry is deleted its slot is recycled, so a file stored later can take the place of the deleted entry, and the order is no longer based solely on creation time.

On modern filesystems where directory data structures are based on a search tree or hash table, the order is practically unpredictable.
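As a rough illustration of directory order versus sorted order, the sketch below creates the same 24 files in a scratch directory and lists them both ways. It assumes GNU ls, where -U means "do not sort"; the temporary directory and the variable names are made up for the example, and the first line of output depends on your filesystem.

```shell
#!/bin/sh
# Sketch: compare raw directory order with sorted order.
# "demo" is a hypothetical scratch directory, not part of the question.
set -eu
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT

# Create the same names as in the question: 1..8 and a..p.
for n in 1 2 3 4 5 6 7 8 a b c d e f g h i j k l m n o p; do
    : > "$demo/$n"
done

# GNU ls -U: list entries unsorted, in the order the filesystem returns them.
raw=$(ls -U "$demo")
# The same names, sorted bytewise for comparison.
sorted=$(printf '%s\n' "$raw" | LC_ALL=C sort)

# Unquoted expansion on purpose: prints each list on a single line.
echo "directory order:" $raw
echo "sorted order:   " $sorted
```

If the two lines differ, the filesystem is not handing the entries back alphabetically.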

Examples

Poking at the files created when you run your touch command reveals the following inodes were assigned.

$ touch dir/{{1..8},{a..p}}
$ stat --printf="%n -- %i\n" dir/*
dir/1 -- 10883235
dir/2 -- 10883236
dir/3 -- 10883242
dir/4 -- 10883243
dir/5 -- 10883244
dir/6 -- 10883245
dir/7 -- 10883246
dir/8 -- 10883247
dir/a -- 10883248
dir/b -- 10883249
dir/c -- 10883250
dir/d -- 10883251
dir/e -- 10883252
dir/f -- 10883253
dir/g -- 10883254
dir/h -- 10883255
dir/i -- 10883256
dir/j -- 10883299
dir/k -- 10883302
dir/l -- 10883303
dir/m -- 10883311
dir/n -- 10883424
dir/o -- 10883426
dir/p -- 10883427

So we can see that the shell's brace expansion hands the filenames to touch in order (digits first, then letters), and the files are assigned increasing inode numbers as they are created. (That, however, does not influence the order in the directory.)

Running your tar command repeatedly suggests that the list does have a stable order, since every run yields the same list. Here I've run it 100 times and compared the runs; they're all identical.

$ for i in {1..100};do tar cJvf file.tar.xz dir/ > run${i};done
$ for i in {1..100};do cmp run1 run${i};done
$ 

If we strategically delete, say, dir/e and then add a new file dir/ee, we can see that the new file takes the place that dir/e previously occupied in the directory's entry table.

$ rm dir/e
$ touch dir/ee

Now let's keep the output from one run of the for loop above, just the first one.

$ mv run1 r1A

Now we re-run the for loop, which runs the tar command another 100 times, and compare the new first run with the saved one:

$ sdiff r1A run1
dir/                                dir/
...
dir/c                               dir/c
dir/f                               dir/f
dir/e                             | dir/ee
dir/o                               dir/o
dir/2                               dir/2
...

We notice that dir/ee has taken dir/e's place in the directory's table.

Solution 2

readdir() basically. When tar finds out what files are in a directory, it directly asks the kernel for a file listing via opendir() followed by readdir(). readdir() does not return the files in any particular order; the way the files are ordered depends on the file system being used by the Linux kernel.
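To see this on your own system, the sketch below puts the order in which tar archives a directory next to the unsorted listing from ls. It assumes GNU ls (-U for unsorted output) and a tar that can write an archive to a pipe; the scratch directory and the file names tar_order and ls_order are invented for the example. The two columns often match, but nothing guarantees it.

```shell
#!/bin/sh
# Sketch: compare tar's archive order with the unsorted ls listing.
# The scratch directory and output file names are hypothetical.
set -eu
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir "$demo/dir"
for n in 1 2 3 4 5 6 7 8 a b c d e f g h i j k l m n o p; do
    : > "$demo/dir/$n"
done
cd "$demo"

# Order in which tar stores the entries: archive to stdout, list from stdin,
# drop the "dir/" line itself and strip the "dir/" prefix from the rest.
tar cf - dir | tar tf - | sed -e '/^dir\/$/d' -e 's|^dir/||' > tar_order

# Order reported by ls without sorting (GNU ls -U).
ls -U dir > ls_order

# Side-by-side view: left column is tar, right column is ls.
paste tar_order ls_order
```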

When this answer was written, tar had no option to sort the files within subdirectories. GNU tar 1.28 and later accept --sort=name, which sorts directory entries by name before archiving them.
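Independently of any sorting option tar itself may offer, an order can be imposed from outside by handing tar a pre-sorted member list. The following is a minimal sketch, assuming a find that supports -print0, a sort that supports -z, and a tar that supports --null, --no-recursion and -T (GNU tools do); the scratch directory and the archive name sorted.tar are invented for the example.

```shell
#!/bin/sh
# Sketch: build an archive whose members are stored in bytewise-sorted order.
# The scratch directory and "sorted.tar" are hypothetical names.
set -eu
demo=$(mktemp -d)
trap 'rm -rf "$demo"' EXIT
mkdir "$demo/dir"
for n in 1 2 3 4 5 6 7 8 a b c d e f g h i j k l m n o p; do
    : > "$demo/dir/$n"
done
cd "$demo"

# find emits every path NUL-terminated, sort -z orders them bytewise,
# and tar reads the list from stdin (-T -) in exactly that order.
# --no-recursion keeps tar from descending into dir/ on its own,
# which would add the entries a second time in directory order.
find dir -print0 | LC_ALL=C sort -z |
    tar --null --no-recursion -T - -cf sorted.tar

# List the members in the order they are stored.
tar tf sorted.tar
```

The NUL-delimited list keeps file names containing spaces or newlines intact.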

Author: John

Updated on September 18, 2022

Comments

  • John
    John over 1 year
    $ touch dir/{{1..8},{a..p}}
    $ tar cJvf file.tar.xz dir/
    dir/
    dir/o
    dir/k
    dir/b
    dir/3
    dir/1
    dir/i
    dir/7
    dir/4
    dir/e
    dir/a
    dir/g
    dir/2
    dir/d
    dir/5
    dir/8
    dir/c
    dir/n
    dir/f
    dir/h
    dir/6
    dir/l
    dir/m
    dir/j
    dir/p
    

    I would have expected it to be alphabetical. But apparently it's not. What's the formula, here?

  • slm
    slm about 10 years
    I was wondering if it retrieves them based on their inode's value?
  • John
    John about 10 years
    Wow, this is really a great answer. Given a directory, is there any way for me to see what the order that tar will process its sub-items in is? I'm not really confident about it, but how does the following look to you? stat --printf='%i\t-- %n\n' * | sort -n | sed 's/.*\t-- //'
  • samiam
    samiam about 10 years
    I think it's filesystem dependent. I can imagine a btree-type filesystem sorting them based on order of file hash or some such (I have a sense the old ReiserFS orders them differently, since that filesystem dynamically creates inodes)
  • John
    John about 10 years
    @samiam Right, so then what would be the way to do it? Does node -e "console.log(require('fs').readdirSync('.').join('\n'))" look ok? In my tests, the command seems to print all of a given working directory's sub-nodes in alphabetical order very consistently, so maybe it isn't the system call being exposed raw?
  • Michał Politowski
    Michał Politowski about 10 years
    @samiam - right, this answer claims that the 'directory order' is 'the creation order that the file entries in the directory's table were added' and then it itself shows fragments of the tar file contents showing that this is not true. Many filesystems, including current Linux ext* filesystems, use trees and/or hashes in their directory structures, not simple sequential tables like some older filesystems.
  • John
    John about 10 years
    @MichałPolitowski whatever the order is, assuming the filesystem is mounted and the directory in question is the current working directory, what's the easiest way to see the raw result of readdir? Would one need to write a custom C program for it?
  • Admin
    Admin about 10 years
    @John ls -f or ls -U or find -maxdepth 1
  • Matt
    Matt about 10 years
    @slm The f_op->iterate call that glibc readdir() eventually filters down to via getdents() is mapped to a filesystem specific implementation. I can't see anything at a higher level that reorders the dirent's the fs implementation returns.
  • Olivier Dulac
    Olivier Dulac about 10 years
    @John : you could use tar itself to see what it would do : tar cf - /path/to/dir | tar tvf - will tar to stdout, and tar tvf will read from stdout (long, but does exactly what it would do if you were taring) (it's also a nice trick to get a date of files not depending on the age of the files, as "ls" would be). Sorting by inode may work, but not always (inode can "wrap around", and start again with lower numbers than some of the higher ones, so you can't use that if your filesystem created more than X files !)
  • John
    John about 10 years
    @WumpusQ.Wumbley That's perfect, thanks. But why does ls have both an -f flag and a -U one?
  • Admin
    Admin about 10 years
    @John the -f flag comes from ancient Unix. Its purpose was to be fast. It disabled sorting, the skipping of dotfiles, and a few other things. The -U flag is a GNU innovation which allows you to disable sorting without any other side effects.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' about 10 years
    @slm No, I've never heard of a filesystem where the inode value would have an influence on the directory order.