Why is size reporting for directories different from that for other files?

Solution 1

I think the reason you're confused is that you don't know what a directory is. To see what one is, let's take a step back and examine how Unix filesystems work.

The Unix filesystem has several separate notions for addressing data on disk:

  • data blocks are groups of blocks on a disk which hold the contents of a file.
  • inodes are special blocks on a filesystem, with a numerical address unique within that filesystem, which contain metadata about a file such as:
    • permissions
    • access / modification times
    • size
    • pointers to the data blocks (could be a list of blocks, extents, etc)
  • filenames are hierarchical locations under the filesystem root that are mapped to inodes.

In other words, a "file" is actually composed of three different things:

  1. a PATH in the filesystem
  2. an inode with metadata
  3. data blocks pointed to by the inode

Most of the time, users imagine a file to be synonymous to "the entity associated with the filename" - it's only when you're dealing with low-level entities or the file/socket API that you think of inodes or data blocks. Directories are one of those low-level entities.

You might think that a directory is a file that contains a bunch of other files. That's only half-correct. A directory is a file that maps filenames to inode numbers. It doesn't "contain" the files themselves; it contains entries that point to them by name. Think of it like a text file that contains entries like this:

  • . - inode 1234
  • .. - inode 200
  • Documents - inode 2008
  • README.txt - inode 2009

The entries above are called directory entries. They are basically mappings from filenames to inode numbers. A directory is a special file that contains directory entries.
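
If you want to see these mappings on a real system, ls can print the inode number next to each directory entry. A minimal sketch (the inode numbers in the comment are made-up placeholders; yours will differ):

# -a: include the . and .. entries; -i: print each entry's inode number
ls -ai .
# hypothetical output:
#  1234 .    200 ..    2008 Documents    2009 README.txt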

That's a simplification, of course, but it explains the basic idea and a lot of other directory weirdness.

  • Why don't directories know their own size?
    • Because they only contain pointers to other stuff, you have to iterate over their contents to find the size
  • Why aren't directories ever empty?
    • Because they contain at least the . and .. entries. Thus, a proper directory will be at least as large as the smallest allocation that can hold those entries. On most filesystems, that smallest allocation is 4096 bytes.
  • Why is it that you need write permission on the parent directory when renaming a file?
    • Because you're not just changing the file, you're changing the directory entry pointing to the file.
  • Why does ls show a weird number of "links" to a directory?
    • a directory is referenced (linked to) by its own . entry, by its parent's entry for it, and by the .. entry of each of its subdirectories.
  • What does a hard link do and how does it differ from a symlink? (see the sketch after this list)
    • a hard link adds a directory entry pointing to the same inode number. Because it points to an inode number, it can only point to files in the same filesystem (inode numbers are local to a filesystem).
    • a symlink adds a new inode which points to a separate filename. Because it refers to a filename, it can point to arbitrary files in the tree.
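
A quick way to see the hard link / symlink difference, assuming a scratch file named target (the filenames here are just examples):

# create a regular file, a hard link to it, and a symlink to it
touch target
ln target hard
ln -s target soft
# -i prints inode numbers: target and hard share one inode,
# while soft gets its own inode that merely stores the name "target"
ls -li target hard soft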

But wait! Weird things are happening!

ls -ld somedirectory typically shows the filesize to be 4096, whereas ls -l somefile shows the actual size of the file. Why?

Point of confusion 1: when we say "size" we can be referring to two things:

  • filesize, which is a number stored in the inode; and
  • allocated size, which is the number of blocks associated with the inode times the size of each block.

In general, these are not the same number. Try running stat on a regular file and you'll see this difference.
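
For example, with GNU stat you can print both numbers side by side (the format string is GNU coreutils syntax; somefile is a placeholder name):

# %s = filesize in bytes, %b = allocated blocks, %B = bytes per reported block
stat --format='%n: size=%s bytes, allocated=%b blocks of %B bytes' somefile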

When a filesystem creates a non-empty file, it usually allocates data blocks eagerly, in groups. This is because files tend to grow and shrink unpredictably. If the filesystem only allocated as many data blocks as needed to represent the file, growing and shrinking would be slower, and fragmentation would be a serious concern. So in practice, filesystems don't have to keep reallocating space for small changes. This means that there may be a lot of space on disk that is "claimed" by files but completely unused.
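
You can see this claimed-but-unused space with GNU du, which can report either the allocated size or the sum of the filesizes (somedir is a placeholder name):

# allocated (block-based) size vs. the sum of the filesizes underneath somedir
du -sh somedir
du -sh --apparent-size somedir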

What does the filesystem do with all this unused space? Nothing. Until it feels like it needs to. If your filesystem optimizer tool - maybe an online optimizer running in the background, maybe part of your fsck, maybe built-in to your filesystem itself - feels like it, it may reassign the data blocks of your files - moving used blocks, freeing unused blocks, etc.

So now we come to the difference between regular files and directories: because directories form the "backbone" of your filesystem, you expect them to be accessed and modified frequently, so they should be optimized, and you don't want them fragmented at all. When a directory is created, its filesize is set to cover all of its allocated data blocks, even when it holds only a few directory entries. This is okay for directories because, unlike regular files, directories are typically limited in size and growth rate.

The 4096 reported for a directory is the "filesize" number stored in the directory's inode, not the number of entries in the directory. It isn't a fixed number - it's the maximum number of bytes that will fit into the blocks allocated to the directory. Typically, that is 512 bytes/block times the 8 blocks allocated to any file with contents - so, incidentally, for directories the filesize and the allocated size are the same. Because it's allocated as a single group, the filesystem optimizer won't move its blocks around.

As the directory grows, more data blocks are assigned to it, and it will also max out those blocks by adjusting the filesize accordingly.

And so ls and stat will show the filesize field of the directory's inode, which is set to the size of the data blocks assigned to it.
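
You can check that the two numbers coincide for a directory by running stat on one (GNU stat syntax; mydir is a placeholder name):

# on a typical ext4 filesystem a small directory prints size=4096, blocks=8
stat --format='%n: size=%s bytes, allocated=%b blocks' mydir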

Solution 2

I think that the initial, empty directory size depends on the filesystem. On the ext3 and ext4 filesystems I have access to, I also get 4096-byte empty directories. On an NFS-mounted NAS of some sort, I get an 80-byte empty directory. I don't have access to a ReiserFS filesystem; the newly-created, empty directory size there would be interesting.

Traditionally, a directory was a file with a bit set in its inode (the on-disk structure describing the file) that indicated it was a directory. That file was filled with variable-length records. Here's what /usr/include/linux/dirent.h says:

struct dirent64 {
    __u64           d_ino;       /* inode number */
    __s64           d_off;       /* offset to the next dirent */
    unsigned short  d_reclen;    /* length of this record */
    unsigned char   d_type;      /* file type */
    char            d_name[256]; /* filename (null-terminated) */
};

You could skip through the directory-file entries by using the d_off values. If an entry got removed (via the unlink() system call, which the rm command uses), the d_off value of the previous entry got increased to account for the missing record. Nothing did any "compacting" of records. It was probably just simplest to report the size as the number of bytes in the disk blocks allocated to the file, rather than try to figure out how many bytes in a directory file account for all of the entries, or just the bytes up to the last entry.

These days, directories have internal formats like B-trees or hash trees. I'm guessing that either it's a big performance win to allocate directories in whole blocks, or there's "blank space" inside them similar to old-school directories, so it's hard to decide what the "real size" in bytes of a directory is, particularly for one that's been in use for a while and has had files added and deleted a lot. It's easier just to show the number of blocks multiplied by the bytes per block.
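
One observable consequence of this "no compacting" behavior: on ext4, a directory that once held many entries usually keeps its enlarged size even after the entries are removed. A sketch, assuming bash or zsh for the brace expansion (behavior varies by filesystem):

# grow a scratch directory, then empty it, and compare the reported sizes
mkdir scratch
touch scratch/file_{1..2000}
ls -ld scratch       # the size has grown well past 4096
rm scratch/file_*
ls -ld scratch       # on ext4 the size typically stays enlarged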

Solution 3

A file may have no blocks allocated to it; the -s flag to ls will show this difference, while a directory will have some minimum number of blocks allocated, hence the default size. (Unless you're on some fancy modern filesystem that throws these notions out the window.) For example:

% mkdir testfoo
% cd testfoo/
% mkdir foodir
% touch foofile
% ln -s foofile foosln
% ls -ld foo*
drwxrwxr-x  2 jmates  jmates  512 Oct  5 19:48 foodir
-rw-rw-r--  1 jmates  jmates    0 Oct  5 19:48 foofile
lrwxrwxr-x  1 jmates  jmates    7 Oct  5 19:48 foosln -> foofile
% ls -lds foo*
8 drwxrwxr-x  2 jmates  jmates  512 Oct  5 19:48 foodir
0 -rw-rw-r--  1 jmates  jmates    0 Oct  5 19:48 foofile
0 lrwxrwxr-x  1 jmates  jmates    7 Oct  5 19:48 foosln -> foofile
% 

Note that the symlink here takes no blocks, despite dedicating seven bytes to the details necessary for readlink(2); how curious! Anyways, let's now pad foofile with a byte or two:

% echo >> foofile a
% ls -lds foo*
8 drwxrwxr-x  2 jmates  jmates  512 Oct  5 19:48 foodir
8 -rw-rw-r--  1 jmates  jmates    2 Oct  5 19:49 foofile
0 lrwxrwxr-x  1 jmates  jmates    7 Oct  5 19:48 foosln -> foofile
%

And one can see that the number of blocks allocated to foofile has jumped to 8, despite it containing only two bytes (the a plus the newline that echo tacks on).

Files can also be sparse, which is another way the reported file size versus actual contents can differ, depending on how the tool interacting with the file handles that sparseness.
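
A quick sparse-file illustration (truncate here is the GNU coreutils tool; the filename is arbitrary):

# create a file whose reported size is 1 MiB but which contains only a hole
truncate -s 1M sparse
ls -lds sparse    # the size column says 1048576, while the blocks column is typically 0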

Also, the size of a directory can increase: create many files with very long names and check what happens to the size of the directory (and to the blocks allocated) after each new long filename is created, using ls -lds .
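
A sketch of that experiment (bash/zsh syntax; the names and the count are arbitrary):

# create many files with long names and watch the directory's size grow
mkdir growme
ls -lds growme    # initial size, e.g. 4096 bytes / 8 blocks on ext4
for i in $(seq 1 500); do
    touch "growme/some_very_long_filename_to_pad_the_directory_entries_$i"
done
ls -lds growme    # both the size and the block count are now larger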

Comments

  • Utku
    Utku over 1 year

    I was wondering why an empty directory occupied 4096 bytes of space and I have seen this question. It is stated that space is allocated in blocks and hence, the size of a new directory is 4096 bytes.

    However, I am pretty sure that allocation for "normal" files is done in blocks as well. At least it is like that in Windows filesystems, and I am guessing that it must be at least similar in ext*.

    Now, as far as I understand, size listing for other types of files, such as regular files, symbolic links, etc., is done in terms of the real size. Because when I create an empty file, I see 0 as the size. When I type a few characters, I see <number of characters> bytes as the size, etc.

    So my question is: although the allocation for other files is done in blocks too, why does the policy for reporting the size of a directory differ from that for a file?

    Clarification

    I thought the question was clear enough, but apparently it wasn't. I will try to clarify the question here.

    1) What I think a directory is:

    I will try to explain what I think a directory is by the following example. After reading, if it is wrong, please notify me.

    Let's say that we have a directory named mydir. And let's say that it contains 3 files, which are: f0, f1 and f2. Let's assume that each file is 1 byte long.

    Now, what is mydir? It is a pointer to an inode which contains the following: String "f0" and the inode number which f0 points to. String "f1" and the inode number which f1 points to. And string "f2" and the inode number which f2 points to. (At least this is what I think a directory is. Please correct me if I am wrong.)

    Now there may be two methods for calculating the size of a directory:

    1) Calculating the size of the inode which mydir points to.

    2) Summing the sizes of the inodes which the contents of mydir point to.

    Although 1 is more counterintuitive, let's assume that it is the method being used. (For this question, which method is actually used does not matter.) Then, the size of mydir is calculated as follows:

    2 + 2 + 2 + 3 * <space_required_to_store_an_inode_number>
    

    2's are because each filename is 2 bytes long.

    2) The question:

    Now the question: assuming that what I think a directory is, is correct, the reported size for mydir should be much, much less than 4096, no matter whether method 1 or method 2 is used to calculate its size.

    Now, you will say that the reason it is reported as 4096 bytes is that the allocation is done in blocks. Hence, the reported size is that big.

    But then I will say: allocation is done in blocks for regular files as well. (See thrig's answer for reference.) But nevertheless, their sizes are reported as their real sizes. (1 byte if they contain 1 character, 2 bytes if they contain 2 characters, etc.)

    So my question is: why is the policy for reporting the sizes of directories so different from that for regular files?

    More clarification:

    We know that the initial number of blocks allocated for a non-empty file and for an empty directory is 8 in both cases. (See thrig's answer.) So even though the allocation is made in the same number of blocks for both regular files and directories, why is the reported size of a directory so much bigger?

  • Utku
    Utku over 8 years
    But then why is it easy to report the real size of a "normal" file but not of a directory?
  • Utku
    Utku over 8 years
    Then is the answer the following: files are reported with their real size because this real size can be made use of, but since such a scenario is not possible for directories, the filesystem (or whatever else, I am not sure what) doesn't go to the trouble of determining and reporting the real size of a directory?
  • Admin
    Admin over 8 years
    @Utku - I think that sounds correct, and is very concise.
  • Utku
    Utku over 8 years
    [1] To see if I understood: The number of blocks allocated for foofile was initially 0 because it was empty. Hence, foofile was not pointing to an inode. But after making the smallest change to foofile, an inode had to be assigned to it and the filesystem allocated the smallest number of allocatable blocks to it. Is that right?
  • Utku
    Utku over 8 years
    [2] Even so, there are still 3 questions: 1) Why does foosln not occupy any blocks? It has been non-empty since the moment it was created, so it feels like it should occupy some blocks upon creation. 2) Why is the smallest number of allocatable blocks 8? (Or is it?) Shouldn't it be 1? 3) And also, although I now know that files too are allocated in blocks, I still don't know why the size of a directory is reported as the total size of the blocks it occupies, whereas the size of a file is reported as its real size.
  • Utku
    Utku over 8 years
    [1] This is an awesome answer, thanks but I already knew these and it doesn't answer my question. Let me clarify: Let's say there is a directory called mydir. And let's say it contains some files such as: f0, f1 and f2. Now, what is mydir? It is a pointer to an inode which contains the following: String "f0" and inode number which it points to. String "f1" and inode number which it points to. String "f2" and inode number which it points to. (At least this is the picture in my mind. It might be wrong) So far so good.
  • Utku
    Utku over 8 years
    [2] Now, we must decide what we mean by the size of a directory. One of the two options is defining it as only the size of the inode mydir points to, not adding the sizes of the inodes which the contents of the directory point to. The other way might be defining it as the sum of the sizes of the inodes which are pointed to by the directory's contents. For simplicity, if we assume that it is calculated w.r.t. the former definition, the size of mydir should be: 2 + 2 + 2 + 3*<size required to store an inode number>. The 2's are because each filename in mydir is two characters long.
  • Utku
    Utku over 8 years
    [3] Thinking this way, the reported size of an empty directory must be way less than 4096 bytes. Now you will say that allocation is done in blocks; hence, the reported size is so large. But then I will say: allocation for regular files is done in blocks as well, but their sizes are reported as their real sizes. My question is this: what is the reason for such different policies in reporting the sizes of a file vs. a directory?
  • madumlao
    madumlao over 8 years
    I would make a distinction between filesize and allocated size. Filesystems may, at their discretion, use different techniques for allocating blocks - in the general case the inode contains a "block list" pointing to the data blocks, some filesystems can store the file data in the inode's block itself, some filesystems can have the inode state a starting / ending block, some may split / allocate blocks between files, etc. In other words, there is no guarantee in the general case that the file "owns" the whole block. The only size "owned" by the file is the actual content (not the inode).
  • madumlao
    madumlao over 8 years
    Directories, however, are special files and may be treated differently by the filesystem with regards to filesize / block allocation. Space allocated on the disk for a directory is probably also exclusively owned by the directory for optimization purposes (to allow a directory's contents to be read/written faster). So I would think of the difference NOT as a reporting difference, but as an allocation difference. The directory "owns" the whole block when created, a regular file, not necessarily so. Which would explain why directory size differs depending on the filesystem.
  • madumlao
    madumlao over 8 years
    (if the above clarifies it, I'll incorporate it into the answer)
  • madumlao
    madumlao over 8 years
    The "inode" is an entity separate from the data blocks. The inode can be thought of as metadata with a pointer to the blocks (usually a list). In the case of a symlink, there are NO blocks - the inode itself contains the filename being pointed to.
  • Utku
    Utku over 8 years
    Yes, it definitely clarified some but not all. For example, in thrig's answer, we see that the number of blocks allocated for the file foofile is 8, as soon as something is written in foofile. This is the same number of allocated blocks for a directory. Now according to what you say, a directory owns each and every byte of these 8 blocks and hence, its size is 4096 (or 512 in thrig's case) bytes. But this is not the case for foofile. Then why is foofile assigned 8 blocks, even if it doesn't own every byte of it?
  • madumlao
    madumlao over 8 years
    Yes. Traditionally, inodes have blocklists, and ls -s shows the size of that list. Symlinks have zero blocks and thus will always report zero unless you use low-level tools to edit their data blocks. I don't think filesystems are necessarily bound by the blocklist metaphor. RAM-based filesystems, or database filesystems, or other FUSE stuff can "cheat" this somehow. And I'm pretty sure reiserfs has "small files support" which can store the data of a small file inside the inode block if it will fit. Don't know how reiser reports ls -s for small files though. Zero? 1?
  • Utku
    Utku over 8 years
    @madumlao Awesome. The only remaining question is: even though a regular file does not "own" the whole 8 blocks when it is created (created, in the sense of becoming a non-empty file), why is it assigned these 8 blocks, in contrast to a directory, which "owns" all of these 8 blocks?
  • Utku
    Utku over 8 years
    By the way, when I check info ls, I see the following for -s option: "Display the number of file system blocks actually used by each file, in units of 512 bytes, where partial units are rounded up to the next integer value. If the output is to a terminal, a total sum for all the file sizes is output on a line before the listing. The environment variable BLOCKSIZE overrides the unit size of 512 bytes."
  • madumlao
    madumlao over 8 years
    @Utku the strategy on how to allocate blocks is the filesystem's prerogative. In practice, most files grow within certain thresholds so the FS can eager-allocate data blocks in intervals. However, when the FS starts to get fragmented/full, it may choose to change strategies on how blocks are assigned, and/or FS tools may reallocate blocks differently. So it helps if the FS knows that your file is only "really" using 200 blocks out of its 400 assigned when the optimizer program needs to run. OTOH, directories should always be optimized, thus always at parity with their allocation.
  • Utku
    Utku over 8 years
    This is a bit ambiguous right? "Display the number of file system blocks actually used by each file, in units of 512 bytes, where partial units are rounded up to the next integer value." What I understand from this explanation is the following: Total number of bytes used by the file is calculated. All these bytes are legit, useful bytes. Then, this number is divided by 512. The result is rounded up to next integer and reported.
  • Utku
    Utku over 8 years
    If that's the case, this is not consistent with reporting 8 for a file of two bytes. So -s must be working differently but I don't understand how it works.
  • Utku
    Utku over 8 years
    [1] @madumlao I see. So I guess the blocks for files are initially eager-allocated (which is actually an FS policy) in an attempt to prevent future fragmentation, but the reported size is the real size and hence it is reported way less.
  • Utku
    Utku over 8 years
    [2] @madumlao In this manner, when the empty space in a filesystem becomes more sparse, the FS may assign some of these blocks to a newly created file. Because otherwise, this new file could not be created.
  • Utku
    Utku over 8 years
    [3] @madumlao But this is not the case for a directory. No matter how full the FS becomes, the FS will not deallocate any blocks of a directory to make room for a new file.
  • Utku
    Utku over 8 years
    [4] @madumlao Is this correct? Did I understand it right?
  • madumlao
    madumlao over 8 years
    The GNU coreutils (8.23) info ls documentation is much clearer as it only talks about "disk allocation", without file size calculations. I suspect that it's just the documentation making up for Unix madness. ls -s should be showing the same info as stat --format=%b does, which has a field for the number of data blocks. Also check stat and you'll see there is a general case difference between allocated blocks and file size.
  • Utku
    Utku over 8 years
    Yes I checked it. It is more clear. Btw what do you mean by "it's just the documentation making up for Unix madness"? :)
  • madumlao
    madumlao over 8 years
    @Utku, that's as best as I can understand it as well.
  • Utku
    Utku over 8 years
    @madumlao I see. I am marking your answer as accepted then. Maybe if I get a more deeper understanding of the FS and the reasoning behind this policy, I will contribute with an answer as well. Thanks for the awesome answers btw :)