What are the performance implications of storing millions of files in a modern file system?


Solution 1

The reason one would create this sort of directory structure is that filesystems must locate a file within a directory, and the larger the directory is, the slower that operation becomes.

How much slower depends on the filesystem design.

The ext4 filesystem (with dir_index enabled) uses a hashed tree, the HTree, to index directory entries. A lookup on this index is expected to take roughly constant time, which most of the time is much faster than the naive linear list that ext3 and previous filesystems used (and when it isn't, the directory is too small for it to really matter).

The XFS filesystem uses a B+tree instead. The advantage of this over a hash table or a binary tree is that any node may have many children b, where in XFS b varies and can be as high as 254 (or 19 for the root node; and these numbers may be out of date). This gives you a time complexity of O(log_b n), a vast improvement over a linear scan or a binary tree.
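
To put rough numbers on that (my arithmetic, not part of the original answer): for the roughly 3M files from the question, a balanced binary tree needs about log_2(3,000,000) ≈ 21.5 comparisons per lookup, while a B+tree with fan-out b = 254 needs only about log_254(3,000,000) ≈ 2.7 node visits, i.e. two or three block reads if nothing is cached.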

Either of these filesystems can handle tens of thousands of files in a single directory, with XFS being significantly faster than ext4 on a directory with the same number of inodes. But you probably don't want a single directory with 3M inodes, as even with a B+tree the lookup can take some time. This is what led to creating directories in this manner in the first place.

As for your proposed structures, the first option you gave is exactly what is shown in nginx examples. It will perform well on either filesystem, though XFS will still have a bit of an advantage. The second option may perform slightly better or slightly worse, but it will probably be pretty close, even on benchmarks.
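
For concreteness, here is a small Java sketch (my illustration, not code from the question or from nginx) of how the two proposed layouts can be derived from a hash of the file name. The class name, the choice of MD5, and the exact character offsets are assumptions made for this example; nginx itself derives its cache file names from the MD5 of the cache key and takes the subdirectory names from that hash according to its levels= parameter.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class CachePathDemo {

        // Hex-encoded MD5 of the file name. MD5 is only an illustrative choice;
        // any reasonably uniform hash spreads files across folders the same way.
        static String hashOf(String name) throws NoSuchAlgorithmException {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(name.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        }

        // Option 1: 1-character first level, 2-character second level
        // (the question's /path/a/bc/abcde1234-filex.for style).
        static Path option1(String base, String name) throws NoSuchAlgorithmException {
            String h = hashOf(name);
            return Paths.get(base, h.substring(0, 1), h.substring(1, 3), h + "-" + name);
        }

        // Option 2: 2-character first level, 2-character second level.
        static Path option2(String base, String name) throws NoSuchAlgorithmException {
            String h = hashOf(name);
            return Paths.get(base, h.substring(0, 2), h.substring(2, 4), h + "-" + name);
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            System.out.println(option1("/path", "filex.for")); // /path/<c>/<cc>/<md5>-filex.for
            System.out.println(option2("/path", "filex.for")); // /path/<cc>/<cc>/<md5>-filex.for
        }
    }

Either way, note the trade-off one of the comments below raises: both layouts cost three directory lookups per file, versus a single lookup in one much larger directory.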

Solution 2

In my experience, one of the scaling factors is the size of the directory inodes, given a hash-name partitioning strategy.

Both of your proposed options create at most three directory entries for each stored file. Also, 732 files per directory still results in a directory that is smaller than the usual 16 KB. To me, this means either option will perform about the same.

I applaud you on your short hash; previous systems I've worked on have taken the sha1sum of the given file and split the directories based on that string, a much harder problem.

Solution 3

Certainly either option will help reduce the number of files in a directory to something that seems reasonable, for XFS or ext4 or whatever file system. It is not obvious which is better; you would have to test to tell.

A benchmark with your application simulating something like the real workload is ideal. Otherwise, come up with something that simulates many small files specifically. Speaking of that, there is an open-source tool called smallfile; its documentation references some other tools.
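
If you want a quick and dirty starting point before reaching for a dedicated tool, a minimal Java sketch of such a many-small-files workload (my own rough stand-in, not the smallfile tool) could look like the one below. The file count, file size, and directory name are assumptions; for meaningful numbers, point it at the layout you actually plan to use and drop the page cache between the create and read phases.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    // Crude many-small-files workload: create, read, then delete COUNT small
    // files in one directory and print the time per phase.
    public class SmallFileBench {
        static final int COUNT = 100_000;             // number of files (assumption)
        static final byte[] PAYLOAD = new byte[4096]; // 4 KiB per file (assumption)

        public static void main(String[] args) throws IOException {
            Path dir = Paths.get(args.length > 0 ? args[0] : "bench-dir");
            Files.createDirectories(dir);

            List<Path> files = new ArrayList<>(COUNT);
            for (int i = 0; i < COUNT; i++) {
                files.add(dir.resolve("file-" + i + ".dat"));
            }

            long t0 = System.nanoTime();
            for (Path f : files) {
                Files.write(f, PAYLOAD);
            }
            report("create", t0);

            // Read back in random order so the directory index, not just the
            // layout of the page cache, gets exercised. For realistic numbers,
            // drop the cache first (e.g. echo 3 > /proc/sys/vm/drop_caches).
            Collections.shuffle(files, new Random(42));
            long t1 = System.nanoTime();
            for (Path f : files) {
                Files.readAllBytes(f);
            }
            report("read", t1);

            long t2 = System.nanoTime();
            for (Path f : files) {
                Files.delete(f);
            }
            report("unlink", t2);
        }

        static void report(String phase, long start) {
            double secs = (System.nanoTime() - start) / 1e9;
            System.out.printf("%-7s %,d files in %.1f s (%.0f files/s)%n",
                    phase, COUNT, secs, COUNT / secs);
        }
    }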

hdparm doing sustained I/O isn't as useful: it won't show the many small I/Os or the giant directories that come with very many files.

Solution 4

One of the issues is the way the folder gets scanned.

Imagine a Java method that runs a scan over the folder.

It would have to allocate a large amount of memory and deallocate it within a short period of time, which is very heavy on the JVM.

The best approach is to arrange the folder structure so that each file sits in a dedicated folder, e.g. year/month/day.

A full scan is then done by running that method once per folder, so the JVM exits the method, frees the memory, and runs it again on the next folder. A sketch of this idea follows below.

This is just one example, but in any case having such a huge folder makes no sense.
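
Here is a minimal Java sketch of that per-folder idea (my illustration; the /data/archive root, the year/month/day depth, and the process() body are placeholders, not anything from the answer). The point is that each Files.newDirectoryStream call streams one bounded directory listing instead of materialising millions of entries at once, the way a single File.listFiles() on a huge folder would:

    import java.io.IOException;
    import java.nio.file.DirectoryStream;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.List;

    public class PerFolderScan {

        public static void main(String[] args) throws IOException {
            // Placeholder root of a year/month/day layout, e.g. /data/archive/2022/09/18/...
            Path root = Paths.get(args.length > 0 ? args[0] : "/data/archive");
            for (Path year : subDirs(root)) {
                for (Path month : subDirs(year)) {
                    for (Path day : subDirs(month)) {
                        scanLeaf(day); // one bounded listing per call
                    }
                }
            }
        }

        // The few subdirectories of dir, collected into a small list.
        static List<Path> subDirs(Path dir) throws IOException {
            List<Path> dirs = new ArrayList<>();
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir, Files::isDirectory)) {
                for (Path d : ds) {
                    dirs.add(d);
                }
            }
            return dirs;
        }

        // Entries of one leaf directory, streamed rather than loaded as one
        // giant array.
        static void scanLeaf(Path dir) throws IOException {
            try (DirectoryStream<Path> ds = Files.newDirectoryStream(dir)) {
                for (Path file : ds) {
                    process(file);
                }
            }
        }

        static void process(Path file) {
            // placeholder for the real per-file work
        }
    }

As one of the comments below notes, the question never actually mentions Java or full scans, so treat this purely as an illustration of keeping each directory listing small.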

Solution 5

I've been having the same issue, trying to store millions of files on an Ubuntu server in ext4. I ended up running my own benchmarks and found that a flat directory performs way better while being much simpler to use:

[benchmark results chart]

I wrote an article about it.
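
The answer doesn't say exactly what was measured (a point raised in the comments below), so here is a hedged sketch of the kind of comparison I would run myself: create the same number of small files in a flat directory and in a two-level fan-out, then time random reads against each. The file count, file size, naming scheme, and fan-out characters are all assumptions for illustration, and the page cache should be dropped between the create and read phases for cold-lookup numbers.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Random;

    public class FlatVsNested {
        static final int COUNT = 200_000;             // number of files (assumption)
        static final byte[] PAYLOAD = new byte[1024]; // 1 KiB per file (assumption)

        public static void main(String[] args) throws IOException {
            List<Path> flat = new ArrayList<>();
            List<Path> nested = new ArrayList<>();
            for (int i = 0; i < COUNT; i++) {
                String name = String.format("%08x.dat", i);
                flat.add(Paths.get("flat", name));
                // Two-level fan-out taken from the trailing hex characters
                // (the leading ones barely vary for sequential ids).
                nested.add(Paths.get("nested", name.substring(4, 6), name.substring(6, 8), name));
            }
            create("flat layout", flat);
            create("nested layout", nested);
            readRandomly("flat layout", flat);
            readRandomly("nested layout", nested);
        }

        static void create(String label, List<Path> files) throws IOException {
            long t = System.nanoTime();
            for (Path f : files) {
                Files.createDirectories(f.getParent());
                Files.write(f, PAYLOAD);
            }
            System.out.printf("create %-14s %.1f s%n", label, (System.nanoTime() - t) / 1e9);
        }

        static void readRandomly(String label, List<Path> files) throws IOException {
            List<Path> shuffled = new ArrayList<>(files);
            Collections.shuffle(shuffled, new Random(1));
            long t = System.nanoTime();
            for (Path f : shuffled) {
                Files.readAllBytes(f);
            }
            System.out.printf("read   %-14s %.1f s%n", label, (System.nanoTime() - t) / 1e9);
        }
    }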


Comments

  • Leandro Moreira
    Leandro Moreira over 1 year

    Let's say we're using ext4 (with dir_index enabled) to host around 3M files (with an average of 750KB size) and we need to decide what folder scheme we're going to use.

    In the first solution, we apply a hash function to the file and use a two-level folder scheme (1 character for the first level and 2 characters for the second level): so if the hash of filex.for equals abcde1234, we'll store it at /path/a/bc/abcde1234-filex.for.

    In the second solution, we apply a hash function to the file and use a two-level folder scheme (2 characters for the first level and 2 characters for the second level): so if the hash of filex.for equals abcde1234, we'll store it at /path/ab/de/abcde1234-filex.for.

    For the first solution we'll have the following scheme /path/[16 folders]/[256 folders] with an average of 732 files per folder (the last folder, where the file will reside).

    While on the second solution we'll have /path/[256 folders]/[256 folders] with an average of 45 files per folder.

    Considering we're going to write/unlink/read files (but mostly read) from this scheme a lot (basically the nginx caching system), does it matter, in a performance sense, if we choose one or the other solution?

    Also, what are the tools we could use to check/test this setup?

    • ewwhite
      ewwhite over 7 years
      Obviously benchmarking will help. But ext4 may be the wrong filesystem for this. I'd be looking at XFS.
    • Michael Hampton
      Michael Hampton over 7 years
      I wouldn't just look at XFS, I'd immediately use it without further ado. The B+tree beats the hash table every time.
    • Leandro Moreira
      Leandro Moreira over 7 years
      Thanks for the tips; benchmarking is a little bit hard, though. I tried hdparm -Tt /dev/hdX but it might not be the most appropriate tool.
    • HBruijn
      HBruijn over 7 years
      No, hdparm is not the right tool: it checks the raw performance of the block device, not the file system.
  • user207421
    user207421 over 7 years
    You're assuming Java and scanning the folder. Neither is mentioned in the question, and there are other ways to process the folder in Java besides scanning it.
  • Hagen von Eitzen
    Hagen von Eitzen over 7 years
    Strictly speaking, $O(\log_b n)$ for fixed $b$ is the same complexity as $O(\log n)$. But to the OP, the actual constants would matter.
  • Abhi Beckert
    Abhi Beckert over 5 years
    Unless there's something wrong with my filesystem, ext4 cannot handle 10,000 files in a single directory. Doing a simple ls -l takes a full minute if the directory has dropped off the inode cache. And when it is cached, it still takes over a second. This is with an SSD and a Xeon with tons of RAM on a fairly low traffic web server.
  • Michael Hampton
    Michael Hampton over 5 years
    @AbhiBeckert Was it upgraded from ext3? If so, try making a new directory and move the files to it.
  • Abhi Beckert
    Abhi Beckert over 5 years
    @Hampton No. it’s a (fairly) recently setup server on modern hardware. I’ve been working on the issue with our sysadmin/data centre for a couple months. We’re paying thousands of dollars per month to lease the server and not getting acceptable performance out of it. It’s looking like the only option is to move to a new directory structure - perhaps using hashes instead of dates for filenames to spread it out more evenly.
  • Michael Hampton
    Michael Hampton over 5 years
    @AbhiBeckert I can only suggest that you switch to XFS and adopt some sort of directory structure as described above.
  • Abhi Beckert
    Abhi Beckert over 5 years
    @MichaelHampton it looks like the issue I'm having is due to Ext's directory database not completely being pruned after a child is deleted (unless you run a slow operation that requires unmounting the disk). The directory in question doesn't have many files in it now, but it may have historically. I haven't been able to find a good description of the issue, only vague references to it unfortunately.
  • Michael Hampton
    Michael Hampton over 5 years
    That is definitely not the expected result. Before you go with this or recommend it, you should look deeper into why you got this unexpected result.
  • w00t
    w00t over 4 years
    So to look up a file by name, in either option the fs will have to perform 3 lookups, 2 of those on very shallow trees and one in a deep tree. If all files are in one directory, the fs has to perform only 1 lookup, in a slightly deeper tree. I don't think it's clear-cut which is faster.
  • Íhor Mé
    Íhor Mé over 3 years
    Yeah, there's definitely a bug somewhere here, this is simply impossible. Also, the "read" for "deep directory structure" is so obviously unrealistic.
  • Mikko Rantalainen
    Mikko Rantalainen about 3 years
    Also note that the filesystem may have totally different performance for huge amount of files in a single directory vs a huge amount of (sub)directories in a single directory. This is because directory contains technically a hardlink from entry ".." to parent directory which files don't need. As a result, if you have huge amount of subdirectories, the total amount of hardlinks to directory is count of subdirectories + 1. That said, we use similar fan-out setup and we use 256 directories per level. Seems to work just fine for over 10 million files.
  • Mikko Rantalainen
    Mikko Rantalainen about 3 years
    I agree. We need more information what was exactly benchmarked here.
  • Lothar
    Lothar over 2 years
    I do find this behaviour pretty reasonable for a B-tree algorithm: it always performs badly on small data sets and very well on millions. Remember that when you access another directory, you always pay two extra disk accesses (the inode plus the top-level B-tree block). If you go down into a 2-level hierarchy it would be much slower.