How to work around the Linux subdirectory number limit?

Solution 1

That limit is per-directory, not for the whole filesystem, so you could work around it by further subdividing things. For instance, instead of having all the user subdirectories in the same directory, split them by the first two characters of the name, so you have something like:

top_level_dir
|---aa
|   |---aardvark1
|   |---aardvark2
|---da
|   |---dan
|   |---david
|---do
    |---don

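For illustration, here is a minimal sketch in Python of that first-two-characters scheme; the function name and base directory are made up, and it assumes usernames are at least two characters long:

    import os

    def prefix_dir(base_dir, username):
        # Bucket by the first two characters of the username,
        # e.g. top_level_dir/aa/aardvark1
        bucket = username[:2].lower()
        path = os.path.join(base_dir, bucket, username)
        os.makedirs(path, exist_ok=True)
        return path

    # prefix_dir("top_level_dir", "aardvark1") -> "top_level_dir/aa/aardvark1"
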
Even better would be to create some form of hash of the names and use that for the division. That way you'll get a better spread amongst the directories instead of, as with the initial-letters example, "da" being very full and "zz" completely empty. For instance, if you take the CRC or MD5 of the name and use the first 8 bits, you'll get something like:

top_level_dir
|---00
|   |---some_username
|   |---some_username
|---01
|   |---some_username
...
|---FF
|   |---some_username

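A sketch of the hashed variant, assuming MD5 is acceptable purely as a bucketing hash (it does not need to be cryptographically strong here); the names are illustrative:

    import hashlib
    import os

    def hashed_dir(base_dir, username):
        # First 8 bits of the MD5 = first two hex digits,
        # giving 256 roughly even buckets: 00/ .. ff/
        digest = hashlib.md5(username.encode("utf-8")).hexdigest()
        bucket = digest[:2]
        path = os.path.join(base_dir, bucket, username)
        os.makedirs(path, exist_ok=True)
        return path
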
This can be extended to further depths as needed, for instance like so if using the username rather than a hash value:

top_level_dir
|---a
|   |---a
|       |---aardvark1
|       |---aardvark2
|---d
    |---a
    |   |---dan
    |   |---david
    |---o
        |---don

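The same idea at arbitrary depth, sketched with a hypothetical depth parameter (depth=2 maps "aardvark1" to a/a/aardvark1 as in the tree above):

    import os

    def nested_dir(base_dir, username, depth=2):
        # One directory level per leading character of the username
        levels = [username[i] for i in range(min(depth, len(username)))]
        path = os.path.join(base_dir, *levels, username)
        os.makedirs(path, exist_ok=True)
        return path
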
This method is used in many places, such as Squid's cache (to copy Ludwig's example) and the local caches of web browsers.

One important thing to note is that with ext2/3 you will start to hit performance issues before you get close to the 32,000 limit anyway, as directories are searched linearly. Moving to another filesystem (ext4 or ReiserFS, for instance) will remove this inefficiency (ReiserFS searches directories with a binary-split algorithm, so long directories are handled much more efficiently; ext4 may too) as well as the fixed per-directory limit.

Solution 2

If you are bound to ext2/ext3, the only possibility I see is to partition your data. Find a criterion that splits your data into manageable chunks of similar size.

If it's only about the profile images, I'd:

  1. Use a hash (e.g. SHA-1) of the image
  2. Use the hash as the file and directory name

For example, the Squid cache does it this way:

f/4b/353ac7303854033

The top-level directory is the first hex digit, the second level is the next two hex digits, and the file name is the remaining hex digits.
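
A sketch of that layout in Python, assuming the image bytes themselves are hashed with SHA-1; the split (one hex digit, then two, then the rest) follows the example above:

    import hashlib
    import os

    def squid_style_path(base_dir, image_bytes):
        # e.g. f/4b/353ac7303854033...
        digest = hashlib.sha1(image_bytes).hexdigest()
        directory = os.path.join(base_dir, digest[0], digest[1:3])
        os.makedirs(directory, exist_ok=True)
        return os.path.join(directory, digest[3:])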

Solution 3

Can't we have a better solution?

You do have a better solution: use a different filesystem. There are plenty available, many of which are optimised for different tasks. As you pointed out, ReiserFS is optimised for handling lots of files in a directory.

See here for a comparison of filesystems.

Just be glad you're not stuck with NTFS, which is truly abysmal for lots of files in a directory. I'd recommend JFS as a replacement if you don't fancy using the relatively new (but apparently stable) ext4 FS.

Solution 4

Is the profile image small? What about putting it in the database with the rest of the profile data? This might not be the best option for you, but it's worth considering...

Here is an (older) Microsoft whitepaper on the topic: To BLOB or not to BLOB.
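
If you want to try this, here is a minimal sketch using SQLite from Python's standard library as a stand-in for whatever database already holds the profile data; the table and column names are invented for illustration:

    import sqlite3

    conn = sqlite3.connect("profiles.db")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS profiles (user_id INTEGER PRIMARY KEY, image BLOB)"
    )

    def save_profile_image(user_id, image_bytes):
        # Store the image inline as a BLOB next to the rest of the profile row
        conn.execute(
            "INSERT OR REPLACE INTO profiles (user_id, image) VALUES (?, ?)",
            (user_id, image_bytes),
        )
        conn.commit()

    def load_profile_image(user_id):
        row = conn.execute(
            "SELECT image FROM profiles WHERE user_id = ?", (user_id,)
        ).fetchone()
        return row[0] if row else None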

Solution 5

Generally you want to avoid directories containing a large number of files/directories. The primary reason is that wildcard expansion on the command line will result in "Too many arguments" errors, which makes working with these directories painful.

Go for a solution that makes a deeper but narrower tree, e.g. by creating subfolders like others have described.
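
As a side note, if you do end up with a huge directory, iterating it one entry at a time (rather than expanding a wildcard into one giant argument list) avoids the "Too many arguments" problem; a Python sketch, with the suffix filter purely as an example:

    import os

    def remove_matching(directory, suffix=".tmp"):
        # Stream entries instead of building one huge list,
        # the equivalent of avoiding `rm directory/*.tmp`
        with os.scandir(directory) as entries:
            for entry in entries:
                if entry.is_file() and entry.name.endswith(suffix):
                    os.remove(entry.path)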

Comments

  • None-da
    None-da over 1 year

    I have a website which will store user profile images. Each image is stored in a directory (Linux) specific to the user. Currently I have a customer base of 30+, which means I will have 30+ folders. But my current Linux box (ext2/ext3) doesn't support creating more than 32000 directories. How do I get past this? Even YouTube guys have got the same problem, with video thumbnails. But they solved it by moving to ReiserFS. Can't we have a better solution?

    Update: When I asked on IRC, people suggested upgrading to ext4, which has a 64k limit (and of course you can get past that too), or hacking the kernel to change the limit.

    Update: How about splitting the user base into folders based on the userid range? Meaning 1-1000 in one folder, 1000-2000 in the next, and so on. This seems simple. What do you say, guys?

    Frankly, isn't there any other way?

    • Manuel Faux
      Manuel Faux almost 15 years
      Why don't you want to change the filesystem? If this is a limitation of ext2/3, you won't have any other choice than changing the filesystem or splitting the current FS into several smaller FSs (more different mount points).
    • Kyle Brandt
      Kyle Brandt almost 15 years
      Manuel: If he changes the file system, he is tying a specific FS to his application. Although that might end up being the answer, I would say this is probably a problem that needs to be solved at the application level. If you need to hack the kernel or file system, you are probably going down the wrong path unless you have some very special requirements.
  • None-da
    None-da almost 15 years
    Just updated the question description to include this: "Update: How about splitting the user base into folders based on the userid range? Meaning 1-1000 in one folder, 1000-2000 in the next, and so on. This seems simple. What do you say?"
  • Axel
    Axel almost 15 years
    That would work well, and would be more efficient than a hash, if the users are generally identified by user ID instead of (or as well as) username. Though if you always refer to them by name elsewhere in the system you'll have to add extra name->id lookups all over the place.
  • sleske
    sleske almost 15 years
    You could probably use a simpler hash like CRC, as the hash does not need to be cryptographically strong like MD5 or SHA... but the performance difference is probably negligible anyway...
  • None-da
    None-da almost 15 years
    Thank you, David! I also tried a different solution: I created just 4 folders, with the ranges 1-30000, 30000-60000, etc. I think getting a file from such a big directory will take more time than from a directory which has 1000 files (the previous approach). What do you say?
  • user1089802
    user1089802 almost 15 years
    Do you have good links to the NTFS filesystem performance?
  • Axel
    Axel almost 15 years
    That depends on the filesystem. If you are using ext2 or ext3 then I would recommend much smaller than 30,000 per directory. Some tools issue warnings about 10,000. You can turn directory indexing on in ext3/4 to help: tune2fs -O dir_index /dev/<volumename> but just keeping the number of objects in a directory lower (a couple of thousand or less?) is what I'd recommend here.
  • Decebal
    Decebal almost 15 years
    Yes, apart from personal experience with an app that was left too long creating new files in a directory (it took hours to delete them all), there's the Subversion performance boost from limiting the number of files in a directory to 1000. Or read: support.microsoft.com/kb/130694 I don't think they ever "fixed" this, as it is still noted as a perf tweak for NTFS.
  • Avery Payne
    Avery Payne almost 15 years
    @Maddy, you want this solution due to other limitations on how Ext2/3 handles large numbers of files. See serverfault.com/questions/43133/… for some detail. Breaking out names into buckets-as-subdirectories alleviates other issues that you would have run into eventually. Note that this is the same strategy that Squid uses when it sets up the object cache for the first time - for instance, 64 directories each with 64 directories inside of them, just as an example.