Millions of (small) text files in a folder


Solution 1

The ls command, and likewise the shell's TAB-completion and wildcard expansion, normally present their results in alphanumeric order. This requires reading the entire directory listing and sorting it, and with ten million files in a single directory that sorting operation will take a non-negligible amount of time.

If you can resist the urge to use TAB-completion and, for example, write the names of the files to be zipped out in full, there should be no problems.

Another problem with wildcards is that expansion may produce more filenames than will fit on a maximum-length command line. The typical maximum command-line length is more than adequate for most situations, but when we're talking about millions of files in a single directory it is no longer a safe assumption. When the maximum command-line length is exceeded in wildcard expansion, most shells will simply fail the entire command line without executing it.

This can be solved by doing your wildcard operations using the find command:

find <directory> -name '<wildcard expression>' -exec <command> {} \+

or a similar syntax whenever possible. The find ... -exec ... \+ will automatically take into account the maximum command line length, and will execute the command as many times as required while fitting the maximal amount of filenames to each command line.
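
For example, to zip every .txt file in a directory without ever building one oversized command line (a sketch only: the directory name, pattern and archive name are illustrative; zip's default behaviour of adding to an existing archive is what makes the repeated invocations safe):

# find passes as many filenames as fit to each zip invocation
find text_files/ -type f -name '*.txt' -exec zip -q archive.zip {} +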

Solution 2

This is perilously close to an opinion-based question/answer but I'll try to provide some facts with my opinions.

  1. If you have a very large number of files in a folder, any shell-based operation that tries to enumerate them (e.g. mv * /somewhere/else) may fail to expand the wildcard successfully, or the result may be too large to use.
  2. ls will take longer to enumerate a very large number of files than a small number of files.
  3. The filesystem will be able to handle millions of files in a single directory, but people will probably struggle.

One recommendation is to split the filename into two, three or four character chunks and use those as subdirectories. For example, somefilename.txt might be stored as som/efi/somefilename.txt. If you are using numeric names then split from right to left instead of left to right so that there is a more even distribution. For example 12345.txt might be stored as 345/12/12345.txt.

You can use the equivalent of zip -j zipfile.zip path1/file1 path2/file2 ... to avoid including the intermediate subdirectory paths in the ZIP file.
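
A minimal sketch of both ideas in bash, assuming the right-to-left split for numeric names described above (the paths, filenames and archive name are illustrative):

# shard 12345.txt into 345/12/12345.txt, then zip it without the subdirectory path
f=12345.txt
base=${f%.txt}                  # 12345
d1=${base: -3}                  # last three characters: 345
d2=${base:0:${#base}-3}         # the remainder: 12
mkdir -p "/mnt/files/$d1/$d2"
mv "$f" "/mnt/files/$d1/$d2/$f"
zip -j -q zipfile.zip "/mnt/files/$d1/$d2/$f"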

If you are serving up these files from a webserver (I'm not entirely sure whether that's relevant) it is trivial to hide this structure in favour of a virtual directory with rewrite rules in Apache2. I would assume the same is true for Nginx.

Solution 3

I run a website which handles a database for movies, TV and video games. For each of these there are multiple images, with TV containing dozens of images per show (e.g. episode snapshots).

There ends up being a lot of image files. Somewhere in the 250,000+ range. These are all stored in a mounted block storage device where access time is reasonable.

My first attempt at storing the images was in a single folder as /mnt/images/UUID.jpg

I ran into the following challenges.

  • ls via a remote terminal would just hang. The process would go zombie and CTRL+C would not break it.
  • Before reaching that point, any ls command would quickly fill the output buffer and CTRL+C would not stop the endless scrolling.
  • Zipping 250,000 files from a single folder took about 2 hours. You must run the zip command detached from the terminal, otherwise any interruption in the connection means you have to start over again (see the sketch after this list).
  • I wouldn't risk trying to use the zip file on Windows.
  • The folder quickly became a no-humans-allowed zone.
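
One way to keep such a long-running zip detached from the terminal (a sketch only; tmux or screen would do just as well, and the paths are illustrative):

# nohup survives a dropped connection; zip -@ reads the filenames from stdin,
# so the 250,000 names never have to fit on a single command line
nohup sh -c 'find /mnt/images -name "*.jpg" | zip -q /mnt/archive.zip -@' \
    > /tmp/zip.log 2>&1 &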

I ended up having to store the files in subfolders using the creation time to create the path. Such as /mnt/images/YYYY/MM/DD/UUID.jpg. This resolved all the above problems, and allowed me to create zip files that targeted a date.
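
A sketch of how a file might be filed under such a date path (hypothetical names; date -r here is the GNU coreutils form that reads the file's modification time, and uuidgen is assumed to be available):

# derive YYYY/MM/DD from the file's timestamp and move it into place
f=snapshot.jpg
ts=$(date -r "$f" +%Y/%m/%d)        # e.g. 2017/12/16
mkdir -p "/mnt/images/$ts"
mv "$f" "/mnt/images/$ts/$(uuidgen).jpg"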

If the only identifier you have for a file is a number, and those numbers tend to run in sequence, why not group them by 100,000, 10,000 and 1,000?

For example, if you have a file named 384295.txt the path would be:

/mnt/file/300000/80000/4000/295.txt

If you know you'll reach a few million files, use 0 prefixes for the 1,000,000 level:

/mnt/file/000000/300000/80000/4000/295.txt
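
The bucketed path can be computed with plain shell arithmetic (a sketch, assuming six-digit ids as in the example above):

# bucket a numeric id into 100000 / 10000 / 1000 groups
id=384295
d1=$(( id / 100000 * 100000 ))          # 300000
d2=$(( id % 100000 / 10000 * 10000 ))   # 80000
d3=$(( id % 10000 / 1000 * 1000 ))      # 4000
echo "/mnt/file/$d1/$d2/$d3/$(( id % 1000 )).txt"
# -> /mnt/file/300000/80000/4000/295.txt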

Solution 4

Firstly: prevent 'ls' from sorting with 'ls -U'; maybe update your ~/.bashrc to have 'alias ls="ls -U"' or similar.

For your large fileset, you can try this out like this:

  • create a set of test files

  • see if many filenames cause issues

  • use xargs parameter-batching and zip's (default) behaviour of adding files to an existing zip to avoid problems.

This worked well:

# create ~ 100k files
seq 1 99999 | sed "s/\(.*\)/a_somewhat_long_filename_as_a_prefix_to_exercise_zip_parameter_processing_\1.txt/" | xargs touch
# see if zip can handle such a list of names
zip -q /tmp/bar.zip ./*
    bash: /usr/bin/zip: Argument list too long
# use xargs to batch sets of filenames to zip
find . -type f | xargs zip -q /tmp/foo.zip
l /tmp/foo.zip
    28692 -rw-r--r-- 1 jmullee jmullee 29377592 2017-12-16 20:12 /tmp/foo.zip
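
If filenames may contain spaces or newlines, a null-delimited variant is safer (assuming GNU find and xargs):

# -print0 / -0 keep unusual filenames intact while still batching arguments
find . -type f -print0 | xargs -0 zip -q /tmp/foo.zip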

Solution 5

Write text file from web scrape (shouldn't be affected by number of files in folder).

Creating a new file requires scanning the directory file for enough empty space to hold the new directory entry. If no sufficiently large space is found, the new entry is placed at the end of the directory file. As the number of files in a directory increases, so does the time needed to scan it.

As long as the directory files remain in the system cache, the performance hit from this won't be bad, but if that data is evicted from the cache, reading the (usually highly fragmented) directory file back from disk can consume quite a bit of time. An SSD improves this, but for a directory with millions of files there could still be a noticeable performance hit.

Zip selected files, given by list of filenames.

This is also likely to require additional time in a directory with millions of files. In a file-system with hashed directory entries (like EXT4), this difference is minimal.
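
For the zipping step itself, zip can read the list of selected names from standard input via -@, so the selection never has to fit on a command line (the list file name is illustrative):

# filelist.txt holds one selected filename per line
zip -q selection.zip -@ < filelist.txt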

Will storing up to ten million files in a folder affect the performance of the above operations, or general system performance, any differently than making a tree of subfolders for the files to live in?

A tree of subfolders has none of the above performance drawbacks. In addition, if the underlying file-system is changed to one without hashed directory entries, the tree methodology will still work well.

Comments

  • user1717828 almost 2 years

    We would like to store millions of text files in a Linux filesystem, with the purpose of being able to zip up and serve an arbitrary collection as a service. We've tried other solutions, like a key/value database, but our requirements for concurrency and parallelism make using the native filesystem the best choice.

    The most straightforward way is to store all files in a folder:

    $ ls text_files/
    1.txt
    2.txt
    3.txt
    

    which should be possible on an EXT4 file system, which has no limit on the number of files in a folder.

    The two FS processes will be:

    1. Write text file from web scrape (shouldn't be affected by number of files in folder).
    2. Zip selected files, given by list of filenames.

    My question is, will storing up to ten million files in a folder affect the performance of the above operations, or general system performance, any differently than making a tree of subfolders for the files to live in?

    • Mark Plotnick over 6 years
      Related: How to fix intermittent “No space left on device” errors during mv when device has plenty of space. Using dir_index, which is often enabled by default, will speed up lookups but may limit the number of files per directory.
    • JoshuaD over 6 years
      Why not try it quickly on a virtual machine and see what it's like? With bash it's trivial to populate a folder with a million text files with random characters inside. I feel like you'll get really useful information that way, in addition to what you'll learn here.
    • Peter Cordes over 6 years
      @JoshuaD: If you populate it all at once on a fresh FS, you're likely to have all the inodes contiguous on disk, so ls -l or anything else that stats every inode in the directory (e.g. bash globbing / tab completion) will be artificially faster than after some wear and tear (delete some files, write some new ones). ext4 might do better with this than XFS, because XFS dynamically allocates space for inodes vs. data, so you can end up with inodes more scattered, I think. (But that's a pure guess based on very little detailed knowledge; I've barely used ext4). Go with abc/def/subdirs.
    • JoshuaD over 6 years
      Yea, I don't think the test I suggested will be able to tell the OP "this will work", but it could definitely quickly tell him "this will not work", which is useful.
    • Andrew Henle over 6 years
      but our requirements for concurrency and parallelism make using the native filesystem the best choice What did you try? Offhand, I'd think even a lower-end RDBMS such as MySQL and a Java servlet creating the zip files on the fly with ZipOutputStream would beat just about any free Linux native filesystem - I doubt you want to pay for IBM's GPFS. The loop to process a JDBC result set and make that zip stream is probably merely 6-8 lines of Java code.
  • roaima over 6 years
    @StéphaneChazelas yes ok, my choice of words might have been better, but the net effect for the user is much the same. I'll see if I can alter the words slightly without getting bogged down in complexity.
  • dimm over 6 years
    Modern filesystems use B, B+ or similar trees to keep directory entries. en.wikipedia.org/wiki/HTree
  • telcoM over 6 years
    Yes... but if the shell or the ls command won't get to know that the directory listing is already sorted, they are going to take the time to run the sorting algorithm anyway. And besides, the userspace may be using a localized sorting order (LC_COLLATE) that may be different from what the filesystem might do internally.
  • roaima over 6 years
    @Octopus the OP states that the zip file will contain "selected files, given by list of filenames".
  • Andrew Henle over 6 years
    I'd recommend using zip -j - ... and piping the output stream directly to the client's network connection instead of zip -j zipfile.zip .... Writing an actual zipfile to disk means the data path is read from disk->compress->write to disk->read from disk->send to client. That can as much as triple your disk IO requirements compared to read from disk->compress->send to client.
  • roaima over 6 years
    @AndrewHenle you're absolutely right. However, I wasn't particularly suggesting that someone should shell out to zip in a web application. I was merely pointing out one possible solution to the implied question "how do I hide this multi-directory structure from my users?"
  • chrishollinworth over 3 years
    Opinion: don't put millions of files into the same directory. It's rarely necessary and breaks things in ways you don't expect. If you are numbering the files consecutively, then the best solution (assuming they need to be in a file system and not object storage) is this: take the number, reverse it (as a string), split it into path components to create a directory hierarchy, and append the name, e.g. 123456.txt becomes 65/34/21/123456.txt (thousands of files per directory is OK, so you could use 3 digits rather than 2 as in the example).