Is there a faster way than this to find all the files in a directory and all subdirectories?


Solution 1

Try this iterator block version that avoids recursion and the Info objects:

public static IEnumerable<string> GetFileList(string fileSearchPattern, string rootFolderPath)
{
    // Breadth-first traversal: the queue holds directories still to be scanned.
    Queue<string> pending = new Queue<string>();
    pending.Enqueue(rootFolderPath);
    string[] tmp;
    while (pending.Count > 0)
    {
        rootFolderPath = pending.Dequeue();
        try
        {
            tmp = Directory.GetFiles(rootFolderPath, fileSearchPattern);
        }
        catch (UnauthorizedAccessException)
        {
            // Skip directories we aren't allowed to read.
            continue;
        }
        for (int i = 0; i < tmp.Length; i++)
        {
            yield return tmp[i];
        }
        // Queue this directory's subdirectories for later passes.
        tmp = Directory.GetDirectories(rootFolderPath);
        for (int i = 0; i < tmp.Length; i++)
        {
            pending.Enqueue(tmp[i]);
        }
    }
}
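
Since it's an iterator block, results stream back lazily, so you can start processing matches before the walk completes. For example (the share path below is just a placeholder):

foreach (string path in GetFileList("*.xml", @"\\server\share\projects"))
{
    Console.WriteLine(path); // e.g. report progress or compare against an existing list
}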

Note also that .NET 4.0 has built-in iterator block versions (EnumerateFiles, EnumerateFileSystemEntries) that may be faster (more direct access to the file system; fewer arrays).
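
A minimal sketch of that 4.0 approach (the root path and pattern are placeholders; note from the comments below that these methods still throw UnauthorizedAccessException mid-enumeration if they hit a protected directory):

// requires .NET 4.0; using System; using System.Collections.Generic; using System.IO;
IEnumerable<string> paths = Directory.EnumerateFiles(
    @"C:\projects",              // placeholder root
    "*.xml",                     // placeholder pattern
    SearchOption.AllDirectories);

foreach (string path in paths)
    Console.WriteLine(path);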

Solution 2

Cool question.

I played around a little and, by leveraging iterator blocks and LINQ, I appear to have improved your revised implementation by about 40%.

I would be interested to have you test it out using your timing methods and on your network to see what the difference looks like.

Here is the meat of it:

private static IEnumerable<FileInfo> GetFileList(string searchPattern, string rootFolderPath)
{
    var rootDir = new DirectoryInfo(rootFolderPath);

    // Include the root itself so files directly under rootFolderPath aren't missed.
    var dirList = new[] { rootDir }.Concat(rootDir.GetDirectories("*", SearchOption.AllDirectories));

    // Flatten the per-directory file arrays into one lazy sequence.
    return ReturnFiles(dirList, searchPattern).SelectMany(files => files);
}

private static IEnumerable<FileInfo[]> ReturnFiles(IEnumerable<DirectoryInfo> dirList, string fileSearchPattern)
{
    foreach (DirectoryInfo dir in dirList)
    {
        yield return dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
    }
}

Solution 3

The short answer of how to improve the performance of that code is: you can't.

The real performance hit you're experiencing is the latency of the disk or network, so no matter which way you flip it, you have to check and iterate through each file item and retrieve directory and file listings. (That is, of course, excluding hardware or driver modifications to reduce or improve disk latency, but a lot of people are already paid a lot of money to solve those problems, so we'll ignore that side of it for now.)

Given the original constraints, there are several solutions already posted that more or less elegantly wrap the iteration process and reduce the number of objects created. (Since I assume we are reading from a single hard drive, parallelism will NOT help traverse a directory tree more quickly; it may even increase the time, because you now have two or more threads fighting over data on different parts of the drive as it seeks back and forth.) However, if we evaluate how the function will be consumed by the end developer, there are some optimizations and generalizations we can come up with.

First, we can delay execution by returning an IEnumerable; yield return accomplishes this by compiling a state-machine enumerator inside a compiler-generated class that implements IEnumerable and is returned when the method is called. Most LINQ methods are likewise written to delay execution until iteration is performed, so the code in a Select or SelectMany will not run until the IEnumerable is iterated. The benefit of delayed execution is only felt if you take a subset of the data later: for instance, if you only need the first 10 results, delaying the execution of a query that would return several thousand results means you never iterate through the rest.
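
To make that concrete, here is a small illustration of the deferral using the BetterFileList method defined further down (the path is a placeholder):

// requires: using System.Linq;
// Calling the method scans only the top-level directory eagerly; deeper
// directories are scanned as iteration demands more results.
IEnumerable<FileInfo> query = BetterFileList("*.xml", new DirectoryInfo(@"C:\data"), int.MaxValue);

// Taking 10 files never walks the rest of the tree.
foreach (FileInfo fi in query.Take(10))
    Console.WriteLine(fi.FullName);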

Now, given that you want to do a subfolder search, I can also infer that it may be useful to be able to specify that depth; doing so generalizes my problem, but it also necessitates a recursive solution. Then, later, when someone decides the search needs to go two directories deep because we increased the number of files and added another layer of categorization, you can simply make a slight modification instead of rewriting the function.

In light of all that, here is the solution I came up with; it is more general than some of the others above:

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, string rootFolderPath)
{
    // Depth 1 == the root folder plus its immediate subdirectories.
    return BetterFileList(fileSearchPattern, new DirectoryInfo(rootFolderPath), 1);
}

public static IEnumerable<FileInfo> BetterFileList(string fileSearchPattern, DirectoryInfo directory, int depth)
{
    // At depth 0, return only this directory's files; otherwise concatenate them
    // with the (lazily evaluated) results from each subdirectory, one level shallower.
    return depth == 0
        ? directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly)
        : directory.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly).Concat(
            directory.GetDirectories().SelectMany(x => BetterFileList(fileSearchPattern, x, depth - 1)));
}
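
Usage then comes down to picking a depth (a sketch; the paths are placeholders):

// Only files directly in the folder:
IEnumerable<FileInfo> topOnly = BetterFileList("*.xml", new DirectoryInfo(@"C:\data"), 0);

// The folder plus one level of subfolders (what the convenience overload does):
IEnumerable<FileInfo> oneDeep = BetterFileList("*.xml", @"C:\data");

// Effectively unbounded, matching SearchOption.AllDirectories behavior:
IEnumerable<FileInfo> all = BetterFileList("*.xml", new DirectoryInfo(@"C:\data"), int.MaxValue);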

On a side note, something else that hasn't been mentioned so far is file permissions and security. Currently there is no checking, handling, or requesting of permissions, and the code will throw UnauthorizedAccessException the moment it encounters a directory it isn't allowed to iterate through.
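
One way to harden it, sketched under the assumption that silently skipping inaccessible directories is acceptable (the Safe* helper names are mine, not part of the answer):

private static FileInfo[] SafeGetFiles(DirectoryInfo directory, string pattern)
{
    try
    {
        return directory.GetFiles(pattern, SearchOption.TopDirectoryOnly);
    }
    catch (UnauthorizedAccessException)
    {
        return new FileInfo[0]; // no access: treat the directory as empty
    }
}

private static DirectoryInfo[] SafeGetDirectories(DirectoryInfo directory)
{
    try
    {
        return directory.GetDirectories();
    }
    catch (UnauthorizedAccessException)
    {
        return new DirectoryInfo[0];
    }
}

Substituting these for the direct GetFiles/GetDirectories calls in BetterFileList is enough; since BetterFileList is not an iterator block, the usual yield-inside-try restriction doesn't apply here.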

Solution 4

This takes 30 seconds to get 2 million file names that meet the filter. The reason it is so fast is that I perform only one enumeration; each additional enumeration hurts performance. The variable length is open to your interpretation and not necessarily related to the enumeration example (I use it to strip a portion of the path from each result).

// 'files' and 'length' were left undeclared in the original; one plausible
// setup is below, where 'length' strips the root-path prefix from each result.
List<string> files = null;
int length = path.Length;

if (Directory.Exists(path))
{
    files = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
                     .Where(s => s.EndsWith(".xml") || s.EndsWith(".csv"))
                     .Select(s => s.Remove(0, length)) // Remove the Dir info.
                     .ToList();
}
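
If you need to match many extensions, one option (a sketch, not part of the original answer) is to key a set on Path.GetExtension instead of chaining EndsWith calls; as a bonus this makes the match case-insensitive, which the EndsWith version above is not:

// requires: using System.IO; using System.Linq;
var wanted = new HashSet<string>(StringComparer.OrdinalIgnoreCase) { ".xml", ".csv" };

var matches = Directory.EnumerateFiles(path, "*.*", SearchOption.AllDirectories)
                       .Where(s => wanted.Contains(Path.GetExtension(s)));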

Solution 5

I recently (2020) discovered this post because of a need to count files and directories across slow connections, and this was the fastest implementation I could come up with. The .NET enumeration methods (GetFiles(), GetDirectories()) perform a lot of under-the-hood work that slows them down tremendously by comparison.

This solution does not return FileInfo objects, but could be modified to do so—or potentially only return the relevant data needed from a custom FileInfo object.

This solution utilizes the Win32 API and .NET's Parallel.ForEach() to leverage the threadpool to maximize performance.

P/Invoke:

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findfirstfilew
/// </summary>
// CharSet.Unicode binds the ...W entry points from the linked docs and
// marshals file names correctly.
[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
public static extern IntPtr FindFirstFile(
    string lpFileName,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findnextfilew
/// </summary>
[DllImport("kernel32.dll", CharSet = CharSet.Unicode, SetLastError = true)]
public static extern bool FindNextFile(
    IntPtr hFindFile,
    ref WIN32_FIND_DATA lpFindFileData
    );

/// <summary>
/// https://docs.microsoft.com/en-us/windows/win32/api/fileapi/nf-fileapi-findclose
/// </summary>
[DllImport("kernel32.dll", SetLastError = true)]
public static extern bool FindClose(
    IntPtr hFindFile
    );
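
The WIN32_FIND_DATA struct referenced by these signatures isn't shown in the answer; a minimal definition consistent with the calls above (layout per the FindFirstFileW documentation) would be:

[StructLayout(LayoutKind.Sequential, CharSet = CharSet.Unicode)]
public struct WIN32_FIND_DATA
{
    public FileAttributes dwFileAttributes;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftCreationTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastAccessTime;
    public System.Runtime.InteropServices.ComTypes.FILETIME ftLastWriteTime;
    public uint nFileSizeHigh;
    public uint nFileSizeLow;
    public uint dwReserved0;
    public uint dwReserved1;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 260)] // MAX_PATH
    public string cFileName;
    [MarshalAs(UnmanagedType.ByValTStr, SizeConst = 14)]
    public string cAlternateFileName;
}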

Method:

public static Tuple<long, long> CountFilesDirectories(
    string path,
    CancellationToken token
    )
{
    if (String.IsNullOrWhiteSpace(path))
        throw new ArgumentNullException("path", "The provided path is NULL or empty.");

    // If the provided path doesn't end in a backslash, append one.
    if (path.Last() != '\\')
        path += '\\';

    IntPtr hFile = IntPtr.Zero;
    Win32.Kernel32.WIN32_FIND_DATA fd = new Win32.Kernel32.WIN32_FIND_DATA();

    long files = 0;
    long dirs = 0;

    try
    {
        hFile = Win32.Kernel32.FindFirstFile(
            path + "*", // Discover all files/folders by ending a directory with "*", e.g. "X:\*".
            ref fd
            );

        // If we encounter an error, or there are no files/directories, we return no entries.
        if (hFile.ToInt64() == -1)
            return Tuple.Create<long, long>(0, 0);

        //
        // Find (and count) each file/directory, then iterate through each directory in parallel to maximize performance.
        //

        List<string> directories = new List<string>();

        do
        {
            // If a directory (and not a Reparse Point), and the name is not "." or ".." which exist as concepts in the file system,
            // count the directory and add it to a list so we can iterate over it in parallel later on to maximize performance.
            if ((fd.dwFileAttributes & FileAttributes.Directory) != 0 &&
                (fd.dwFileAttributes & FileAttributes.ReparsePoint) == 0 &&
                fd.cFileName != "." && fd.cFileName != "..")
            {
                directories.Add(System.IO.Path.Combine(path, fd.cFileName));
                dirs++;
            }
            // Otherwise, count it as a file. (This tests the Archive attribute, so a
            // file that does not have that attribute set will not be counted.)
            else if ((fd.dwFileAttributes & FileAttributes.Archive) != 0)
            {
                files++;
            }
        }
        while (Win32.Kernel32.FindNextFile(hFile, ref fd));

        // Iterate over each discovered directory in parallel to maximize file/directory counting performance,
        // calling itself recursively to traverse each directory completely.
        Parallel.ForEach(
            directories,
            new ParallelOptions()
            {
                CancellationToken = token
            },
            directory =>
            {
                var count = CountFilesDirectories(
                    directory,
                    token
                    );

                lock (directories)
                {
                    files += count.Item1;
                    dirs += count.Item2;
                }
            });
    }
    catch (Exception)
    {
        // Handle as desired.
    }
    finally
    {
        // Close only a real search handle: neither IntPtr.Zero (never opened)
        // nor INVALID_HANDLE_VALUE (-1) may be passed to FindClose.
        if (hFile != IntPtr.Zero && hFile.ToInt64() != -1)
            Win32.Kernel32.FindClose(hFile);
    }

    return Tuple.Create<long, long>(files, dirs);
}
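
A possible call site (the share path is a placeholder):

var counts = CountFilesDirectories(@"\\server\share", CancellationToken.None);
Console.WriteLine("Files: " + counts.Item1 + ", Directories: " + counts.Item2);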

On my local system the performance of GetFiles()/GetDirectories() can be close to this, but across slower connections (VPNs, etc.) I found that this is tremendously faster: 45 minutes vs. 90 seconds to enumerate a remote directory of ~40k files, ~40 GB in size.

This can also fairly easily be modified to include other data, like the total file size of all files counted, or rapidly recursing through and deleting empty directories, starting at the furthest branch.
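
For instance, the total-size variant needs only the two 32-bit size fields already present in WIN32_FIND_DATA. A sketch of the change, placed in the file-counting branch of the do/while loop (totalBytes would be a new long counter alongside files):

// The 64-bit file size is split across two 32-bit fields.
long fileSize = ((long)fd.nFileSizeHigh << 32) | fd.nFileSizeLow;
totalBytes += fileSize;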


Comments

  • Eric Anastas
    Eric Anastas almost 2 years

    I'm writing a program that needs to search a directory and all its subdirectories for files that have a certain extension. This is going to be used both on a local and a network drive, so performance is a bit of an issue.

    Here's the recursive method I'm using now:

    private void GetFileList(string fileSearchPattern, string rootFolderPath, List<FileInfo> files)
    {
        DirectoryInfo di = new DirectoryInfo(rootFolderPath);
    
        FileInfo[] fiArr = di.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly);
        files.AddRange(fiArr);
    
        DirectoryInfo[] diArr = di.GetDirectories();
    
        foreach (DirectoryInfo info in diArr)
        {
            GetFileList(fileSearchPattern, info.FullName, files);
        }
    }
    

    I could set the SearchOption to AllDirectories and not use a recursive method, but in the future I'll want to insert some code to notify the user what folder is currently being scanned.

    While I'm creating a list of FileInfo objects now all I really care about is the paths to the files. I'll have an existing list of files, which I want to compare to the new list of files to see what files were added or deleted. Is there any faster way to generate this list of file paths? Is there anything that I can do to optimize this file search around querying for the files on a shared network drive?
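
    For the compare step, set operations over the path strings are one straightforward approach (a sketch, not from the original question; oldPaths and newPaths stand for the previous and current listings):

        // requires: using System.Linq;
        var oldSet = new HashSet<string>(oldPaths, StringComparer.OrdinalIgnoreCase);
        var newSet = new HashSet<string>(newPaths, StringComparer.OrdinalIgnoreCase);

        var added   = newPaths.Where(p => !oldSet.Contains(p)); // in new but not old
        var deleted = oldPaths.Where(p => !newSet.Contains(p)); // in old but not new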


    Update 1

    I tried creating a non-recursive method that does the same thing by first finding all the sub directories and then iteratively scanning each directory for files. Here's the method:

    public static List<FileInfo> GetFileList(string fileSearchPattern, string rootFolderPath)
    {
        DirectoryInfo rootDir = new DirectoryInfo(rootFolderPath);
    
        List<DirectoryInfo> dirList = new List<DirectoryInfo>(rootDir.GetDirectories("*", SearchOption.AllDirectories));
        dirList.Add(rootDir);
    
        List<FileInfo> fileList = new List<FileInfo>();
    
        foreach (DirectoryInfo dir in dirList)
        {
            fileList.AddRange(dir.GetFiles(fileSearchPattern, SearchOption.TopDirectoryOnly));
        }
    
        return fileList;
    }
    

    Update 2

    Alright so I've run some tests on a local and a remote folder both of which have a lot of files (~1200). Here are the methods I've run the tests on. The results are below.

    • GetFileListA(): Non-recursive solution in the update above. I think it's equivalent to Jay's solution.
    • GetFileListB(): Recursive method from the original question
    • GetFileListC(): Gets all the directories with the static Directory.GetDirectories() method. Then gets all the file paths with the static Directory.GetFiles() method. Populates and returns a List
    • GetFileListD(): Marc Gravell's solution using a queue, which returns an IEnumerable. I populated a List with the resulting IEnumerable
    • DirectoryInfo.GetFiles: No additional method created. Instantiated a DirectoryInfo from the root folder path. Called GetFiles using SearchOption.AllDirectories
    • Directory.GetFiles: No additional method created. Called the static GetFiles method of the Directory class using SearchOption.AllDirectories
    Method                       Local Folder       Remote Folder
    GetFileListA()               00:00.0781235      05:22.9000502
    GetFileListB()               00:00.0624988      03:43.5425829
    GetFileListC()               00:00.0624988      05:19.7282361
    GetFileListD()               00:00.0468741      03:38.1208120
    DirectoryInfo.GetFiles       00:00.0468741      03:45.4644210
    Directory.GetFiles           00:00.0312494      03:48.0737459
    

    ...so it looks like Marc's is the fastest.

  • Eric Anastas
    Eric Anastas over 14 years
    This yield/return stuff is new to me. If I save the results of your method to a IEnumerable<string> variable, and run multiple ForEach iterations over the variable, will each ForEach run off some cache of all the values or will each ForEach re-run GetFileList()?
  • Jay
    Jay over 14 years
    Eric, don't forget that when testing, you have to traverse the IEnumerable<FileInfo> collection to get this code to actually execute. You should at least echo the filename to console to compare the two approaches, otherwise you've got an apples and oranges comparison when the iterator method returns so fast but execution is deferred.
  • Eric Anastas
    Eric Anastas over 14 years
    What about just populating a List<FileInfo>? List<FileInfo> myList = new List<FileInfo>(GetFileList(some stuff));
  • Marc Gravell
    Marc Gravell over 14 years
    @Eric - It will re-run it, but in that scenario buffer it, with either .ToList() (LINQ), or new List<string>(GetFileList(...))
  • Brad Cunningham
    Brad Cunningham over 14 years
    Jay is correct, of course you need to traverse the collection to see the performance hit. I purposely omitted how I was echoing to the console to calculate the timing assuming you may have been doing something specific or calculating timing more precisely. This is what I did (the call to .Count will cause the traversal): var start = DateTime.Now; IEnumerable<FileInfo> fileList = GetFileList("*.txt", @"C:\Temp"); var end = DateTime.Now; Console.WriteLine(String.Format("Files Found: {0}", fileList.Count())); Console.WriteLine(String.Format("Time: {0}", end.Subtract(start).TotalMilliseconds));
  • Brad Cunningham
    Brad Cunningham over 14 years
    Ran out of chars to answer your comment Eric. But yes, putting the resulting IEnumerable into a List<T> will cause a full traversal too
  • Jaider
    Jaider over 11 years
    I think yield delays or postpones the execution until later, for instance when you use ToList() or print everything out. It is only useful if you are looking for the first result or something near the front, so you can skip the rest of the execution.
  • v.oddou
    v.oddou almost 9 years
    I tried the EnumerateFiles technique, and it is unusable because of exceptions (unauthorized access) thrown in the middle of enumeration. Some people on Stack Overflow offer wrappers around it, though, if you search for that problem.
  • Fred Smith
    Fred Smith over 8 years
    How do you manage an UnauthorizedAccessException, knowing that you cannot use yield return inside a try/catch?
  • Fred Smith
    Fred Smith over 8 years
    You can try Process.Start("cmd.exe", "/c dir c:\\ /s > allfiles.txt") and then parse allfiles.txt
  • LeopardSkinPillBoxHat
    LeopardSkinPillBoxHat about 7 years
    @MarcGravell - thanks for this code. I took the liberty of modifying it to more robustly handle an UnauthorizedAccessException when accessing certain directories. If you're unhappy with my change, please feel free to modify it or roll back.
  • Malcolm Salvador
    Malcolm Salvador about 7 years
    "Length" does not exist in the current context. What should it contain?
  • Kentonbmax
    Kentonbmax about 7 years
    it is simply for removing a portion of the path and is open to interpretation. The example used here is a precursor to comparing 2 different lists of files, one that is local and one in cloud.
  • Pure.Krome
    Pure.Krome almost 7 years
    Use a ConcurrentBag<T> or ConcurrentDictionary<T,U> instead of manually locking, etc.
  • Ken
    Ken over 5 years
    I would like to know how to use a list of file extensions in the Where clause, e.g. s.EndsWith(myListOfExtensions), without adding them individually, hard-coded. Also, how would this handle an UnauthorizedAccessException?
  • joedotnot
    joedotnot about 3 years
    Probably missing a try/catch around Directory.GetDirectories() also, nonetheless great starter code.
  • rollsch
    rollsch over 2 years
    I like this version. Did you experiment with the number of threads to test performance, @Hazaaa?
  • Hazaaa
    Hazaaa over 2 years
    @rolls mmm, not really. I needed this for my small project so I tried to finish it as fast as I could. You can try it and report back results :)