What's the fastest way to recursively search for files in Python?
Maybe not the answer you were hoping for, but I think these timings are useful. Run on a directory with 15,424 directories totalling 102,799 files (of which 3059 are .py files).
Python 3.6:
import os
import glob

def walk():
    pys = []
    for p, d, f in os.walk('.'):
        for file in f:
            if file.endswith('.py'):
                pys.append(file)
    return pys

def iglob():
    pys = []
    for file in glob.iglob('**/*', recursive=True):
        if file.endswith('.py'):
            pys.append(file)
    return pys

def iglob2():
    pys = []
    for file in glob.iglob('**/*.py', recursive=True):
        pys.append(file)
    return pys

# I also tried pathlib.Path.glob, but it was slow and error-prone, sadly
%timeit walk()
3.95 s ± 13 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit iglob()
5.01 s ± 19.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit iglob2()
4.36 s ± 34 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
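For comparison, the same search can be written directly on top of os.scandir, the lower-level API that os.walk itself has been built on since Python 3.5 (PEP 471). This is a minimal sketch, not part of the original benchmark, so treat its relative speed as something to measure on your own tree:

```python
import os

def scandir_walk(root, suffix='.py'):
    """Recursively collect paths of files ending in `suffix`,
    using an explicit stack over os.scandir instead of os.walk."""
    matches = []
    stack = [root]
    while stack:
        path = stack.pop()
        try:
            with os.scandir(path) as it:
                for entry in it:
                    if entry.is_dir(follow_symlinks=False):
                        stack.append(entry.path)
                    elif entry.name.endswith(suffix):
                        matches.append(entry.path)
        except PermissionError:
            # skip directories we are not allowed to read
            continue
    return matches
```

The main reason scandir-based code can be fast on Windows is that DirEntry.is_dir() usually answers from the data the directory listing already returned, avoiding an extra stat call per entry.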
Using GNU find (4.6.0) on Cygwin (4.6.0-1).
Edit: the timing below is on Windows; on Linux I found find to be about 25% faster.
$ time find . -name '*.py' > /dev/null
real 0m8.827s
user 0m1.482s
sys 0m7.284s
Seems like os.walk is as good as you can get on Windows.
Comments
- Noise in the street (almost 2 years ago):
I need to generate a list of files with paths that contain a certain string by recursively searching. I'm doing this currently like this:
for i in iglob(starting_directory + '/**/*', recursive=True):
    if filemask in i.split('\\')[-1]:  # ignore directories that contain the filemask
        filelist.append(i)
This works, but when crawling a large directory tree, it's woefully slow (~10 minutes). We're on Windows, so an external call to the Unix find command isn't an option. My understanding is that glob is faster than os.walk.
Is there a faster way of doing this?
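Given the timings in the answer above (where os.walk beat both glob variants), one thing worth trying is rewriting the filemask search on os.walk instead of iglob. The function below is an illustrative sketch of that rewrite, not tested against the commenter's 10-minute tree; the name find_with_mask is made up for the example:

```python
import os

def find_with_mask(starting_directory, filemask):
    """Collect full paths of files whose basename contains `filemask`,
    walking the tree with os.walk (directories are skipped automatically,
    since os.walk yields them separately from filenames)."""
    filelist = []
    for dirpath, dirnames, filenames in os.walk(starting_directory):
        for name in filenames:
            if filemask in name:
                filelist.append(os.path.join(dirpath, name))
    return filelist
```

A side benefit over splitting on '\\' by hand: os.walk hands back filenames with no directory component, so the basename check needs no path parsing and works on any OS.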