Quicker to os.walk or glob?
Solution 1
I ran a benchmark on a small cache of web pages spread across 1000 directories. The task was to count the total number of files in those directories. The output is:
os.listdir: 0.7268s, 1326786 files found
os.walk: 3.6592s, 1326787 files found
glob.glob: 2.0133s, 1326786 files found
As you can see, os.listdir is the quickest of the three, and glob.glob is still quicker than os.walk for this task.
The source:
import os, time, glob

n, t = 0, time.time()
for i in range(1000):
    n += len(os.listdir("./%d" % i))
t = time.time() - t
print("os.listdir: %.4fs, %d files found" % (t, n))

n, t = 0, time.time()
for root, dirs, files in os.walk("./"):
    for file in files:
        n += 1
t = time.time() - t
print("os.walk: %.4fs, %d files found" % (t, n))

n, t = 0, time.time()
for i in range(1000):
    n += len(glob.glob("./%d/*" % i))
t = time.time() - t
print("glob.glob: %.4fs, %d files found" % (t, n))
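On Python 3.5+ there is a further option: os.scandir, which os.walk now uses internally, can reuse file-type information from the directory listing instead of issuing a stat() call per entry. A minimal sketch of the same counting task with it (count_files is a hypothetical helper, not part of the benchmark above):

```python
import os

def count_files(root):
    """Count regular files directly inside root using os.scandir."""
    # DirEntry.is_file() can reuse type info from the directory listing,
    # avoiding a separate os.stat() call per entry on most platforms.
    return sum(1 for entry in os.scandir(root) if entry.is_file())
```

With the benchmark's layout this would be summed over the 1000 directories, e.g. sum(count_files("./%d" % i) for i in range(1000)).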
Solution 2
Don't waste your time on optimization before measuring/profiling. Focus on making your code simple and easy to maintain.
For example, your code precompiles the RE, which gives you no speed boost, because the re module keeps an internal cache (re._cache) of precompiled REs.
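That cache is easy to observe: passing the pattern string to re.search on every call gives the same matches as a precompiled pattern, because the module compiles it once and reuses it. A minimal sketch (note that re._cache is a CPython implementation detail and holds a bounded number of patterns):

```python
import re

names = ["core.123", "notes.txt", "core.9"]

# Passing the pattern string on each call: compiled once, then served
# from the re module's internal cache of compiled patterns.
hits_str = [n for n in names if re.search(r"core\.\d*", n)]

# Explicit precompilation: same result, no meaningful speed-up here.
pat = re.compile(r"core\.\d*")
hits_pre = [n for n in names if pat.search(n)]
```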
- Keep it simple
- If it's slow, then profile
- Once you know exactly what needs to be optimized, make the tweaks and always document them
Note that an optimization done several years earlier can make code run slower than the "non-optimized" version. This applies especially to modern JIT-based languages.
Solution 3
You can use os.walk and still use glob-style matching.
import fnmatch
import os

for root, dirs, files in os.walk(DIRECTORY):
    for file in files:
        if fnmatch.fnmatch(file, PATTERN):
            print(file)
I'm not sure about speed, but since os.walk is recursive, the two do different things.
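Since Python 3.5, glob can also recurse on its own via the "**" wildcard, which can replace the os.walk plus fnmatch combination. A sketch with a placeholder path and pattern, not a drop-in for the code above:

```python
import glob
import os

# recursive=True lets "**" match any depth of nested directories;
# iglob yields paths lazily instead of building the whole list up front.
for path in glob.iglob(os.path.join("/path/to/dir", "**", "core.*"), recursive=True):
    print(path)
```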
joedborg
Updated on July 09, 2022

Comments
- joedborg (almost 2 years):
I'm messing around with file lookups in Python on a large hard disk. I've been looking at os.walk and glob. I usually use os.walk as I find it much neater and it seems to be quicker (for usual-size directories).
Has anyone got experience with both and could say which is more efficient? As I say, glob seems to be slower, but you can use wildcards etc., whereas with walk you have to filter the results. Here is an example of looking up core dumps.
import os
import re

core = re.compile(r"core\.\d*")
for root, dirs, files in os.walk("/path/to/dir/"):
    for file in files:
        if core.search(file):
            path = os.path.join(root, file)
            print("Deleting: " + path)
            os.remove(path)
Or
import os
from glob import iglob

for file in iglob("/path/to/dir/core.*"):
    print("Deleting: " + file)
    os.remove(file)
- kgadek (almost 10 years): -1. OP mentioned a "large disk". Also, the code is obviously simple already. Moreover, OP seems to be at the stage of optimizing. It's a plague on SO to dismiss performance questions with something like "premature optimization is the root of blabla" (which is actually a misquotation of Knuth).
- Jules G.M. (over 8 years): -1. Optimization is important in the real (professional) world, where things are often at a very large scale. Don't just blindly diss optimization without any rational reason.
- Michał Šrajer (over 8 years): Premature optimization IS stupid. It almost always makes code harder to maintain and sometimes even makes it perform worse. I don't say this is the case, but it may be.
- CMCDragonkai (about 6 years): Isn't os.walk lazy (a generator) while glob will create a large list in memory?
- episodeyang (almost 6 years): This does not run through the file tree recursively.
- ghukill (about 5 years): glob.iglob will return a generator; Python 2: docs.python.org/2/library/glob.html#glob.iglob, Python 3: docs.python.org/3/library/glob.html#glob.iglob
- SmallChess (over 3 years): Made no sense here. Nonsense. Optimization here is of course important.
- suvigyavijay (almost 3 years): This is fixed for os.walk in Python 3.5+, as mentioned here: docs.python.org/3/library/os.html#os.walk ("This function now calls os.scandir() instead of os.listdir(), making it faster by reducing the number of calls to os.stat().")