How to run sed on over 10 million files in a directory?


Solution 1

Give this a try:

find -name '*.txt' -print0 | xargs -0 -I {} -P 0 sed -i -e 's/blah/blee/g' {}

Because of -I {}, xargs feeds only one filename to each invocation of sed, which solves the "too many args for sed" problem. The -P option allows multiple sed processes to run at the same time. If 0 doesn't work (it's supposed to run as many as possible), try other numbers (10? 100? the number of cores you have?) to limit the count.
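
If the cost of starting sed once per file turns out to matter, the same pipeline can hand each sed a batch of filenames instead. This is only a sketch, assuming GNU find/xargs and GNU sed, with the batch size and process count picked arbitrarily:

find -name '*.txt' -print0 | xargs -0 -n 1000 -P 4 sed -i -e 's/blah/blee/g'

Dropping -I {} lets xargs append up to 1000 filenames to each sed command line, so sed is started thousands of times rather than 10 million times.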

Solution 2

I've tested this method (and all the others) on 10 million (empty) files, named "hello 00000001" to "hello 10000000" (14 bytes per name).

UPDATE: I've now included a quad-core run of the 'find | xargs' method (still without 'sed'; just echo >/dev/null).

# Step 1. Build an array for 10 million files
#   * RAM usage approx:  1.5 GiB 
#   * Elapsed Time:  2 min 29 sec 
  names=( hello\ * )

# Step 2. Process the array.
#   * Elapsed Time:  7 min 43 sec
  for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done  
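
For the real run, the same loop would invoke sed instead of echo. A minimal sketch (not part of the timings above; assumes GNU sed for -i):

# Step 3 (sketch). Run the actual substitution over the array built in Step 1.
  for (( ix=0, cnt=${#names[@]} ; ix<cnt; ix++ )) ; do sed -i -e 's/blah/blee/g' "${names[ix]}" ; done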

Here is a summary of how the provided answers fared when run against the test data mentioned above. These results involve only the basic overheads; i.e. 'sed' was not called. The sed step will almost certainly be the most time-consuming, but I thought it would be interesting to see how the bare methods compare.

Dennis's 'find | xargs' method, using a single core, took 4 hours 21 minutes longer than the bash array method on a no-sed run... However, the multi-core advantage offered by xargs -P should outweigh the time differences shown once sed is actually called to process the files...

           | Time    | RAM GiB | Per loop action(s). / The command line. / Notes
-----------+---------+---------+----------------------------------------------------- 
Dennis     | 271 min | 1.7 GiB | * echo FILENAME >/dev/null
Williamson   cores: 1x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} echo >/dev/null {}
                               | Note: I'm very surprised at how long this took to run the 10 million file gauntlet.
                               |       It started processing almost immediately (because of xargs, I suppose),
                               |       but it runs significantly slower than the only other working answer
                               |       (again, probably because of xargs). However, if the multi-core feature works,
                               |       and I would think that it does, it could make up the deficit in a 'sed' run.
           |  76 min | 1.7 GiB | * echo FILENAME >/dev/null
             cores: 4x2.66 GHz | $ time find -name 'hello *' -print0 | xargs -0 -I {} -P 0 echo >/dev/null {}
                               |  
-----------+---------+---------+----------------------------------------------------- 
fred.bear  | 10m 12s | 1.5 GiB | * echo FILENAME >/dev/null
                               | $ time names=( hello\ * ) ; time for (( ix=0, cnt=${#names[@]} ; ix<$cnt; ix++ )) ; do echo "${names[ix]}" >/dev/null ; done
-----------+---------+---------+----------------------------------------------------- 
l0b0       | ?@#!!#  | 1.7 GiB | * echo FILENAME >/dev/null 
                               | $ time  while IFS= read -rd $'\0' path ; do echo "$path" >/dev/null ; done < <( find "$HOME/junkd" -type f -print0 )
                               | Note: It started processing filenames after 7 minutes.. at this point it  
                               |       started lots of disk thrashing.  'find' was using a lot of memory, 
                               |       but in its basic form, there was no obvious advantage... 
                               |       I pulled the plug after 20 minutes.. (my poor disk drive :(
-----------+---------+---------+----------------------------------------------------- 
intuited   | ?@#!!#  |         | * print line (to see when it actually starts processing, but it never got there!)
                               | $ ls -f hello * | xargs python -c '
                               |   import fileinput
                               |   for line in fileinput.input(inplace=True):
                               |       print line ' 
                               | Note: It failed at 11 min and approx 0.9 GiB
                               |       ERROR message: bash: /bin/ls: Argument list too long  
-----------+---------+---------+----------------------------------------------------- 
Reuben L.  | ?@#!!#  |         | * One var assignment per file
                               | $ ls | while read file; do x="$file" ; done 
                               | Note: It bombed out after 6min 44sec and approx 0.8 GiB
                               |       ERROR message: ls: memory exhausted
-----------+---------+---------+----------------------------------------------------- 

Solution 3

Another approach, using find in a completely safe way:

while IFS= read -rd $'\0' path
do
    # readlink -f canonicalizes the path; the extra "x" (stripped on the next
    # line) protects any trailing newlines in the name from being eaten by
    # the command substitution.
    file_path="$(readlink -fn -- "$path"; echo x)"
    file_path="${file_path%x}"
    sed -i -e 's/blah/blee/g' -- "$file_path"
done < <( find "$absolute_dir_path" -type f -print0 )
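
The loop assumes $absolute_dir_path has already been set; the variable name comes from the snippet above, and the value below is just an example:

absolute_dir_path=$(pwd)    # run this from the directory that holds the 10 million files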

Solution 4

Try:

ls | while read file; do (something to $file); done
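
For the substitution in the question, that might look like the sketch below. This is an assumption-laden variant: ls -f skips the sorting that appears to have exhausted ls's memory in the benchmark above, and the loop still assumes filenames without newlines:

ls -f | while IFS= read -r file; do
    [ -f "$file" ] || continue              # ls -f also lists . and .. and any subdirectories
    sed -i -e 's/blah/blee/g' -- "$file"    # GNU sed in-place edit
done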

Solution 5

This is mostly off-topic, but you could use

find -maxdepth 1 -type f -name '*.txt' | xargs python -c '
import fileinput
for line in fileinput.input(inplace=True):
    print line.replace("blah", "blee"),
'

The main benefit here (over ... xargs ... -I {} ... sed ...) is speed: you avoid invoking sed 10 million times. It would be faster still if you could avoid using Python (since python is kind of slow, relatively), so perl might be a better choice for this task. I'm not sure how to do the equivalent conveniently with perl.
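
For what it's worth, one plausible perl equivalent is sketched below (untested here): perl -i -p edits every file it is handed in place, so a single perl process covers each batch that xargs builds.

find -maxdepth 1 -type f -name '*.txt' -print0 | xargs -0 perl -i -pe 's/blah/blee/g'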

The way this works is that xargs will invoke Python with as many arguments as it can fit on a single command line, and keep doing that until it runs out of arguments (which are being supplied by find). The number of arguments to each invocation will depend on the length of the filenames and, um, some other stuff. The fileinput.input function yields successive lines from the files named in each invocation's arguments, and the inplace option tells it to magically "catch" the output and use it to replace each line.

Note that Python's string replace method doesn't use regexps; if you need those, you have to import re and use print re.sub("blah", "blee", line), (keeping the trailing comma, as above). Python's regexps are Perl-style, sort of heavily fortified versions of the ones you get with sed -r.

edit

As akira mentions in the comments, the original version using a glob (ls -f *.txt) in place of the find command wouldn't work because globs are processed by the shell (bash) itself. This means that before the command is even run, 10 million filenames will be substituted into the command line. This is pretty much guaranteed to exceed the maximum size of a command's argument list. You can use xargs --show-limits for system-specific info on this.

The maximum size of the argument list is also taken into account by xargs, which limits the number of arguments it passes to each invocation of python according to that limit. Since xargs will still have to invoke python quite a few times, akira's suggestion to use os.path.walk to get the file listing will probably save you some time.
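
Alternatively, find can do the batching itself: -exec ... {} + appends as many filenames as fit per invocation, much like xargs does. A sketch of the same idea without the pipe (same caveats as above):

find -maxdepth 1 -type f -name '*.txt' -exec python -c '
import fileinput
for line in fileinput.input(inplace=True):
    print line.replace("blah", "blee"),
' {} +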


Comments

  • Sandro
    Sandro over 1 year

    I have a directory that has 10144911 files in it. So far I've tried the following:

    • for f in ls; do sed -i -e 's/blah/blee/g' $f; done

Crashed my shell; the ls is actually in backticks, but I can't figure out how to format them here.

    • ls | xargs -0 sed -i -e 's/blah/blee/g'

    Too many args for sed

    • find . -name "*.txt" -exec sed -i -e 's/blah/blee/g' {} \;

Couldn't fork any more: no more memory

Any other ideas on how to create this kind of command? The files don't need to communicate with each other. ls | wc -l seems to work (very slowly), so it must be possible.

    • Admin
      Admin about 13 years
      It would be faster if you can avoid invoking sed for each file. I'm not sure if there's a way to open, edit, save, and close a series of files in sed; if speed is essential you may want to use a different program, perhaps perl or python.
    • Admin
      Admin about 13 years
@akira: Are you saying that launching perl or python once for as many files as will fit on a command line is more expensive than launching sed once for each of those files? I would be really surprised if that were the case. I guess you didn't understand that my suggestion is to invoke (start) the editing program once (or at least fewer times; see my answer), and have it open, modify and resave each of the files in turn, rather than invoking the editing program separately for each of those files.
    • Admin
      Admin about 13 years
Your first comment does not reflect what you really wanted to say: "replace sed with python/perl". Just doing that, and looking at the command line the OP has given, an innocent reader could assume that "find . -exec python" is faster than "find . -exec sed", which is obviously not the case. In your own answer you call python much more often than is actually needed.
    • Admin
      Admin about 13 years
I think that akira misinterpreted your (intuited) suggestion. I believe that you were suggesting to bunch files together. I tried that with my xargs attempt; time to try it again :)
    • Admin
      Admin about 13 years
@Sandro: your 'xargs -0 sed -i' already calls sed on a batch of files rather than launching it for each file. I find @intuited's first comment misleading because he provides only half of what he has in mind, and his answer leaves out the interesting part (for others) as well.
    • Admin
      Admin about 13 years
      Sandro: Crazy! I think for the benefit of the community, you should explain how you ended up in this situation. How big is the directory entry itself? Probably several hundred megs. What filesystem are you using? The xargs option might work if you use -n to limit the number of args per sed run.
  • geekosaur
    geekosaur about 13 years
    ls -f would be better; do you really want to wait around for it to stat() and sort that many files?
  • Sandro
    Sandro about 13 years
Right now I'm trying for f in *.txt; do blah; done. I'll give that a whack if it fails. Thank you!
  • Chris Johnsen
    Chris Johnsen about 13 years
    Probably, it will need to be find . -name \*.txt -print0 to avoid having the shell expand the glob and trying to alloc space for 10 million arguments to find.
  • Dennis Williamson
    Dennis Williamson about 13 years
    @ChrisJohnsen: Yes, that's correct. I rushed posting my answer and missed including those essential parts. I've edited my answer with those corrections. Thanks.
  • akira
    akira about 13 years
What's the point of using the glob operator (which will fail for that many files anyway) and then feeding the files to python, which has os.path.walk()?
  • intuited
    intuited about 13 years
@akira: The glob operator is there to avoid trying to replace the contents of . and ... Certainly there are other ways to do that (e.g. find), but I'm trying to stick as closely as possible to what the OP understands. This is also the reason for not using os.path.walk.
  • intuited
    intuited about 13 years
    @akira: Good suggestion, though, that would probably be considerably faster.
  • Sandro
    Sandro about 13 years
    Trying it now... crosses fingers
  • akira
    akira about 13 years
I think that the OP will understand os.path.walk quite easily.