Using a generated list of filenames as argument list -- with spaces


Solution 1

You could do this with some implementations of find and xargs:

$ find . -type f -print0 | xargs -r0 ./myscript

or, standardly, just find:

$ find . -type f -exec ./myscript {} +

Example

Say I have the following sample directory.

$ tree
.
|-- dir1
|   `-- a\ file1.txt
|-- dir2
|   `-- a\ file2.txt
|-- dir3
|   `-- a\ file3.txt
`-- myscript

3 directories, 4 files

Now let's say I have this for ./myscript.

#!/bin/bash

for i in "$@"; do
    echo "file: $i"
done

Now when I run the following command.

$ find . -type f -print0 | xargs -r0 ./myscript 
file: ./dir2/a file2.txt
file: ./dir3/a file3.txt
file: ./dir1/a file1.txt
file: ./myscript

Or when I use the 2nd form like so:

$ find . -type f -exec ./myscript {} +
file: ./dir2/a file2.txt
file: ./dir3/a file3.txt
file: ./dir1/a file1.txt
file: ./myscript

Details

find + xargs

The above 2 methods, though they look different, are essentially the same. The first takes the output from find and splits it on NULs (\0), which the -print0 switch makes find emit as separators; xargs -0 is specifically designed to take input that's split using NULs. That non-standard syntax was introduced by GNU find and xargs but is nowadays also found in a few others, such as recent BSDs. The -r option is required with GNU xargs to avoid calling myscript when find finds nothing; BSD xargs skips the command on empty input by default.

NOTE: This entire approach hinges on the argument list never being exceedingly long. If it is, then a 2nd invocation of ./myscript is kicked off with the remainder of the results from find.
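That batching behavior is easy to see by forcing a tiny batch size with -n (a throwaway demo with made-up items, not part of the original commands):

```shell
# Force xargs to split five NUL-delimited items into batches of
# two, so echo runs three times -- one line of output per batch.
printf '%s\0' a b c d e | xargs -0 -n 2 echo
# a b
# c d
# e
```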

find with +

That's the standard way (though it was only added relatively recently (2005) to the GNU implementation of find). The ability to do what we're doing with xargs is built directly into find: find gathers a list of files and then passes as many of them as can fit as arguments to the command specified after -exec (note that {} must come last, immediately before +, in this case), running the command several times if needed.

Why no quoting?

In the first example we're taking a shortcut by completely avoiding the issues with the quoting, by using NULLs to separate the arguments. When xargs is given this list it's instructed to split on the NULLs effectively protecting our individual command atoms.

In the second example we're keeping the results internal to find, so it knows what each file atom is and is guaranteed to handle them appropriately, thereby avoiding the whole business of quoting them.
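A quick way to convince yourself that NUL separation protects spaces (a hypothetical filename, for illustration only):

```shell
# The space in 'a file.txt' survives: xargs sees exactly one
# argument, delimited by the NUL rather than by whitespace.
printf '%s\0' 'a file.txt' | xargs -0 printf '<%s>\n'
# <a file.txt>
```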

Maximum size of command line?

This question comes up from time to time, so as a bonus I'm adding it to this answer, mainly so I can find it in the future. You can use xargs to see what the environment's limits are, like so:

$ xargs --show-limits
Your environment variables take up 4791 bytes
POSIX upper limit on argument length (this system): 2090313
POSIX smallest allowable upper limit on argument length (all systems): 4096
Maximum length of command we could actually use: 2085522
Size of command buffer we are actually using: 131072

Solution 2

find . -name something.txt -exec myscript {} +

In the above, find finds all the matching file names and provides them as arguments to myscript. This works with file names regardless of spaces or any other odd characters.

If all the file names fit on one command line, then myscript is executed once. If the list is too long for a single invocation, then find will run myscript multiple times as needed.

MORE: How many files fit on a command line? man find says that find builds its command lines "much the same way that xargs builds its". And man xargs says that the limits are system-dependent and that you can determine them by running xargs --show-limits (getconf ARG_MAX is also a possibility). On Linux, the limit is typically (but not always) around 2 million characters per command line.
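getconf reports the raw kernel limit directly; note it is larger than what xargs will actually use, since the environment and some headroom get subtracted first:

```shell
# ARG_MAX is the kernel's byte limit for the combined argument
# and environment lists passed to execve(2). The value is
# system-dependent, so no particular number is shown here.
getconf ARG_MAX
```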

Solution 3

A few additions to @slm's fine answer.

The limitation on the size of the arguments is in the execve(2) system call (actually, it's on the cumulative size of the argument and environment strings and pointers). If myscript is written in a language that your shell can interpret, then maybe you don't need to execute it: you could have your shell interpret it without having to execute another interpreter.

If you run the script as:

(. myscript x y)

It's like:

myscript x y

Except that it's being interpreted by a child of the current shell, instead of being executed (which eventually involves executing sh, or whatever the shebang line specifies if any, with even more arguments).

Now obviously, you can't use find -exec {} + with the . command: since . is a builtin command of the shell, it has to be run by the shell, not by find.

With zsh, it's easy:

IFS=$'\0'
(. myscript $(find ... -print0))

Or:

(. myscript ${(ps:\0:)"$(find ... -print0)"})

Though with zsh, you wouldn't need find in the first place as most of its features are built into zsh globbing.

bash variables however cannot contain NUL characters, so you have to find another way. One way could be:

files=()
while IFS= read -rd '' -u3 file; do
  files+=("$file")
done 3< <(find ... -print0)
(. myscript "${files[@]}")
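On bash 4.4 and later, mapfile (a.k.a. readarray) with -d '' can replace that whole loop. A sketch with stand-in data in place of real find output:

```shell
# mapfile -d '' splits its input on NUL bytes; -t drops the
# trailing delimiter from each element (bash 4.4+ only).
mapfile -t -d '' files < <(printf '%s\0' 'a file1.txt' 'a file2.txt')
echo "${#files[@]}"   # → 2
```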

You might also use zsh-style recursive globbing with the globstar option in bash 4.0 and later:

shopt -s globstar failglob dotglob
(. myscript ./**/something.txt)

Note that ** followed symlinks to directories until it was fixed in bash 4.3. Also note that bash doesn't implement zsh globbing qualifiers so you won't get all the features of find there.

Another alternative would be to use GNU ls:

eval "files=($(find ... -exec ls -d --quoting-style=shell-always {} +))"
(. myscript "${files[@]}")

The above methods can also be used if you want to make sure myscript is executed only once (failing if the argument list is too large). On recent versions of Linux, you can raise and even lift that limitation on the argument list with:

ulimit -s 1048576

(1GiB stack size, a quarter of which can be used for the arg+env list).

ulimit -s unlimited

(no limit)
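You can check the current soft limit before raising it; ulimit -s reports it in KiB (or "unlimited"):

```shell
# Query the current soft stack limit; raising it beyond the hard
# limit requires appropriate privileges.
ulimit -s
```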

Solution 4

Isn't there some way to protect spaces in backtick (or $(...)) expansion?

No, there isn't. Why is that?

Bash has no way of knowing what should be protected and what shouldn't.

There are no arrays in a unix file or pipe. It's just a byte stream. The command inside the `` or $() outputs a stream, which bash swallows and treats as a single string. At that point, you only have two choices: put it in quotes, to keep it as one string, or leave it naked, so that bash splits it up according to its configured behavior.

So what you have to do if you want an array is to define a byte format that has an array, and that's what tools like xargs and find do: if you run them with the -0 argument, they work according to a binary array format which terminates elements with the null byte, adding semantics to the otherwise opaque byte stream.

Unfortunately, bash cannot be configured to split strings on the null byte. Thanks to https://unix.stackexchange.com/a/110108/17980 for showing us that zsh can.

xargs

You want your command to run once, and you said that xargs -0 -n 10000 solves your problem. It doesn't: it ensures that if you have more than 10000 parameters, your command will run more than once.

If you want to make it strictly either run once or fail, you have to provide the -x argument and an -n argument larger than the -s argument (really: large enough that a whole bunch of zero-length arguments plus the name of the command do not fit in the -s size). (man xargs, see excerpt far below)

The system I'm currently on has a stack limited to about 8M, so here's my limit:

$ printf '%s\0' -- {1..1302582} | xargs -x0n 2076858 -s 2076858 /bin/true
xargs: argument list too long
$ printf '%s\0' -- {1..1302581} | xargs -x0n 2076858 -s 2076858 /bin/true
(no output)

bash

If you don't want to involve an external command, the while-read loop feeding an array, as shown in https://unix.stackexchange.com/a/110108/17980, is the only way for bash to split things at the null byte.

The idea to source the script ( . ... "$@" ) to avoid the stack size limit is cool (I tried it, it works!), but probably not important for normal situations.

Using a special fd for the process pipe is important if you want to read something else from stdin, but otherwise you won't need it.

So, the simplest "native" way, for everyday household needs:

files=()
while IFS= read -rd '' file; do
    files+=("$file")
done < <(find ... -print0)

myscriptornonscript "${files[@]}"

If you like your process tree clean and nice to look at, this method allows you to do exec mynonscript "${files[@]}", which removes the bash process from memory, replacing it with the called command. xargs will always remain in memory while the called command runs, even if the command is only going to run once.


What speaks against the native bash method is this:

$ time { printf '%s\0' -- {1..1302581} | xargs -x0n 2076858 -s 2076858 /bin/true; }

real    0m2.014s
user    0m2.008s
sys     0m0.172s

$ time {
  args=()
  while IFS= read -rd '' arg; do
    args+=( "$arg" )
  done < <(printf '%s\0' -- $(echo {1..1302581}))
  /bin/true "${args[@]}"
}
bash: /bin/true: Argument list too long

real    107m51.876s
user    107m38.532s
sys     0m7.940s

bash is not optimized for array handling.


man xargs:

-n max-args

Use at most max-args arguments per command line. Fewer than max-args arguments will be used if the size (see the -s option) is exceeded, unless the -x option is given, in which case xargs will exit.

-s max-chars

Use at most max-chars characters per command line, including the command and initial-arguments and the terminating nulls at the ends of the argument strings. The largest allowed value is system-dependent, and is calculated as the argument length limit for exec, less the size of your environment, less 2048 bytes of headroom. If this value is more than 128KiB, 128KiB is used as the default value; otherwise, the default value is the maximum. 1KiB is 1024 bytes.

-x

Exit if the size (see the -s option) is exceeded.

Solution 5

On most systems, there is a limit on the length of a command line passed to any program, whether via xargs or -exec command {} +. From man find:

-exec command {} +
      This  variant  of the -exec action runs the specified command on
      the selected files, but the command line is built  by  appending
      each  selected file name at the end; the total number of invoca‐
      tions of the command will  be  much  less  than  the  number  of
      matched  files.   The command line is built in much the same way
      that xargs builds its command lines.  Only one instance of  `{}'
      is  allowed  within the command.  The command is executed in the
      starting directory.

Invocations will be far fewer, but not guaranteed to be one. What you should do is read the NUL-separated filenames from stdin in the script, possibly based on a command-line argument -o -. I would do something like:

$ find . -name something.txt -print0 | myscript -0 -o -

and implement the option arguments to myscript accordingly.
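For illustration, the reading side of such a -0 mode could be as small as this (a sketch only; the option parsing for -0 and -o is omitted, and the behavior shown is an assumption about what myscript does):

```shell
#!/bin/bash
# Read NUL-separated filenames from stdin, one per read -d ''.
while IFS= read -rd '' file; do
    echo "file: $file"
done
```

Fed by find . -name something.txt -print0 as above, this sees every name intact, spaces and all.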


Author: alexis

Updated on September 18, 2022

Comments

  • alexis
    alexis almost 2 years

    I'm trying to invoke a script with a list of filenames collected by find. Nothing special, just something like this:

    $ myscript `find . -name something.txt`
    

    The problem is that some of the pathnames contain spaces, so they get broken up into two invalid names on argument expansion. Normally I would surround the names with quotes, but here they're inserted by the backquote expansion. I've tried filtering the output of find and surrounding each filename with quotes, but by the time bash sees them, it's too late to strip them and they are treated as part of the filename:

    $ myscript `find . -name something.txt | sed 's/.*/"&"/'`
    No such file or directory: '"./somedir/something.txt"'
    

    Yes, that's the rules for how the command line is processed, but how do I get around it?

    This is embarrassing but I'm failing to come up with the right approach. I finally figured out how to do it with xargs -0 -n 10000... but it's such an ugly hack that I still want to ask: How do I quote the results of backquote expansion, or achieve the same effect in another way?

    Edit: I was confused about the fact that xargs does collect all arguments into a single argument list, unless it's told otherwise or system limits might be exceeded. Thanks to everyone for setting me straight! Others, keep this in mind as you read the accepted answer because it's not pointed out very directly.

    I've accepted the answer, but my question remains: Isn't there some way to protect spaces in backtick (or $(...)) expansion? (Note that the accepted solution is a non-bash answer).

    • Admin
      Admin over 10 years
      I guess you'd need to change what does the shell use as filename separators (for example, by playing with the value of IFS, one possible way is IFS=", newline, "). But is there a need to execute the script over all the filenames? If not, consider using find itself to execute the script for each file.
    • Admin
      Admin over 10 years
      Changing the IFS is a great idea, hadn't thought of it! Not practical for commandline usage, but still. :-) And yes, the goal is to pass all the arguments to the same invocation of my script.
  • alexis
    alexis over 10 years
    Thanks but I need to pass all the arguments to the same invocation of my script. That's in the problem description, but I guess I didn't make it clear that it's not incidental.
  • slm
    slm over 10 years
    @alexis - read the answers again, they are passing all the arguments to a single call of your script.
  • alexis
    alexis over 10 years
    I'll be damned! I didn't know about the + argument to find (and you use + in prose too, so I missed your explanation the first time). But more to the point, I'd misunderstood what xargs does by default!!! In three decades of using Unix I've never had a use for it until now, but I thought I knew my toolbox...
  • slm
    slm over 10 years
    @alexis - I figured you'd missed what we were saying. Yes xargs is a devil of a command. You have to read it and find's man pages many times over to grok what they can do. Many of the switches are contra-positives of each other so that adds to the confusion.
  • slm
    slm over 10 years
    @alexis - also one more thing to add to the tool box, don't use the backquotes/backticks for running nested commands, use $(..) now instead. It automatically handles nesting of quotes etc. Backticks are being deprecated.
  • Timo
    Timo over 10 years
    The OP wants myscript to run once, and that is not guaranteed with -exec myscript {} +; the man page only says that the invocations will be much less. xargs -0 has a limit as well.
  • alexis
    alexis over 10 years
    @Timo, thanks but as I understand it, these both accept large numbers of arguments. My arglists are not humongous, I was just mis-remembering how xargs works.
  • slm
    slm over 10 years
    @Timo - see updates, you can use xargs --show-limits to get size. My system shows it at 2MB.
  • slm
    slm over 10 years
    @alexis - see updates, you can use xargs --show-limits to get size. My system shows it at 2MB.
  • alexis
    alexis about 9 years
    Thanks for all the trouble but your basic premise ignores the fact that bash normally uses an elaborate system of quote processing. But not in backquote expansion. Compare the following (which both give errors, but show the difference): ls "what is this" vs. ls `echo '"what is this"'` . Someone neglected to implement quote processing for the result of backquotes.
  • clacke
    clacke about 9 years
    I'm glad backquotes don't do quote processing. The fact that they even do word splitting has caused enough confused looks, head-scratching and security flaws in modern computing history.
  • clacke
    clacke about 9 years
    The question is "Isn't there some way to protect spaces in backtick (or $(...)) expansion?", so it seems appropriate to ignore processing that is not done in that situation.
  • clacke
    clacke about 9 years
    The null-terminated element array format is the simplest and therefore safest way to express an array. It's just a shame that bash doesn't support it natively like apparently zsh does.
  • Jeremy Fishman
    Jeremy Fishman over 8 years
    I wanted my buffers sorted by file name, so I used the following bashism: xargs -0 -a <(find . -type f -print0 | sort -z) vi