While loop for bash scripting to read either stdin or arguments

bash arguments stdin

5,545

The change is in this chunk:

while read name; do
    efetch -db nucleotide -id $name -format gpc > $name.xml;
done < "$@"

which makes the efetch in the loop run with its standard input redirected to the file given by the arguments (this only works with exactly one argument; with several, bash reports an ambiguous redirect). So that makes two changes to the way efetch is used:

its standard input is no longer the default (terminal)
its ids no longer come literally from the script's command-line parameters, but indirectly, from the file those parameters name.

If efetch detects that its input is not a terminal, it could very well reopen the terminal directly (perhaps that is what you are referring to as "efetch accepts stdin instead of an id"). Alternatively, if efetch is reading its stdin, it could read something unexpected (in a quick test, that seems to be the script itself).
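
The second possibility is easy to reproduce without efetch. The sketch below uses a hypothetical stand-in function, greedy, that drains its standard input the way a command might when it finds that stdin is not a terminal; whether efetch behaves exactly like this is an assumption, not something confirmed here. Redirecting the command's input from /dev/null keeps it away from the loop's data:

```shell
#!/bin/sh
# greedy is a hypothetical stand-in for a command that reads all of its
# standard input before doing its work.
greedy() {
    cat > /dev/null
    echo "got $1"
}

# The command shares the loop's stdin, so it swallows the remaining lines.
process_shared() {
    while read -r name; do
        greedy "$name"
    done
}

# The command reads /dev/null instead, so the loop still sees every line.
process_isolated() {
    while read -r name; do
        greedy "$name" < /dev/null
    done
}
```

Fed three lines, for example with printf '%s\n' a b c | process_shared, the first function handles only a, while process_isolated handles all three.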

@chepner pointed out that the shell (bash in this case) does not spawn a subprocess for a loop whose input is redirected. I had in mind a different case, a pipe, which does. Consider these two scripts:

#!/bin/bash 
LAST=...
while read name
do
    /bin/echo "** $name"
    LAST="$name"
done < "$@"
echo "...$LAST"

and

#!/bin/bash
LAST=...
cat "$@" | while read name
do
    /bin/echo "** $name"
    LAST="$name"
done
echo "...$LAST"

The latter (pipe) will echo "......" at the end, while the former (redirection) echoes the last value assigned to LAST within the loop. The pipe form is often described as requiring a subprocess, which explains why variable assignments made inside the loop do not propagate out of it.
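
One portable way to keep the pipe and still see the value is sketched below, assuming that everything that needs LAST can move into the same command group as the loop (last_line is a name chosen for illustration):

```shell
#!/bin/sh
# last_line FILE...: print "..." followed by the last line read.
# The braces group the loop and the final echo, so both run in the same
# process on the receiving end of the pipe, whichever shell is used.
last_line() {
    LAST=...
    cat "$@" | {
        while read -r name; do
            LAST="$name"
        done
        echo "...$LAST"
    }
}
```

Only the code inside the braces sees the updated value; in bash and dash, LAST is still unchanged once the pipeline has finished.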

Interestingly enough, there are differences between shells for the latter (a pipe) regarding the number of processes used. Testing with (Debian/testing) bash, dash (/bin/sh), zsh and ksh93, using strace -fo to capture system calls and process ids:

#!/bin/sh
for sh in bash dash zsh ksh93
do
    echo "++ $sh"
    strace -fo $sh.log ./do-$sh ./once
    LC=$(sed -e 's/ .*//' $sh.log |sort -u |wc -l)
    WC=$(wc -l $sh.log)
    echo "-- $LC / $WC"
done

The script shows, for each shell, the number of distinct process ids in the strace log and the number of lines in that log (roughly, the number of system calls). (The file once contains two lines, "first" and "second", to eliminate one testing boundary.)

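I see that zsh and ksh93 use one process fewer than bash and dash, and that they keep the value assigned to LAST inside the loop because they run the last part of the pipeline in the current shell: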

$ ./testit
++ bash
** first
** second
......
-- 5 / 401 bash.log
++ dash
** first
** second
......
-- 5 / 222 dash.log
++ zsh
** first
** second
...second
-- 4 / 568 zsh.log
++ ksh93
** first
** second
...second
-- 4 / 336 ksh93.log

Running the pipe takes 1 or 2 more processes than using redirection for this example.
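
Putting the points above together, here is a minimal sketch of a script that takes ids either from its arguments or, when there are none, from standard input. The efetch command line is copied from the question; the function names fetch_one and fetch_ids are made up for illustration, and the < /dev/null rests on the assumption that efetch reads its standard input when that is not a terminal:

```shell
#!/bin/sh
# fetch_one ID: fetch one record into ID.xml.
# stdin comes from /dev/null so efetch cannot consume the id list.
fetch_one() {
    efetch -db nucleotide -id "$1" -format gpc < /dev/null > "$1.xml"
}

# fetch_ids [ID...]: use the arguments as ids, or read ids from stdin,
# one per line, when no arguments are given.
fetch_ids() {
    if [ "$#" -gt 0 ]; then
        for name in "$@"; do
            fetch_one "$name"
        done
    else
        while read -r name; do
            # Skip empty lines rather than writing a file named ".xml".
            [ -n "$name" ] || continue
            fetch_one "$name"
        done
    fi
}
```

Ending the script with fetch_ids "$@" lets it be run either with ids on the command line or with a list of ids, one per line, on standard input.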

ahelix

Updated on September 18, 2022

Comments

ahelix almost 2 years

I'm playing around with the accepted answer from this thread: Bash script that reads filenames from a pipe or from command line args?

When I use the script below, efetch (ftp://ftp.ncbi.nlm.nih.gov/entrez/entrezdirect/edirect.zip) accepts an id (as an argument, for example 941241313) but not stdin.

if [ $# -gt 0 ] ;then
    for name in "$@"; do
        efetch -db nucleotide -id $name -format gpc > $name.xml;
    done
  else
    IFS=$'\n' read -d '' -r -a filenames
    while read name; do
        efetch -db nucleotide -id $name -format gpc > $name.xml;
    done < "${filenames[@]}"
  fi

When I modify it to the version below, efetch accepts stdin instead of an id:

if [ $# -gt 0 ] ;then
    while read name; do
        efetch -db nucleotide -id $name -format gpc > $name.xml;
    done < "$@"
  else
    IFS=$'\n' read -d '' -r -a filenames
    while read name; do
        efetch -db nucleotide -id $name -format gpc > $name.xml;
    done < "${filenames[@]}"
  fi

What's wrong?

chepner over 8 years

You can't read from multiple files by feeding an array expansion to the < operator; that's just a syntax error.

chepner over 8 years

There's no (additional) subprocess involved with redirection, but the point about efetch detecting if its standard input is a terminal or not stands.

Marius over 8 years

I see (will amend).