What's a safe and portable way to split a string in shell programming?

shell shell-script split portability

5,118

Solution 1

Just set IFS according to your needs and let the shell perform word splitting:

IFS=':'
for dir in $PATH; do
    [ -x "$dir"/"$1" ] && echo $dir
done

This works in bash, dash and ksh, though I have only tested it with the latest versions.

Solution 2

The obvious solution would be to use the shell word splitting, but beware of a few gotchas:

IFS=:
set -o noglob
for dir in $PATH''; do
    dir=${dir:-.}
    [ -x "${dir%/}/$1" ] && printf "%s\n" "$dir"
done

You need set -o noglob because when a variable is left unquoted, both word splitting and filename generation (globbing) are performed on it, and here you only want word splitting. For instance, in the unlikely event that $PATH contains /usr/local/*bin*, you want it to look in the /usr/local/*bin* folder, not in /usr/local/bin and /usr/local/sbin; and if PATH contains /*/*/*/../../../*/*/*/*/../../../*/*/*/*, you don't want it to bring your machine down.
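To see the difference noglob makes, here is a minimal sketch. It assumes mktemp -d is available (it is not specified by POSIX) and that the temporary directory's path contains no colon or glob characters. The directory names bin and sbin and all variable names are made up for the illustration.

```shell
# Scratch directory with two entries that match the pattern *bin.
tmp=$(mktemp -d)
mkdir "$tmp/bin" "$tmp/sbin"
list="$tmp/*bin:/opt"

# Without noglob: the value is split on ':' and *bin is then expanded
# against the filesystem.
with_glob=$(IFS=:; for d in $list; do printf '%s\n' "$d"; done)

# With noglob: only splitting happens, the pattern is kept literally.
without_glob=$(IFS=:; set -o noglob; for d in $list; do printf '%s\n' "$d"; done)

rm -r "$tmp"

printf 'globbing on:\n%s\n\nglobbing off:\n%s\n' "$with_glob" "$without_glob"
```

Both loops run inside a command substitution, so neither the IFS change nor noglob leaks into the rest of the script.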

An empty $PATH component means the current directory (.), not /, so $dir/$1 wouldn't be correct in that case. The workaround is either to write $dir${dir:+/}$1 or to change $dir to . in that case (which gives more useful output when displayed with printf '%s\n' "$dir").

//foo is not necessarily the same as /foo, so if / is in $PATH, you don't want $dir/$1, which would be //$1. Hence the ${dir%/} to strip a trailing slash.
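Both rules (an empty component means ., and no doubled slash) can be wrapped in a small helper. This is only a sketch; join_path is a hypothetical name, not a standard utility.

```shell
# join_path DIR NAME: build the path to NAME inside the $PATH component DIR.
# The body runs in a subshell, so the variable dir does not leak to the caller.
join_path() (
    dir=${1:-.}                      # empty component: use the current directory
    printf '%s\n' "${dir%/}/$2"      # strip one trailing slash before joining
)

join_path ''         ls    # ./ls
join_path /          ls    # /ls
join_path /usr/bin   ls    # /usr/bin/ls
join_path /usr/bin/  ls    # /usr/bin/ls
```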

Then, there are a few other problems:

For $PATH, ":" is a field separator while for $IFS, it is a field terminator (yes, I know, S is for Separator, blame ksh and POSIX for standardizing the ksh behaviour).

So if $PATH is /usr/bin:/bin: (which is bad practice but still commonly found), that means "/usr/bin", "/bin" and "" (that is, the current directory), while the shell word splitting (all POSIX shells except zsh) will split that into /usr/bin and /bin only.

If $PATH is set but empty, that means "look in the current directory only", whereas shells (including those that treat $IFS as a separator) will expand it to an empty list.

Appending the '' to $PATH above works around both issues.
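A quick way to check this is to count the fields the shell produces with and without the trailing ''. The sketch below uses two hypothetical helpers, count_plain and count_fixed, and assumes a POSIX shell other than zsh.

```shell
# Both helpers run in a subshell so IFS and noglob do not affect the caller.
# Each one splits its first argument on ':' and prints the number of fields.
count_plain() ( IFS=:; set -o noglob; set -- $1;   echo "$#" )
count_fixed() ( IFS=:; set -o noglob; set -- $1''; echo "$#" )

for p in '/usr/bin:/bin' '/usr/bin:/bin:' ''; do
    printf 'value=%-18s plain=%s fixed=%s\n' \
        "'$p'" "$(count_plain "$p")" "$(count_fixed "$p")"
done
```

In bash and dash the plain counts should come out as 2, 2 and 0, while the fixed counts should be 2, 3 and 1: the trailing empty component and the empty value are no longer lost.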

Last but not least, if $PATH is unset, that has a special meaning: look in the system default search list, which unfortunately means something different depending on who (what command) you ask.

$ env -u PATH bash -c 'type usbipd'
usbipd is /usr/local/sbin/usbipd
$ env -u PATH ksh -c 'type usbipd'
ksh: whence: usbipd: not found

And basically, in your script, you'd have to guess what that default search path is in the context that matters to you.

Note that POSIX leaves the behaviour unspecified when $PATH is unset or empty, so it won't help you there. That also means that what I said above may not apply to some past, current or future POSIX/Unix systems.

In short, parsing $PATH to try and find out where a command would be run from is a tricky business.

There is a standard command for that, which is command:

ls_path=$(command -v ls)
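In a script you would normally also check the exit status, since command -v prints nothing and returns non-zero when the name is not found. A minimal sketch follows; the messages are made up, and note that for a builtin or a function command -v prints just the name rather than a path.

```shell
if ls_path=$(command -v ls); then
    printf 'ls would be run from: %s\n' "$ls_path"
else
    printf 'ls not found in the current search path\n' >&2
fi
```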

But what one may ask is: why do you want to know?

Now onto restoring IFS to its default value:

oldIFS=$IFS
IFS=:
...
IFS=$oldIFS

will work in practice in most cases but is not guaranteed to work by POSIX.

The reason is that if $IFS was previously unset, which means default splitting behaviour (in POSIX shells, splitting on space, tab or newline), then after those commands it ends up set but empty, which means no splitting.
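The effect can be reproduced with a short sketch. The helper count_words and the variables was_set, naive and careful are hypothetical names; the careful variant is one possible way to remember whether IFS was set, using the ${IFS+word} expansion, not the only one.

```shell
# Print how many fields 'a b c' splits into under the current IFS.
count_words() { words='a b c'; set -- $words; echo "$#"; }

naive=$(
    unset IFS                 # default splitting: space, tab, newline
    oldIFS=$IFS               # oldIFS is now empty; "unset" is not recorded
    IFS=:
    IFS=$oldIFS               # IFS ends up set but empty
    count_words               # so no splitting happens here
)

careful=$(
    unset IFS
    # ${IFS+set} expands to "set" only if IFS is set (even if empty).
    if [ -n "${IFS+set}" ]; then was_set=1; oldIFS=$IFS; else was_set=0; fi
    IFS=:
    if [ "$was_set" = 1 ]; then IFS=$oldIFS; else unset IFS; fi
    count_words               # default splitting is back
)

printf 'naive: %s field(s), careful: %s field(s)\n' "$naive" "$careful"
```

The naive variant reports 1 field because the whole string is kept as a single word, while the careful variant reports 3.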

Another potential problem arises if you generalise that approach and use it in a lot of different functions: if, in the ... part above, you call a function that does the same thing (makes a copy of $IFS in $oldIFS), then you're going to lose the original $oldIFS and restore the wrong $IFS.
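A minimal sketch of that clobbering, with two hypothetical functions inner and outer that both use the save/restore idiom on the same global variable:

```shell
# Both functions save IFS in the same global variable, oldIFS.
inner() { oldIFS=$IFS; IFS=,; IFS=$oldIFS; }
outer() { oldIFS=$IFS; IFS=:; inner; IFS=$oldIFS; }

# Run in a command substitution so the damage stays in a subshell.
after=$(IFS=' '; outer; printf '%s' "$IFS")

printf 'IFS after outer: "%s"\n' "$after"
```

Here inner overwrites oldIFS with ":", so outer restores ":" instead of the original space.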

Instead you could use subshells when possible:

(
  IFS=:
  ...
)
# only the subshell's IFS was affected, the parent still has its own IFS

My approach is to set $IFS (and turn set -o noglob on or off) every time I need word splitting (which is rare) and not bother restoring the previous value. Of course, that doesn't work if your script calls someone else's code that doesn't follow that practice and assumes a default word splitting behaviour.

rahmu

Updated on September 18, 2022

Comments

rahmu over 1 year
When writing a shell script, I often want to split a string. Here's a very simple example:
```
for dir in $(echo $PATH | tr : " "); do
    [[ -x "$dir"/"$1" ]] && echo $dir
done
```
This will search each directory in the $PATH for an executable with the same name as $1. Pretty straightforward, it runs well, but breaks if a directory in my $PATH contains whitespace in its name.

What's the recommended way to split a string at the occurrence of a recurrent separator?

Ideally, the solution would be able to run on (fairly) old shells, namely ksh88.
Admin about 11 years

See How to iterate through a comma-separated list and execute a command for each entry (which doesn't address the specificities of $PATH).
rahmu about 11 years

Thanks! How can I set back IFS to its original default values, once the processing is done?
rahmu about 11 years

Never mind, I store the default value of IFS in a temp variable, which allows me to restore IFS easily. Thank you for the answer.
manatwork about 11 years

Either that, or force the shell to execute the given piece of code in a separate shell instance: (IFS=:; for … done). Of course, this is useful only if you do not need anything later from whatever was set inside the loop.
rahmu about 11 years

Parsing $PATH was the simplest short example I could come up with. Splitting a string with non-whitespace delimiters is a common problem I run into. I wanted to know how members here dealt with it in a robust and portable way.
Stéphane Chazelas about 11 years

Well, at least, you'll have learnt that if the string ends with a delimiter, you won't get an empty element, and that you need set -f to avoid the other side effect of leaving a variable unquoted. A lot of this applies to other variables of the same form like $MANPATH, $LD_LIBRARY_PATH...
rahmu about 11 years

Yes, definitely. Thank you very much for the answer :)
Stéphane Chazelas about 11 years

@ruakh, while it is possible and allowed by POSIX, and it would make sense for a shell to have $IFS unset by default, it is not the case in any shell that I know. All the Bourne-like shells I know start with IFS=$' \t\n' (with the exception of zsh, which also includes \0, since it can).
Gilles 'SO- stop being evil' about 11 years

PATH="$PWD/*:/bin:/usr/bin"; mkdir \*; cp /bin/ls \*/foo and try your snippet with ls. You missed set -f, see Stephane Chazelas's answer.
chepner about 11 years

There may be a better way, but you can distinguish between a null parameter and an unset parameter by comparing ${FOO:-x} and ${FOO-x}. The two are equivalent for an unset parameter, but not a null parameter.
Stéphane Chazelas about 11 years

@chepner, a common trick to save and restore IFS, is to write it oIFS=$IFS; ${IFS+:} unset oIFS and then same to restore: IFS=$oIFS; ${oIFS+:} unset IFS, but there's still an issue if there are nested functions using that trick.
chepner about 11 years

Clever; it took me a moment to parse that.