Is there something like JavaScript's "split()" in the shell?

Solution 1

Bourne/POSIX-like shells have a split+glob operator and it's invoked every time you leave a parameter expansion ($var, $-...), command substitution ($(...)), or arithmetic expansion ($((...))) unquoted in list context.

In fact, you invoked it by mistake when you wrote for name in ${array[@]} instead of for name in "${array[@]}". (Beware that invoking that operator by mistake like this is a source of many bugs and security vulnerabilities.)
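The difference between the quoted and unquoted forms can be observed by counting the arguments a command receives. This is only a sketch for bash: count_args, the sample array and the scratch directory with files a, b and c are made up for illustration, and the default value of $IFS is assumed.

```bash
#!/usr/bin/env bash
# count_args is a hypothetical helper: it prints how many arguments it got.
count_args() { echo "$#"; }

array=('two words' '*')

# Quoted: one argument per array element.
quoted=$(count_args "${array[@]}")

# Unquoted: split+glob is applied to each element. Run it in a scratch
# directory containing a, b and c so the effect of the glob part is known.
tmp=$(mktemp -d)
touch "$tmp/a" "$tmp/b" "$tmp/c"
unquoted=$(cd "$tmp" && count_args ${array[@]})
rm -rf "$tmp"

echo "quoted: $quoted, unquoted: $unquoted"   # quoted: 2, unquoted: 5
```

In the unquoted case, 'two words' is split into two and words, and * is expanded to a, b and c.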

That operator is configured with the $IFS special parameter, which tells it what characters to split on (though beware that space, tab and newline receive a special treatment there), and with the -f option, which disables (set -f) or enables (set +f) the glob part.

Also note that while the S in $IFS was originally (in the Bourne shell where $IFS comes from) for Separator, in POSIX shells, the characters in $IFS should rather be seen as delimiters or terminators (see below for an example).

So to split on _:

string='var1_var2_var3'
IFS=_ # delimit on _
set -f # disable the glob part
array=($string) # invoke the split+glob operator

for i in "${array[@]}"; do # loop over the array elements.
  printf '%s\n' "$i"
done

To see the distinction between separator and delimiter, try on:

string='var1_var2_'

That will split it into var1 and var2 only (no extra empty element).
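A quick way to check this in bash is to count the resulting elements. In this sketch, count_fields is a hypothetical helper; its body runs in a subshell so that the changes to $IFS and the -f option don't leak into the rest of the script.

```bash
#!/usr/bin/env bash
# count_fields is a made-up name. The ( ... ) body is a subshell, so
# IFS and set -f are only changed inside it.
count_fields() (
  IFS=_              # delimit on _
  set -f             # disable the glob part
  fields=($1)        # unquoted on purpose: invoke the split+glob operator
  echo "${#fields[@]}"
)

count_fields 'var1_var2_var3'   # 3
count_fields 'var1_var2_'       # 2: the trailing _ only terminates var2
```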

So, to make it similar to JavaScript's split(), you'd need an extra step:

string='var1_var2_var3'
IFS=_ # delimit on _
set -f # disable the glob part
temp=${string}_ # add an extra delimiter
array=($temp) # invoke the split+glob operator

(note that it would split an empty $string into 1 (not 0) element, like JavaScript's split()).
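The extra-delimiter approach could be wrapped in a function along these lines. This is a sketch for bash, not a drop-in utility: js_split and the global array parts are made-up names, and the delimiter is assumed to be a single character other than space, tab or newline.

```bash
#!/usr/bin/env bash
# js_split STRING DELIM: store the pieces in the global array `parts`.
# Both names are hypothetical. DELIM is assumed to be one non-blank character.
js_split() {
  local IFS="$2" temp="$1$2" had_noglob=0   # temp: string plus extra delimiter
  case $- in *f*) had_noglob=1 ;; esac      # remember whether -f was already set
  set -f                                    # disable the glob part
  parts=($temp)                             # unquoted on purpose: split+glob
  if [ "$had_noglob" -eq 0 ]; then set +f; fi   # restore globbing
}

js_split 'var1_var2_var3' _
printf '<%s>\n' "${parts[@]}"

js_split '' _
echo "${#parts[@]}"    # 1: a single empty element
```

Using local for IFS and restoring the -f option keeps the caller's settings untouched, which matters if the rest of the script relies on globbing.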

To see the special treatments tab, space and newline receive, compare:

IFS=' '; string=' var1  var2  '

(where you get var1 and var2) with

IFS='_'; string='_var1__var2__'

where you get: '', var1, '', var2, ''.
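To run that comparison in bash, something like the following can be used. show_fields is a hypothetical helper that prints each resulting element between brackets; it runs in a subshell so the $IFS and -f changes stay local.

```bash
#!/usr/bin/env bash
# show_fields IFS_VALUE STRING: print each field as [field] on one line.
# (made-up helper; the ( ... ) body is a subshell)
show_fields() (
  IFS=$1
  set -f             # disable the glob part
  fields=($2)        # unquoted on purpose: invoke the split+glob operator
  printf '[%s]' "${fields[@]}"
  echo
)

show_fields ' ' ' var1  var2  '    # [var1][var2]
show_fields '_' '_var1__var2__'    # [][var1][][var2][]
```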

Note that the zsh shell doesn't invoke that split+glob operator implicitly like that unless in sh or ksh emulation. There, you have to invoke it explicitly: $=string for the split part, $~string for the glob part ($=~string for both). It also has a split operator where you can specify the separator:

array=(${(s:_:)string})

or to preserve the empty elements:

array=("${(@s:_:)string}")

Note that there, s is for splitting, not delimiting (the same goes for $IFS in zsh, a known POSIX non-conformance). It's different from JavaScript's split() in that an empty string is split into 0 (not 1) elements.

A notable difference with $IFS-splitting is that ${(s:abc:)string} splits on the abc string, while with IFS=abc, that would split on a, b or c.
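bash has no direct equivalent of that zsh flag, but splitting on a whole string rather than on any of its characters can be approximated with a loop over parameter expansions. A minimal sketch, where split_on_string and the global array parts are made-up names and the separator is assumed to be non-empty:

```bash
#!/usr/bin/env bash
# split_on_string STRING SEP: store the pieces in the global array `parts`.
# Both names are hypothetical. SEP is taken literally, not as a pattern.
split_on_string() {
  local rest=$1 sep=$2
  [ -n "$sep" ] || return 1            # an empty separator would loop forever
  parts=()
  while [[ $rest == *"$sep"* ]]; do
    parts+=("${rest%%"$sep"*}")        # text before the first separator
    rest=${rest#*"$sep"}               # drop that text and the separator
  done
  parts+=("$rest")                     # whatever follows the last separator
}

split_on_string 'xabcyabcz' abc
printf '<%s>\n' "${parts[@]}"          # <x>, <y> and <z>, one per line
```

Like JavaScript's split(), this keeps empty elements, and an empty input yields one empty element.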

With zsh and ksh93, the special treatment that space, tab or newline receive can be removed by doubling them in $IFS.

As a historic note, the Bourne shell (the ancestor of modern POSIX shells) always stripped the empty elements. It also had a number of bugs related to splitting and expansion of $@ with non-default values of $IFS. For instance IFS=_; set -f; set -- $@ would not be equivalent to IFS=_; set -f; set -- $1 $2 $3....

Splitting on regexps

Now for something closer to JavaScript's split() that can split on regular expressions, you'd need to rely on external utilities.

In the POSIX tool-chest, awk has a split operator that can split on extended regular expressions (those are more or less a subset of the Perl-like regular expressions supported by JavaScript).

split() {
  awk -v q="'" '
    function quote(s) {
      gsub(q, q "\\" q q, s)
      return q s q
    }
    BEGIN {
      n = split(ARGV[1], a, ARGV[2])
      for (i = 1; i <= n; i++) printf " %s", quote(a[i])
      exit
    }' "$@"
}
string=a__b_+c
eval "array=($(split "$string" '[_+]+'))"

The zsh shell has builtin support for Perl-compatible regular expressions (in its zsh/pcre module), but using it to split a string, though possible, is relatively cumbersome.

Solution 2

Yes, use IFS and set it to _. Then use read -a to store into an array (-r turns off backslash expansion). Note that this is specific to bash; ksh and zsh have similar features with slightly different syntax, and plain sh doesn't have array variables at all.

$ r="var1_var2_var3"
$ IFS='_' read -r -a array <<< "$r"
$ for name in "${array[@]}"; do echo "+ $name"; done
+ var1
+ var2
+ var3

From man bash:

read

-a aname

The words are assigned to sequential indices of the array variable aname, starting at 0. aname is unset before any new values are assigned. Other name arguments are ignored.

IFS

The Internal Field Separator that is used for word splitting after expansion and to split lines into words with the read builtin command. The default value is ``<space><tab><newline>''.

Note that read stops at the first newline. Pass -d '' to read to avoid that, but in that case, there will be an extra newline at the end due to the <<< operator. You can remove it manually:

IFS='_' read -r -d '' -a array <<< "$r"
array[$((${#array[@]}-1))]=${array[$((${#array[@]}-1))]%?}
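As a possible alternative that avoids the trailing-newline cleanup, newer versions of bash can read delimiter-terminated records with mapfile. This is a sketch under assumptions: it needs bash 4.4 or later for mapfile -d, the delimiter must be a single character, and split_with_mapfile and parts are made-up names.

```bash
#!/usr/bin/env bash
# split_with_mapfile STRING DELIM: store the pieces in the global array `parts`.
# Assumes bash 4.4+ (mapfile -d) and a single-character DELIM.
split_with_mapfile() {
  # printf '%s' adds no trailing newline, so nothing needs trimming afterwards.
  mapfile -d "$2" -t parts < <(printf '%s' "$1")
}

r=$'var1\nstill var1_var2_var3'
split_with_mapfile "$r" _
echo "${#parts[@]}"    # 3: the newline stays inside the first element
```

Here too the delimiter acts as a terminator, so a trailing _ does not produce an extra empty element.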

Tommy
Author by

Tommy

Updated on September 18, 2022

Comments

  • Tommy
    Tommy almost 2 years

    It's very easy to use split() in JavaScript to break a string into an array.

    What about shell script?

    Say I want to do this:

    $ script.sh var1_var2_var3

    When the user gives a string such as var1_var2_var3 to script.sh, the script should convert it into an array like

    array=( var1 var2 var3 )
    for name in ${array[@]}; do
        # some code
    done
    
    • gwillie
      gwillie almost 9 years
      what shell are you using, with bash you can do IFS='_' read -a array <<< "${string}"
    • Sobrique
      Sobrique almost 9 years
      perl can do that too. It's not "pure" shell, but it's quite common.
    • Sobrique
      Sobrique almost 9 years
      I tend to work on 'is it probably installed on my linux box by default' and don't fret the minutiae :)
  • Stéphane Chazelas
    Stéphane Chazelas almost 9 years
    That assumes $r doesn't contain newline characters or backslashes. Also note that it will only work in recent versions of the bash shell.
  • cuonglm
    cuonglm almost 9 years
    Is there any reason for special treatments with tab, space and newline?
  • Stéphane Chazelas
    Stéphane Chazelas almost 9 years
    @cuonglm, generally you want to split on words when the delimiters are blanks, in the case of non-blank delimiters (like to split $PATH on :) on the contrary, you generally want to preserve empty elements. Note that in the Bourne shell, all characters were receiving the special treatment, ksh changed that to have only the blank ones (only space, tab and newline though) treated specially.
  • fedorqui
    fedorqui almost 9 years
    @StéphaneChazelas good point. Yes, this is the "basic" case of a string. For the rest, everyone should go for your comprehensive answer. Regarding the versions of bash, read -a was introduced in bash 4, right?
  • cuonglm
    cuonglm almost 9 years
    Well, the recently added Bourne shell note surprised me. And for completeness, should you add a note on how zsh treats a string of 2 or more characters in ${(s:string:)var}? If added, I can delete my answer :)
  • Stéphane Chazelas
    Stéphane Chazelas almost 9 years
    sorry my bad, I thought <<< was added only recently to bash but it seems it's been there since 2.05b (2002). read -a is even older than that. <<< comes from zsh and is supported by ksh93 (and mksh and yash) as well but read -a is bash-specific (it's -A in ksh93, yash and zsh).
  • fedorqui
    fedorqui almost 9 years
    @StéphaneChazelas is there any "easy" way to find when these changes happened? I say "easy" not to dig into the release files, maybe a page showing them all.
  • Stéphane Chazelas
    Stéphane Chazelas almost 9 years
    I look at change logs for that. zsh also has a git repository with history as far back as 3.1.5 and its mailing list is used for tracking changes as well.
  • terdon
    terdon almost 9 years
    What do you mean by "Also note that the S in $IFS is for Delimiter, not Separator."? I understand the mechanics and that it ignores trailing separators but the S stands for Separator, not delimiter. At least, that's what my bash's manual says.
  • Stéphane Chazelas
    Stéphane Chazelas almost 9 years
    @terdon, $IFS comes from the Bourne shell where it was separator, ksh changed the behaviour without changing the name. I mention that to stress that split+glob (except in zsh or pdksh) doesn't simply split anymore.
  • Stéphane Chazelas
    Stéphane Chazelas almost 9 years
    @Gilles, note that bash now supports ${array[-1]} like zsh (also as lvalue). Older versions also support ${array[@]: -1} like ksh93. Those also work for sparse arrays.
  • fra-san
    fra-san almost 4 years
    (I'm looking for a clear explanation of the difference between "delimiter" and "separator", which seems surprisingly hard to find.) Would it be correct to say that the IFS characters in the Bourne shell were separators because in that shell the empty elements were always stripped? And that, conversely, in POSIX shells they are delimiters/terminators because any single instance of them delimits (terminates) a possibly empty element?
  • Stéphane Chazelas
    Stéphane Chazelas almost 4 years
    @fra-san, in the Bourne shell, :a::b: with IFS=: was split into a and b. In shells that treat IFS as separator, it's split into "", a, "", b and "". In shells that treat it as delimiter, same without the last "". That also applied to read. See Shell read *sometimes* strips trailing delimiter