Extracting tokens from a line of text

17,409

Solution 1

UPDATE: Please note that building an array this way is suitable only when IFS is a single non-whitespace character and the data string contains no consecutive delimiters.
For a way around this issue, and a similar solution, go to this Unix & Linux question ... (it is worth the read just to get more insight into IFS).


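Use the IFS (Internal Field Separator) of bash (and other POSIX shells, e.g. ash, ksh, zsh). Note that IFS itself is POSIX, but the array assignment B=($A) used below is not part of POSIX sh (ash, for example, has no arrays).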

Using IFS avoids an external call, and it simply allows for embedded spaces.

# ==============
  A='token0:token1:token2.y   token2.z '
  echo normal. $A
# Save IFS; Change IFS to ":" 
  SFI=$IFS; IFS=:     ##### This is the important bit part 1a 
  set -f              ##### ... and part 1b: disable globbing
  echo changed $A
  B=($A)  ### this is now parsed at :  (not at the default IFS whitespace) 
  echo B...... $B
  echo B[0]... ${B[0]}
  echo B[1]... ${B[1]}
  echo B[2]... ${B[2]}
  echo B[@]... ${B[@]}
# Reset the original IFS
  IFS=$SFI             ##### Important bit part 2a
  set +f               ##### ... and part 2b
  echo normal. $A

# Output
normal. token0:token1:token2.y token2.z
changed token0 token1 token2.y   token2.z 
B...... token0
B[0]... token0
B[1]... token1
B[2]... token2.y   token2.z 
B[@]... token0 token1 token2.y   token2.z 
normal. token0:token1:token2.y token2.z
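
A variant that avoids saving and restoring IFS by hand is sketched below. It assumes bash (the -a option of read is a bash extension, not POSIX) and input that fits on a single line; the variable names are just the ones from the example above. Because the IFS assignment is a prefix to the read command, it applies to that command only, and since read does not perform pathname expansion on the data, set -f is not needed here.

```shell
#!/usr/bin/env bash
# Sketch: split a colon-delimited string into a bash array.
A='token0:token1:token2.y   token2.z '

# IFS=: is in effect for this read only; -r keeps backslashes literal,
# -a stores the resulting fields in the array B.
IFS=: read -r -a B <<< "$A"

# printf reuses its format for each index/value pair.
printf 'B[%d]=<%s>\n' 0 "${B[0]}" 1 "${B[1]}" 2 "${B[2]}"
echo "count: ${#B[@]}"
```

The angle brackets make it visible that B[2] keeps its embedded and trailing spaces, because space is no longer a field separator while IFS is ":".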

Solution 2

There are two major approaches. One is IFS, demonstrated by fred.bear. This has the advantage of not requiring a separate process, but it can be tricky to get right when your input might contain characters that have special meaning to the shell. The other approach is to use a text processing utility. Field splitting is built into awk.

input="token1;token2;token3;token4"
awk -v input="$input" 'BEGIN {
    count = split(input, a, ";");
    print "first field: " a[1];
    print "second field: " a[2];
    print "number of fields: " count;
    exit;
}'

Awk is particularly appropriate when processing multiple inputs.

command_producing_semicolon_separated_data |
awk -F ';' '{
    print "first field: " $1;
    print "second field: " $2;
    print "number of fields: " NF;
}'
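
As a concrete, runnable instance of the pipeline above, the sketch below replaces command_producing_semicolon_separated_data with a printf that emits two made-up sample lines. The sample data and the variable name out are assumptions for illustration only. It also shows that a field containing a space stays intact when the separator is ";".

```shell
# Hypothetical sample lines standing in for
# command_producing_semicolon_separated_data.
out=$(printf '%s\n' 'a;b c;d' 'x;y' |
  awk -F ';' '{ print NR ": " NF " fields, second=[" $2 "]" }')

# One summary line per input line.
echo "$out"
```

awk splits every input line independently, so NF is 3 for the first line and 2 for the second, and $2 of the first line is "b c" with its space preserved.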

Author: Jas

Updated on September 17, 2022

Comments

  • Jas
    Jas over 1 year

    Using bash scripting and grep/awk/sed, how can I split a line matching a known pattern with a single character delimiter into an array, e.g. convert token1;token2;token3;token4 into a[0] = token1 ... a[3] = token4?

    • alex
      alex about 13 years
      You answer yourself with the question tags: sed, awk, regex :)
    • Jas
      Jas about 13 years
      @Patkos - bash scripting + grep/awk/sed , whichever works best...
    • Kusalananda
      Kusalananda over 5 years
      Unclear: It is unclear whether a[0], a[1] etc. refers to an array in the shell or in awk.
  • ddeimeke
    ddeimeke about 13 years
    What do you do if token2 contains a whitespace?
  • Smiley
    Smiley about 13 years
    In that case you better take the approach as fred.bear has suggested. However, please remember to restore your IFS to the original value in that case.
  • Admin
    Admin about 13 years
    Do NOT underestimate the importance of "Important bit part 2". I've seen extraordinarily hard to debug problems arise from getting Important bit part 2 wrong.
  • Peter.O
    Peter.O about 13 years
    @Gilles: Your mod to the code (set -f, set +f) puzzles me; I don't see the connection between field separators and globbing, but I'm happy to learn.. I am even more puzzled by the fact that when I introduce " * " to the first line, I get globbing of echo normal. $A which is the normal expectation.. However what has me completely baffled is that I get no globbing in any of the present lines when IFS=; This applies whether globbing is on or off.. And with globbing off in the same block, a new line echo * does expand!.. What's going on here? Globbing and no globbing together.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' about 13 years
    @fred.bear: It's not about separators, it's about unprotected variable substitution ($A). Two things happen to $A: field splitting (on IFS) and pathname expansion (globbing). Compare sh -c 'set -f; echo $0' '/*' with sh -c 'echo $0' '/*'. I don't know what precise command has you confused, post a standalone example if you want me to look at it.
  • Peter.O
    Peter.O about 13 years
    You really got me thinking this time!... and I've finally worked it out! ... The seemingly erratic behaviour I observed comes from "observational habit" (if it looks like a duck, quacks like a duck, and walks like a duck, it's a duck!) ... however all bets are off with space (the duck) when IFS=: ... the space may still be used by people as a visual delimiter, but globbing sees it only as just another character, and globbing needs a delimiter!... So echo * will expand "normally", but A=' *';echo $A will only expand for a file whose name has a leading space... Mystery unravelled! ;)
  • Peter.O
    Peter.O about 13 years
    PS... but I still don't see why I need to turn globbing off... (must go now.. I'll think about it as I drive... and read your reference links later...)
  • Peter.O
    Peter.O about 13 years
    @Gilles: If I'm wrong here, please let me know... (To glob or not to glob; that is the question) ... The answer is "Turn it off!", in all cases, unless you specifically and quite intentionally need it... As I've just found out, IFS can be deceptive because of its unusual/unfamiliar behaviour... I was focusing my question about globbing on the specific data in this example (which won't glob)... but now I'm a convert... globbing off (in the vast majority of cases)...
  • Michael Mrozek
    Michael Mrozek about 13 years
    There's a suggested code change pending here; I'll let you approve/reject it
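
The point made in the comments, that an unquoted $A undergoes both field splitting and pathname expansion, can be checked with a small experiment. This is only a sketch: the temporary directory, the file names and the variable names are made up for the demonstration, and it uses the positional parameters (set --) instead of an array so that it runs in any POSIX shell.

```shell
# Sketch: effect of set -f on an unquoted expansion split with IFS=":".
orig_dir=$PWD
tmp=$(mktemp -d) && cd "$tmp" || exit 1
touch file1 file2            # two files for "*" to match

A='*:token1'
old_ifs=$IFS
IFS=:

set -- $A                    # split on ":" AND glob: "*" becomes file1 file2
with_glob=$#

set -f                       # disable pathname expansion
set -- $A                    # split only: "*" stays a literal field
without_glob=$#
set +f

IFS=$old_ifs
echo "globbing on:  $with_glob fields"
echo "globbing off: $without_glob fields"

cd "$orig_dir" && rm -rf "$tmp"
```

With globbing enabled the "*" field is replaced by the two file names, giving three fields; with set -f the data is split into exactly the two fields it contains.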