Simplest way to extract a substring in a Unix shell?


Solution 1

cut might be useful:

$ echo hello | cut -c1,3
hl
$ echo hello | cut -c1-3
hel
$ echo hello | cut -c1-4
hell
$ echo hello | cut -c4-5
lo
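
The same approach works on a shell variable and on delimiter-separated fields. A minimal sketch, using only POSIX cut options; the variable names and the sample value are illustrative:

```bash
# Hypothetical sample value, for illustration only.
filename='payroll_2007-06-12.txt'

# Characters 1 through 7 of the value.
first7=$(printf '%s\n' "$filename" | cut -c1-7)

# Second field when splitting on "_".
after_underscore=$(printf '%s\n' "$filename" | cut -d_ -f2)

# Characters 9 through 12.
year=$(printf '%s\n' "$filename" | cut -c9-12)

printf '%s\n' "$first7" "$after_underscore" "$year"
```

This should print payroll, 2007-06-12.txt and 2007 on separate lines. Note that positions are 1-based, and that some cut implementations count bytes rather than characters, which matters for multibyte text.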

Shell builtins are good for this too. Here is a sample script:

#!/bin/bash
# Demonstrates the shell's built-in ability to split strings. Saves on
# using sed and awk in shell scripts, which can help performance.

shopt -s -o nounset   # without -s, shopt only reports the option's state
declare -rx       FILENAME=payroll_2007-06-12.txt

# Splits
declare -rx   NAME_PORTION=${FILENAME%.*}     # Left of .
declare -rx      EXTENSION=${FILENAME#*.}     # Right of .
declare -rx           NAME=${NAME_PORTION%_*} # Left of _
declare -rx           DATE=${NAME_PORTION#*_} # Right of _
declare -rx     YEAR_MONTH=${DATE%-*}         # Left of last -
declare -rx           YEAR=${YEAR_MONTH%-*}   # Left of -
declare -rx          MONTH=${YEAR_MONTH#*-}   # Right of -
declare -rx            DAY=${DATE##*-}        # Right of last -

clear

echo "  Variable: (${FILENAME})"
echo "  Filename: (${NAME_PORTION})"
echo " Extension: (${EXTENSION})"
echo "      Name: (${NAME})"
echo "      Date: (${DATE})"
echo "Year/Month: (${YEAR_MONTH})"
echo "      Year: (${YEAR})"
echo "     Month: (${MONTH})"
echo "       Day: (${DAY})"

That outputs:

  Variable: (payroll_2007-06-12.txt)
  Filename: (payroll_2007-06-12)
 Extension: (txt)
      Name: (payroll)
      Date: (2007-06-12)
Year/Month: (2007-06)
      Year: (2007)
     Month: (06)
       Day: (12)
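
When the position is known rather than a delimiter, bash (and, as far as I know, ksh93 and zsh) also offers offset/length expansion, ${var:offset:length}, a built-in counterpart of cut -c. A short sketch reusing the sample file name from the script above; the lowercase variable names are illustrative:

```bash
filename=payroll_2007-06-12.txt

# ${var:offset:length}; offsets are zero-based, unlike cut -c.
name=${filename:0:7}      # first seven characters
year=${filename:8:4}      # four characters starting at offset 8
ext=${filename: -3}       # last three characters; the space before - is required

# ${#var} is the length of the value.
len=${#filename}

echo "$name $year $ext $len"
```

This should print "payroll 2007 txt 22". This form is not part of POSIX sh, so a script that uses it needs a bash (or compatible) shebang.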

And as per Gnudiff's answer, there are always sed/awk/perl for when the going gets really tough.

Solution 2

Unix shells do not traditionally have regex support built in. Bash and Zsh both do, so if you use the =~ operator inside [[ ]] to compare a string to a regex, then:

You can get the substrings from the $BASH_REMATCH array in bash.

In Zsh, if the BASH_REMATCH shell option is set, the value is in the $BASH_REMATCH array; otherwise it's in the $MATCH/$match pair of variables (one scalar, the other an array). If the RE_MATCH_PCRE option is set, the PCRE engine is used; otherwise the system regexp library performs an extended regexp match, as in bash.

So, most simply: if you're using bash:

if [[ "$variable" =~ unquoted(.*)regex ]]; then
  matched_portion="${BASH_REMATCH[0]}"   # the whole match
  first_substring="${BASH_REMATCH[1]}"   # the first parenthesized group
fi
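
For a more concrete picture, here is a sketch that pulls several substrings out of one value in a single match. The sample file name, the pattern and the variable names are all illustrative assumptions. Only parenthesized groups populate indices 1 and up of BASH_REMATCH.

```bash
filename=payroll_2007-06-12.txt

# Keeping the pattern in a variable avoids quoting surprises; it must be
# expanded unquoted on the right-hand side of =~.
re='^([a-z]+)_([0-9]{4})-([0-9]{2})-([0-9]{2})\.(.+)$'

if [[ $filename =~ $re ]]; then
  whole=${BASH_REMATCH[0]}   # entire matched text
  name=${BASH_REMATCH[1]}    # first parenthesized group
  year=${BASH_REMATCH[2]}
  month=${BASH_REMATCH[3]}
  day=${BASH_REMATCH[4]}
  ext=${BASH_REMATCH[5]}
  echo "$name $year $month $day $ext"
else
  echo "no match" >&2
fi
```

This should print "payroll 2007 06 12 txt". Quoting the right-hand side (for example =~ "$re") makes bash treat it as a literal string rather than a regex, which is why the pattern is left unquoted.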

If you're not using Bash or Zsh, it gets more complicated as you need to use external commands.

Solution 3

Consider also /usr/bin/expr.

$ expr substr hello 2 3
ell

You can also match patterns against the beginning of strings; the result is the number of characters matched, or 0 if the pattern does not match at the start.

$ expr match hello h
1

$ expr match hello hell
4

$ expr match hello e
0

$ expr match hello 'h.*o'
5

$ expr match hello 'h.*l'
4

$ expr match hello 'h.*e'
2
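
To get the matched text rather than a count, expr can capture a group. As far as I know, substr and match are extensions (GNU expr has them), while POSIX only specifies the STRING : REGEX form, so the sketch below uses that. The sample value and variable names are illustrative:

```bash
filename='payroll_2007-06-12.txt'

# STRING : REGEX uses a basic regular expression that is implicitly
# anchored at the start of the string; it prints the match length.
matched_len=$(expr "$filename" : '[a-z]*')

# With \( \) in the pattern, expr prints the captured text instead of a count.
year=$(expr "$filename" : '[a-z]*_\([0-9]*\)-')
ext=$(expr "$filename" : '.*\.\(.*\)')

echo "$matched_len $year $ext"
```

This should print "7 2007 txt". One caveat: a value that starts with "-" or happens to look like an expr operator can be misparsed, so arbitrary input needs extra care (a common workaround is to prefix both the string and the pattern with a fixed letter).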

Solution 4

grep and sed are probably the tools you want, depending on the structure of the text.

sed should do the trick if you do not know what the substring is but do know some pattern around it.

For example, if you want to find a substring of digits that starts with a "#" sign, you could write something like:

sed 's/^.*#\([0-9]\+\)/\1/g' yourfile

grep could do something similar, but the question is what you need to do with the substring and whether we are talking about normal newline-delimited text or not.
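
As a sketch of both tools on one line of input: the sed command below is a variant of the one above that prints only the digits (it drops any trailing text and skips non-matching lines), and the grep pipeline relies on -o, which is a common extension rather than a POSIX requirement. The input line and variable names are made up for illustration:

```bash
# Hypothetical input line, for illustration only.
line='ticket #4821 was closed'

# -n plus the p flag prints only lines where the substitution succeeded;
# the trailing .* drops whatever follows the digits.
with_sed=$(printf '%s\n' "$line" | sed -n 's/^.*#\([0-9][0-9]*\).*$/\1/p')

# grep -o prints only the matching part; the second grep strips the "#".
with_grep=$(printf '%s\n' "$line" | grep -o '#[0-9][0-9]*' | grep -o '[0-9][0-9]*')

echo "$with_sed $with_grep"
```

Both variables should end up holding 4821. Writing [0-9][0-9]* instead of [0-9]\+ keeps the pattern within basic regular expressions, since \+ is a GNU extension.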

Author by

roger.james

Updated on September 17, 2022

Comments

  • roger.james
    roger.james over 1 year

    How can I write a program that turns this XML string

    <outer>
      <inner>
        <boom>
          <name>John</name>
          <address>New York City</address>
        </boom>
    
        <boom>
          <name>Daniel</name>
          <address>Los Angeles</address>
        </boom>
    
        <boom>
          <name>Joe</name>
          <address>Chicago</address>
        </boom>
      </inner>
    </outer>
    

    into this string

    name: John
    address: New York City
    
    name: Daniel
    address: Los Angeles
    
    name: Joe
    address: Chicago
    

    Can LINQ make it easier?

    • FooBee
      FooBee over 13 years
      It would help if you could describe what you want to extract from where. Even with complex tools like grep and sed, simple things tend to be simple.
    • Dennis Williamson
      Dennis Williamson over 13 years
      Your question is vague and too broad.
    • Arran
      Arran almost 11 years
      You could use LINQ2XML or HTMLAgilityPack ....
    • Geeky Guy
      Geeky Guy almost 11 years
      You should look into learning XPath: msdn.microsoft.com/en-us/magazine/cc164116.aspx
    • It'sNotALie.
      It'sNotALie. almost 11 years
      @Renan You most definitely should not.
    • Geeky Guy
      Geeky Guy almost 11 years
      @newStackExchangeInstance I'm interested in the reason for that. Could you please elaborate?
    • It'sNotALie.
      It'sNotALie. almost 11 years
      @Renan XPath is an old, clunky technology that is stringly typed, and the error is way more specific.
    • Geeky Guy
      Geeky Guy almost 11 years
      Thanks. +1 for the clarification.
    • erikbstack
      erikbstack about 10 years
      The question is quite good! But the marked answer has nothing to do with the question.
  • Dennis Williamson
    Dennis Williamson over 13 years
    While you might sometimes need to declare a variable read-only, it's rare in a context such as this one to need to export a variable. It would be much simpler to just do an assignment: var=value without using declare at all.
  • roger.james
    roger.james almost 11 years
    Can you generalize this to "loop" through the elements within "boom" (not assuming the fixed "name" and "address")?
  • roger.james
    roger.james almost 11 years
    Thanks, that works. Can you add another update that allows me to filter out boom elements satisfying some arbitrary condition, e.g. where address contains "New" (in this case only "New York City", and only the first boom would show up)?
  • roger.james
    roger.james almost 11 years
    I created a separate question for this, feel free to take a shot at it: stackoverflow.com/questions/17815090/…
  • erikbstack
    erikbstack about 10 years
    Why is this marked as the solution? I don't see the regexes, which were the actual question.
  • erikbstack
    erikbstack about 10 years
    This should be the marked answer. But I'm not sure why I always need to write \(...\) and you don't seem to need that.
  • ptman
    ptman about 10 years
    POSIX Bourne shell doesn't, but expr(1) supports it.
  • Eonil
    Eonil about 10 years
    @erikb I chose this for cut, because I put more weight on simplicity than on flexibility. I am sorry for the vague question about regex, but I realized that regex itself conflicts with simplicity.
  • erikbstack
    erikbstack about 10 years
    @Eonil Then would you mind changing the question title? People like me who look for regex solutions will hit this question and won't find a solution.
  • Eonil
    Eonil about 10 years
    @erikb That sounds reasonable. I did it!
  • Gnudiff
    Gnudiff almost 10 years
    Offhand, as far as I remember, it is dependent on the type of quotes you use: single or double quotes. It also might have caveats dependent on the shell used (I myself use tcsh).
  • Phil P
    Phil P almost 10 years
    Yes, expr(1) is the obvious example of "use external commands", but it gets "interesting" to safely capture values which can contain arbitrary characters.