Simplest way to extract a substring in Unix shell?
Solution 1
cut might be useful:
$ echo hello | cut -c1,3
hl
$ echo hello | cut -c1-3
hel
$ echo hello | cut -c1-4
hell
$ echo hello | cut -c4-5
lo
Shell builtins are good for this too. Here is a sample script:
#!/bin/bash
# Demonstrates the shell's built-in ability to split strings. Saves on
# using sed and awk in shell scripts. Can help performance.
shopt -s -o nounset # -s is needed to actually enable the option
declare -rx FILENAME=payroll_2007-06-12.txt
# Splits
declare -rx NAME_PORTION=${FILENAME%.*} # Left of .
declare -rx EXTENSION=${FILENAME#*.} # Right of .
declare -rx NAME=${NAME_PORTION%_*} # Left of _
declare -rx DATE=${NAME_PORTION#*_} # Right of _
declare -rx YEAR_MONTH=${DATE%-*} # Left of last -
declare -rx YEAR=${YEAR_MONTH%-*} # Left of -
declare -rx MONTH=${YEAR_MONTH#*-} # Right of -
declare -rx DAY=${DATE##*-} # Right of last -
clear
echo " Variable: (${FILENAME})"
echo " Filename: (${NAME_PORTION})"
echo " Extension: (${EXTENSION})"
echo " Name: (${NAME})"
echo " Date: (${DATE})"
echo "Year/Month: (${YEAR_MONTH})"
echo " Year: (${YEAR})"
echo " Month: (${MONTH})"
echo " Day: (${DAY})"
That outputs:
Variable: (payroll_2007-06-12.txt)
Filename: (payroll_2007-06-12)
Extension: (txt)
Name: (payroll)
Date: (2007-06-12)
Year/Month: (2007-06)
Year: (2007)
Month: (06)
Day: (12)
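For fixed character positions, bash (along with ksh and zsh) also has substring expansion, ${var:offset:length}, which avoids calling cut at all. A minimal sketch, assuming bash; the variable names are illustrative only, and offsets count from 0 rather than from 1 as with cut -c:

```shell
#!/bin/bash
# Substring expansion: ${var:offset:length}. Offsets are 0-based.
word=hello
first_three=${word:0:3}    # same characters as: echo hello | cut -c1-3
last_two=${word:3:2}       # same characters as: echo hello | cut -c4-5
from_second=${word:1}      # no length given: everything from offset 1 onward
echo "$first_three $last_two $from_second"
```

This prints "hel lo ello". It is not available in a strictly POSIX sh, so cut or expr remain the fallback there.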
And as per Gnudiff above, there is always sed/awk/perl for when the going gets really tough.
Solution 2
Unix shells do not traditionally have regex support built in. Bash and Zsh both do, so if you use the =~ operator to compare a string to a regex, then:
In Bash, you can get the substrings from the $BASH_REMATCH array.
In Zsh, if the BASH_REMATCH shell option is set, the value is in the $BASH_REMATCH array, else it's in the $MATCH/$match tied pair of variables (one scalar, the other an array). If the RE_MATCH_PCRE option is set, then the PCRE engine is used, else the system regexp libraries, for an extended regexp syntax match, as per Bash.
So, most simply: if you're using bash:
if [[ "$variable" =~ unquoted.*regex ]]; then
    matched_portion="${BASH_REMATCH[0]}"    # the whole match
    first_substring="${BASH_REMATCH[1]}"    # first (...) group, if the regex has one
fi
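As a concrete sketch of the bash case, reusing the file name from Solution 1: BASH_REMATCH[0] holds the whole match and BASH_REMATCH[1] onward hold the parenthesized groups, so the regex needs groups for the numbered entries to be filled. The variable names here are illustrative only.

```shell
#!/bin/bash
filename=payroll_2007-06-12.txt
# Keep the regex in a variable and expand it unquoted on the right of =~;
# quoting it there would make bash match it as a literal string.
pattern='_([0-9]{4})-([0-9]{2})-([0-9]{2})\.'
if [[ "$filename" =~ $pattern ]]; then
    matched_portion=${BASH_REMATCH[0]}   # whole match: _2007-06-12.
    year=${BASH_REMATCH[1]}              # first group
    month=${BASH_REMATCH[2]}             # second group
    day=${BASH_REMATCH[3]}               # third group
    echo "$year $month $day"
fi
```

This prints "2007 06 12". Storing the pattern in a variable also sidesteps most of the quoting questions that come up when the regex contains spaces or backslashes.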
If you're not using Bash or Zsh, it gets more complicated as you need to use external commands.
Solution 3
Consider also /usr/bin/expr.
$ expr substr hello 2 3
ell
You can also match patterns against the beginning of strings; the result is the number of characters matched.
$ expr match hello h
1
$ expr match hello hell
4
$ expr match hello e
0
$ expr match hello 'h.*o'
5
$ expr match hello 'h.*l'
4
$ expr match hello 'h.*e'
2
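Note that substr and match are not part of POSIX expr (GNU expr provides them). The portable spelling of match is the colon operator, expr STRING : REGEX, which is anchored at the start of the string in the same way. If the regex contains a \(...\) group, expr prints the text of that group instead of the match length. A small sketch, with illustrative variable names:

```shell
#!/bin/sh
# expr STRING : REGEX uses basic regular expressions, anchored at the start.
matched_length=$(expr hello : 'h.*l')    # no group: number of characters matched
inner=$(expr hello : 'h\(.*\)o')         # with a group: the captured text
echo "$matched_length $inner"
```

This prints "4 ell".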
Solution 4
grep and sed are probably the tools you want, depending on the structure of the text.
sed should do the trick if you do not know what the substring is but know some pattern that surrounds it.
For example, if you want to find a substring of digits preceded by a "#" sign, you could write something like:
sed 's/^.*#\([0-9]\+\)/\1/g' yourfile
grep could do something similar, but the question is what you need to do with the substring and whether we are talking about normal line-end delimited text or not.
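To print only the digits and nothing else, the substitution has to consume the rest of the line as well, and sed has to be told to stay quiet about lines that do not match. A sketch under those assumptions; the sample text is made up, and [0-9][0-9]* is used in place of \+ because \+ is a GNU sed extension:

```shell
#!/bin/sh
# Hypothetical sample input: only the first and third lines contain "#<digits>".
sample='order #1234 shipped
no number here
ticket #42 closed'

# -n suppresses the default output; the trailing p prints only lines where
# the substitution succeeded. The final .* discards whatever follows the digits.
digits=$(printf '%s\n' "$sample" | sed -n 's/^.*#\([0-9][0-9]*\).*$/\1/p')
echo "$digits"
```

This prints 1234 and 42 on separate lines. Because the leading .* is greedy, a line with several "#" signs yields the digits after the last one.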
roger.james
Updated on September 17, 2022
Comments
-
roger.james over 1 year
How can I write a program that turns this XML string
<outer> <inner> <boom> <name>John</name> <address>New York City</address> </boom> <boom> <name>Daniel</name> <address>Los Angeles</address> </boom> <boom> <name>Joe</name> <address>Chicago</address> </boom> </inner> </outer>
into this string
name: John address: New York City name: Daniel address: Los Angeles name: Joe address: Chicago
Can LINQ make it easier?
-
FooBee over 13 years
It would help if you could describe what you want to extract from where. Even with complex tools like grep and sed, simple things tend to be simple.
-
Dennis Williamson over 13 years
Your question is vague and too broad.
-
Arran almost 11 years
You could use LINQ2XML or HTMLAgilityPack ....
-
Tan almost 11 years
Look at this. stackoverflow.com/questions/12037085/…
-
Geeky Guy almost 11 years
You should look into learning XPath: msdn.microsoft.com/en-us/magazine/cc164116.aspx
-
It'sNotALie. almost 11 years
@Renan You most definitely should not.
-
Geeky Guy almost 11 years
@newStackExchangeInstance I'm interested in the reason for that. Could you please elaborate?
-
It'sNotALie. almost 11 years
@Renan XPath is an old, clunky technology that is stringly typed, and the error is way more specific.
-
Geeky Guy almost 11 years
Thanks. +1 for the clarification.
-
erikbstack about 10 years
The question is quite good! But the marked answer has nothing to do with the question.
-
-
Dennis Williamson over 13 years
While you might sometimes need to declare a variable read-only, it's rare in a context such as this one to need to export a variable. It would be much simpler to just do an assignment: var=value without using declare at all.
-
roger.james almost 11 years
Can you generalize this to "loop" through the elements within "boom" (not assuming the fixed "name" and "address")?
-
roger.james almost 11 years
Thanks, that works. Can you add another update that allows me to filter out boom elements satisfying some arbitrary condition, e.g. where address contains "New" (in this case only "New York City", and only the first boom would show up)?
-
roger.james almost 11 years
I created a separate question for this, feel free to take a shot at it: stackoverflow.com/questions/17815090/…
-
erikbstack about 10 years
Why is this marked as solution? I don't see the regexes which was the actual question.
-
erikbstack about 10 years
This should be marked answer. But I'm not sure why I always need to write \(...\) and you don't seem to need that.
-
ptman about 10 years
POSIX Bourne shell doesn't, but expr(1) supports
-
Eonil about 10 years
@erikb I chose this for cut, because I put most weight on simplicity rather than flexibility. I am sorry for the vague question about regex, but I realized that regex itself conflicts with simplicity.
-
erikbstack about 10 years
@Eonil Then would you mind changing the question title? People like me who look for regex solutions will hit this question and don't find a solution.
-
Eonil about 10 years
@erikb That sounds reasonable. I did it!
-
Gnudiff almost 10 years
Offhand, as far as I remember, it is dependent on the type of quotes you use: single or double quotes. Also might have caveats dependent on the shell used (I myself use tcsh).
-
Phil P almost 10 years
Yes, expr(1) is the obvious example of "use external commands", but it gets "interesting" to safely capture values which can contain arbitrary characters.