Extract substring using regular expression on a Unix file

Solution 1

GNU grep:

grep -oE '[[:alpha:]]+_[[:digit:]]+_[[:alpha:]]+_[[:digit:]]+' file

To guarantee that the match is surrounded by /, use the Perl-regex flag (-P) with look-behind and look-ahead assertions:

grep -oP '(?<=/)[[:alpha:]]+_[[:digit:]]+_[[:alpha:]]+_[[:digit:]]+(?=/)' file
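The -P flag is only available when grep was built with PCRE support. Where it is missing, a rough equivalent is to match the field together with its surrounding slashes and strip them afterwards. This is a minimal sketch, assuming a grep that supports -o; the function name extract_between_slashes is made up for illustration.

```shell
# Hypothetical helper: print every letters_digits_letters_digits field
# that sits between two slashes. Reads the named files, or stdin if none.
extract_between_slashes() {
    # Match the field together with its enclosing slashes, then delete the slashes.
    grep -oE '/[[:alpha:]]+_[[:digit:]]+_[[:alpha:]]+_[[:digit:]]+/' "$@" | tr -d '/'
}

# Two lines from the question's sample plus one made-up line without slashes.
printf '%s\n' \
    '/ABC/RTE/AD_900_VOP_123/OPP' \
    '/ABC/RTE/TRE/AD_900_VOP_145/BBB' \
    'xAD_900_VOP_999y' |
    extract_between_slashes
```

The made-up third line is skipped because its field is not enclosed in slashes. Unlike the look-around version, this consumes the slashes, so of two matching fields that share a single slash only the first would be reported.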

Solution 2

IMHO Perl offers the easiest and the most flexible solution:

perl -nE 'say $1 if m{/(\w+\d+\w+\d+)/};' input_file

Note that input_file is optional: if no file name is given, perl filters STDIN instead.

Solution 3

One way with awk:

awk -F/ '{for(i=1;i<=NF;i++)$0=($i~/_/)?$i:$0}1' file
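The one-liner above splits each line on / and replaces the whole line with the first field that contains an underscore; lines without such a field are printed unchanged. A more explicit variant, sketched below, tests each field against the full letters_digits_letters_digits shape and prints nothing for lines that have no such field. The function name extract_fields and the second test line are made up for illustration.

```shell
# Hypothetical helper: print each slash-separated field that looks like
# letters_digits_letters_digits. Reads the named files, or stdin if none.
extract_fields() {
    awk -F/ '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /^[A-Za-z]+_[0-9]+_[A-Za-z]+_[0-9]+$/)
                print $i
    }' "$@"
}

# One line from the question's sample and one made-up line whose only
# underscore field does not have the required shape.
printf '%s\n' '/ABC/RTE/RTD/AR_900_VOP_443/SDD' '/ABC/no_match/here' | extract_fields
```

The original one-liner would print no_match for the second line, since it only checks for an underscore.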

Author: g4ur4v

Updated on September 18, 2022

Comments

  • g4ur4v
    g4ur4v over 1 year

    I have a file with the contents below.

    /ABC/RTE/AD_900_VOP_123/OPP
    /ABC/RTE/TRE/AD_900_VOP_145/BBB
    /ABC/RTE/AN_900_VFP_124/FBF
    /ABC/RTE/HD_900_FOP_153/WEW
    /ABD/RDV/AD_900_VOP_123/OPP
    /ABC/RTE/WD_900_VOP_123/GRR/TRD
    /ABC/RTE/RTD/AR_900_VOP_443/SDD
    

    How can I use a regular expression on this file to get output like the following?

    AD_900_VOP_123
    AD_900_VOP_145
    AN_900_VFP_124
    HD_900_FOP_153
    AD_900_VOP_123
    WD_900_VOP_123
    AR_900_VOP_443
    
    • Admin
      Admin almost 11 years
      What is the criterion for picking the field of interest?
    • Admin
      Admin almost 11 years
      The criterion is any pattern like <letters>_<digits>_<letters>_<digits> that falls between two / characters.
    • Admin
      Admin almost 11 years
      awk -F/ '{print $3}'
    • Admin
      Admin over 9 years
      awk -F/ '{print $(NF-1)}' to find last dir (if those are dirs)
  • g4ur4v
    g4ur4v almost 11 years
    Can you please explain it in one or two lines?
  • g4ur4v
    g4ur4v almost 11 years
    Hi, I just ran it, but I get the entire input as the result:

    $ sed 's|.*/\([0-9_A-Z]\+900[0-9_A-Z]\+\)/.*|\1|' tstfile.txt
    /ABC/RTE/AD_900_VOP_123/OPP
    /ABC/RTE/TRE/AD_900_VOP_145/BBB
    /ABC/RTE/AN_900_VFP_124/FBF
    /ABC/RTE/HD_900_FOP_153/WEW
    /ABD/RDV/AD_900_VOP_123/OPP
    /ABC/RTE/WD_900_VOP_123/GRR/TRD
    /ABC/RTE/RTD/AR_900_VOP_443/SDD
  • g4ur4v
    g4ur4v almost 11 years
    No, I am not :)
  • g4ur4v
    g4ur4v almost 11 years
    Did you run it?
  • slm
    slm almost 11 years
    @g4ur4v - Sorry I had to ask 8-). What version of sed are you using? I just ran what you sent me and it worked just fine. You can check with this command:

    $ sed --version
    GNU sed version 4.2.1
  • g4ur4v
    g4ur4v almost 11 years
    I am using MobaXterm on Windows; maybe that's why I am not getting the desired result.

    $ sed --version
    This is not GNU sed version 4.0
  • slm
    slm almost 11 years
    @g4ur4v - Ah that makes more sense. MobaXterm doesn't include a 4.x GNU version of sed. I've updated your question to include a new tag for MobaXterm so that others are aware that you're using it - and that the Q&A are specific to that.
  • Johan
    Johan over 9 years
    A variation on this which is slightly longer but 100 times easier to read (and write!) is sed 's|.*/\(.._..._..._...\)/.*|\1|' <input
  • mikeserv
    mikeserv over 9 years
    @Johan - it is also far less capable - your version strictly delimits each field, mine will work with fields of any length. And I don't consider it easier to read or write.
  • mikeserv
    mikeserv over 9 years
    Using . like that in a global is usually looking for trouble. What if one of the fields winds up being only a single char? That field (and one or two that follow) goes poof. sed 's|/[^/_]\{3\}||g' would at least serve to ensure that you don't remove anything you shouldn't, though in some cases might result in your not removing something you should, which is usually the better alternative, as I consider it.
  • Johan
    Johan over 9 years
    @mikeserv It handles the sample data provided, not all possible types of data.
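The sed command quoted in the comments uses \+, which is a GNU extension to basic regular expressions. Whether that is what went wrong under MobaXterm is an assumption, but a variant restricted to POSIX syntax can be sketched as follows. It uses \{1,\} in place of \+, and adds -n with the p flag so that lines without a match are suppressed instead of being echoed unchanged. The function name extract_with_sed and the second test line are made up for illustration.

```shell
# Hypothetical helper: POSIX-only sed version. Reads the named files,
# or stdin if none are given.
extract_with_sed() {
    # -n plus the trailing p prints a line only when the substitution succeeded.
    sed -n 's|.*/\([A-Za-z]\{1,\}_[0-9]\{1,\}_[A-Za-z]\{1,\}_[0-9]\{1,\}\)/.*|\1|p' "$@"
}

# One line from the question's sample and one made-up line with no matching field.
printf '%s\n' '/ABC/RTE/AN_900_VFP_124/FBF' '/ABC/RTE/OPP' | extract_with_sed
```

Only the matching field of the first line is printed; the second line produces no output.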