Faster awk script to get the substring / string we wanted


Solution 1

Try this:

awk -F'<25106>=' '{print substr($2,1,index($2,"]")-1);}'

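No explicit regex matching, just string operations (although awk still treats the multi-character -F value as a regular expression when it splits fields).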
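
As a rough sketch, the same approach can be tried on a single sample line. The sample line and the function name extract_account are made up for illustration; the NF > 1 guard is an addition that skips lines without the tag. Note that awk's substr() counts from 1, so the sketch starts at position 1.

```shell
# Sample line modelled on the one in the question (the exact layout is an assumption).
line='ORDER EVENT ......... [Account<25106>=ACCT1] [Destination...]'

extract_account() {
    # Split each line on the literal tag, then cut $2 at the first "]".
    # NF > 1 is true only when the tag occurred, so other lines print nothing.
    awk -F'<25106>=' 'NF > 1 { print substr($2, 1, index($2, "]") - 1) }'
}

printf '%s\n' "$line" | extract_account
```

Lines that lack the tag produce no output here, whereas the one-liner above prints an empty line for them.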

Solution 2

If you will only print this number, you can try this:

echo "ORDER EVENT ......... [Account<25106>=ACCT1]" | awk -F'<25106>=' '{print $2}' | sed -e 's/].*//'

EDIT: sed-only solution:

echo "ORDER EVENT ......... [Account<25106>=ACCT1]" | sed -e 's/.*25106>=//' -e 's/].*//'

EDIT2:

awk '{if (split($0, a, "25106>=") > 1) {print substr(a[2], 1, index(a[2], "]")-1)} }'

Solution 3

If you have GNU awk (gawk) you can use the match() function with capturing parentheses:

gawk 'match($0, /<25106>=([^]]+)/, ary) {account = ary[1]}'

Alternatively, you can use a multi-character field separator:

awk -F '<25106>=' '{split($2, ary, /\]/); account = ary[1]}'
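
If gawk is not available, a similar result should be possible with the two-argument match() found in any POSIX awk, using RSTART and RLENGTH instead of a capture array. This is only a sketch for portability, not a speed claim; the function name extract_account_posix and the sample line are made up for illustration.

```shell
extract_account_posix() {
    # match() sets RSTART and RLENGTH to the position and length of the match.
    # The match starts at the tag, so skip its 8 characters ("<25106>=").
    awk 'match($0, /<25106>=[^]]+/) { print substr($0, RSTART + 8, RLENGTH - 8) }'
}

printf '%s\n' 'ORDER EVENT ......... [Account<25106>=ACCT1] [Destination...]' | extract_account_posix
```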

Admin
Author by

Admin

Updated on September 18, 2022

Comments

  • Admin
    Admin almost 2 years
    ORDER EVENT .........[] [] ... so many other tags... [Account<25106>=ACCT1] [Destination...] .. so many other tags.
    

    I am currently extracting the account as shown below. I tried using match() in awk, but it seems slower. Can you suggest anything that is even faster than this?

    j = index($0, "<25106>=");
    account=substr($0, j + accountTagLength);
    account=substr(account,1,index(account, "]") - 1);
    

    Account is not the 2nd field, and the field position may vary.
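
For reference, the index/substr fragment above could be filled out into a complete command along these lines. This is a sketch under assumptions: the variable taglen stands in for accountTagLength, the function name extract_account_index is invented, and the j > 0 guard mirrors the one used in the timing runs.

```shell
extract_account_index() {
    awk -v tag='<25106>=' '
        BEGIN { taglen = length(tag) }   # plays the role of accountTagLength
        {
            j = index($0, tag)           # plain string search, no regex
            if (j > 0) {                 # skip lines without the tag
                rest = substr($0, j + taglen)
                print substr(rest, 1, index(rest, "]") - 1)
            }
        }'
}

# The account tag is deliberately not the second bracketed field here.
printf '%s\n' 'ORDER EVENT [A<1>=x] [Account<25106>=ACCT1] [Dest<2>=y]' | extract_account_index
```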

    Timings:

    bash-3.2$ time head -1000000 temp.log | awk -F'<25106>=' '{print $2}' | sed -e 's/].*//' > /dev/null
    
    real    0m2.410s
    user    0m2.782s
    sys     0m0.319s
    bash-3.2$ time head -1000000 temp.log | awk '{j = index($0, "25106>="); if (j > 0) { account=substr($0, j + 7); substr(account,1,index(account, "]") - 1);} }'
    
    real    0m1.690s
    user    0m1.737s
    sys     0m0.448s
    bash-3.2$ time head -1000000 temp.log | awk '{j = index($0, "25106>="); if (j > 0) { account=substr($0, j + 7); substr(account,1,index(account, "]") - 1);} }'
    
    real    0m1.588s
    user    0m1.733s
    sys     0m0.179s
    bash-3.2$ time head -1000000 temp.log | awk -F'<25106>=' '{print $2}' | sed -e 's/].*//' > /dev/null                               
    real    0m2.384s
    user    0m2.762s
    sys     0m0.272s
    bash-3.2$ time head -1000000 temp.log | awk '{j = index($0, "25106>="); if (j > 0) { account=substr($0, j + 7); substr(account,1,index(account, "]") - 1);} }'
    
    real    0m1.703s
    user    0m1.709s
    sys     0m0.484s
    
    bash-3.2$ time head -1000000 dumper/cam_verbose.20120220.000.log | gawk 'match($0, /<25106>=([^]]+)/, ary) {account = ary[1]}'
    
    real    0m3.449s
    user    0m3.661s
    sys     0m0.290s
    bash-3.2$ time head -1000000 dumper/cam_verbose.20120220.000.log | gawk 'match($0, /<25106>=([^]]+)/, ary) {account = ary[1]}'
    
    real    0m3.410s
    user    0m3.551s
    sys     0m0.236s
    bash-3.2$ time head -1000000 dumper/cam_verbose.20120220.000.log | gawk 'match($0, /<25106>=([^]]+)/, ary) {account = ary[1]}'
    
    real    0m3.361s
    user    0m3.487s
    sys     0m0.286s
    bash-3.2$ time head -1000000 dumper/cam_verbose.20120220.000.log | awk '{j = index($0, "25106>="); if (j > 0) { account=substr($0, j + 7); substr(account,1,index(account, "]") - 1);} }'
    
    real    0m1.626s
    user    0m1.831s
    sys     0m0.263s
    bash-3.2$ time head -1000000 dumper/cam_verbose.20120220.000.log | awk -F '<25106>=' '{split($2, ary, /\]/); account = ary[1]}'
    
    real    0m2.721s
    user    0m2.808s
    sys     0m0.265s
    bash-3.2$ time head -1000000 dumper/cam_verbose.20120220.000.log | awk -F '<25106>=' '{split($2, ary, /\]/); account = ary[1]}'
    
    real    0m2.787s
    user    0m2.863s
    sys     0m0.516s
    bash-3.2$ time head -1000000 dumper/cam_verbose.20120220.000.log | awk -F '<25106>=' '{split($2, ary, /\]/); account = ary[1]}'
    
    real    0m2.724s
    user    0m2.882s
    sys     0m0.278s
    bash-3.2$ time head -1000000 dumper/cam_verbose.20120220.000.log | awk '{j = index($0, "25106>="); if (j > 0) { account=substr($0, j + 7); substr(account,1,index(account, "]") - 1);} }'
    
    real    0m1.576s
    user    0m1.748s
    sys     0m0.235s
    
    bash-3.2$ time head -100000 ORDER_EVENTS_CHAS_20120224.log | grep -oE '<25106>=([A-Za-z0-9]*)+' | cut -d= -f2 > /dev/null                                     
    real    0m2.098s
    user    0m2.131s
    sys     0m0.033s
    bash-3.2$ time head -100000 ORDER_EVENTS_CHAS_20120224.log | awk '{j = index($0, "25106>="); if (j > 0) { account=substr($0, j + 7); print substr(account,1,index(account, "]") - 1);} }' > /dev/null
    
    real    0m0.253s
    user    0m0.275s
    sys     0m0.040s
    bash-3.2$ time head -100000 ORDER_EVENTS_CHAS_20120224.log | grep -oE '<25106>=([A-Za-z0-9]*)+' | cut -d= -f2 > /dev/null                                     
    real    0m2.070s
    user    0m2.105s
    sys     0m0.034s
    bash-3.2$ time head -100000 ORDER_EVENTS_CHAS_20120224.log | grep -oE '<25106>=([A-Za-z0-9]*)+' > /dev/null
    
    real    0m2.065s
    user    0m2.090s
    sys     0m0.037s
    bash-3.2$ time head -1000000 ORDER_EVENTS_CHAS_20120228.log | awk -F'<25106>=' '{ substr($2,0,index($2,"]")-1);}'
    
    real    0m3.426s
    user    0m3.637s
    sys     0m0.412s
    bash-3.2$ time head -1000000 ORDER_EVENTS_CHAS_20120228.log | awk -F'<25106>=' '{ substr($2,0,index($2,"]")-1);}'
    
    real    0m3.463s
    user    0m3.603s
    sys     0m0.408s
    bash-3.2$ time head -1000000 ORDER_EVENTS_CHAS_20120228.log | awk '{j = index($0, "25106>="); if (j > 0) { account=substr($0, j + 7); substr(account,1,index(account, "]") - 1);} }'
    
    real    0m2.247s
    user    0m2.307s
    sys     0m0.649s
    
    • Admin
      Admin over 12 years
      I just found out that, even though I am not looking for a grep solution, regular expressions with grep are very slow.
    • Richard Fortune
      Richard Fortune about 12 years
      Of course a literal string comparison is going to be faster than any regex comparison. What you propose above is the straightforward implementation of that, so I wouldn't expect there to be anything faster.
  • Admin
    Admin over 12 years
    Sorry, Jan, my question was vague. The account is not the 2nd field, and the field position may vary. I updated my question.
  • Jan Marek
    Jan Marek over 12 years
    @srikanthradix Updated.
  • Admin
    Admin over 12 years
    Updated with timings from the time command. index + substr still seems to be faster.
  • Admin
    Admin over 12 years
    Anything that uses a regex, such as match(), is slow. I have tried it and updated the timings.
  • Jan Marek
    Jan Marek over 12 years
    @srikanthradix Would a sed-only solution be faster?
  • Admin
    Admin over 12 years
    Actually, I am trying to find out whether there is anything faster only with awk.
  • Rag
    Rag over 12 years
    Even that is slower than index and substr. I updated the stats; see the end of the question.
  • Admin
    Admin over 12 years
    Just out of curiosity, I tried the sed-only solution. It is much slower:
    bash-3.2$ time head -1000000 ORDER_EVENTS_CHAS_20120228.log | sed -e 's/.*25106>=//' -e 's/].*//' > /dev/null
    real    0m9.956s
    user    0m10.167s
    sys     0m0.441s
    bash-3.2$ time head -1000000 ORDER_EVENTS_CHAS_20120228.log | sed -e 's/.*25106>=//' -e 's/].*//' > /dev/null
    real    0m10.083s
    user    0m10.254s
    sys     0m0.343s
  • Jan Marek
    Jan Marek over 12 years
    @srikanthradix I've tried to add another solution.