Use bash to extract number from between square brackets


Solution 1

You can achieve this with a single grep command. GNU grep accepts Perl-compatible regular expressions (-P), which support both \K (drop everything matched so far from the reported match) and zero-width lookahead assertions ((?= )):

grep -oP '^\[\K\d+(?=\])' infile

As written, that will send the output to your terminal. To redirect it to a file, use:

grep -oP '^\[\K\d+(?=\])' infile > outfile

This method has the advantage of brevity and simplicity. Because of -o, grep prints only the matched text, which

  • is preceded by (\K)

    • a [ character (\[) -- \ is needed as [ otherwise has a special meaning in regular expressions
    • that appears at the beginning of a line (^);

  • consists of one or more (+) digits (\d);

  • is followed by ((?= ))

    • a ] character (\]) -- like with [, \ forces ] to be matched literally.
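To see what the anchoring and the lookarounds contribute, here is a rough Python sketch of the same idea. Python's built-in re module has no \K, so a lookbehind stands in for it. The helper name first_match and the second sample line (with a made-up [7] in the middle) are assumptions for illustration only.

```python
import re

# Python's built-in re module has no \K, so a lookbehind plays that role here.
ANCHORED = re.compile(r"(?<=^\[)\d+(?=\])")
UNANCHORED = re.compile(r"(?<=\[)\d+(?=\])")  # same pattern without ^


def first_match(pattern, line):
    """Return the first match of pattern in line, or None (hypothetical helper)."""
    m = pattern.search(line)
    return m.group() if m else None


line_with_prefix = "[581]((501:0.00024264,451:0.00024264):0.000316197);"
# Made-up continuation line that contains brackets, but not at the start.
line_without_prefix = "0.127506,((140:0.00253019,[7]117:0.00253019)"

print(first_match(ANCHORED, line_with_prefix))       # 581
print(first_match(ANCHORED, line_without_prefix))    # None
print(first_match(UNANCHORED, line_without_prefix))  # 7
```

Only the digits are reported because the brackets sit inside zero-width assertions; dropping ^ lets a bracketed number anywhere in the line match.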

Solution 2

Using sed:

  • < inputfile sed -n 's/^\[\([0-9]*\)\].*$/\1/p' > out

Command breakdown:

  • < inputfile: redirects the content of inputfile to stdin
  • -n: suppresses sed's automatic printing of every line
  • > out: redirects the content of stdout to out

Regex breakdown:

  • s: performs a substitution
  • /: starts the regex
  • ^: matches the start of the line
  • \[: matches a [ character
  • \(: starts the capturing group
  • [0-9]*: matches zero or more digits
  • \): stops the capturing group
  • \]: matches a ] character
  • .*: matches zero or more of any character
  • $: matches the end of the line
  • /: stops the regex / starts the replacement
  • \1: replaces with the first capturing group
  • /: stops the replacement
  • p: prints the line only if the substitution was made
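The interplay of -n, s and p can be mimicked in Python as a rough sketch: substitute on each line, and keep the result only when a substitution actually happened. The function name sed_n_s_p is hypothetical, and this is an approximation of sed's behavior, not a reimplementation.

```python
import re

# Same regex as the sed command, written in Python syntax.
SED_PATTERN = re.compile(r"^\[([0-9]*)\].*$")


def sed_n_s_p(lines):
    """Approximate sed -n 's/^\\[\\([0-9]*\\)\\].*$/\\1/p' (hypothetical helper)."""
    printed = []
    for line in lines:
        line = line.rstrip("\n")
        # subn returns the new string and the number of substitutions made.
        new_line, count = SED_PATTERN.subn(r"\1", line, count=1)
        if count:  # the p flag: print only when the substitution was made
            printed.append(new_line)
    return printed


print(sed_n_s_p(["[581]((501:0.1);\n", "0.127506,((140:0.002\n", "[50]((1:0.1);\n"]))
# ['581', '50']
```

Because [0-9]* also matches zero digits, a line starting with [] would produce an empty output line; replacing * with a one-or-more quantifier would avoid that.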

Using grep+tr (if you need a method that runs both on Ubuntu and on another OS whose grep doesn't support PCRE--otherwise, refer to Eliah Kagan's grep-only version):

  • < inputfile grep -o '^\[[0-9]*\]' | tr -d '[]' > out

Command breakdown:

  • < inputfile in grep: redirects the content of inputfile to stdin
  • -o in grep: prints only the match
  • -d in tr: deletes the characters
  • > out in tr: redirects the content of stdout to out

Regex breakdown:

  • ^: matches the start of the line
  • \[: matches a [ character
  • [0-9]*: matches zero or more digits
  • \]: matches a ] character
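The two-stage idea (keep only the bracketed prefix, then delete the bracket characters) could be sketched in Python as follows. The name grep_o_then_tr is made up for this illustration; str.translate plays the role of tr -d.

```python
import re

GREP_PATTERN = re.compile(r"^\[[0-9]*\]")        # what grep -o keeps
DELETE_BRACKETS = str.maketrans("", "", "[]")    # what tr -d '[]' removes


def grep_o_then_tr(lines):
    """Yield the bracket-free prefix of each line that starts with [digits]."""
    for line in lines:
        m = GREP_PATTERN.search(line)
        if m:
            yield m.group().translate(DELETE_BRACKETS)


sample = ["[50]((1:0.1):0.2);", "0.0105973,66:0.0105973):0.0084867"]
print(list(grep_o_then_tr(sample)))  # ['50']
```

Deleting brackets in a second step is safe here only because the first step has already discarded everything except the [digits] prefix.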

Solution 3

The Perl way:

perl -ne 'print "$1\n" if /^\[([0-9]*)\].*/' testdata > out

or with awk:

awk 'match($0, /^\[[0-9]*\]/) {print substr($0, RSTART + 1, RLENGTH - 2)}' testdata > out

Regex used in both cases (the Perl version additionally wraps the digits in a capturing group and appends .*):

^\[[0-9]*\]

Explanation

  • /^\[[0-9]*\]/

    • ^ asserts position at the start of the line (each input line is matched on its own)

    • \[ matches the character [ literally

    • [0-9]* match a single character present in the list below

      • Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]

      • 0-9 a single character in the range between 0 and 9

    • \] matches the character ] literally

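The awk version relies on RSTART and RLENGTH, which are 1-based. A small Python sketch of the same arithmetic may make the "+ 1" and "- 2" easier to follow; awk_substr and extract are hypothetical helper names, and the match object's start() and end() stand in for awk's variables.

```python
import re

AWK_PATTERN = re.compile(r"^\[[0-9]*\]")


def awk_substr(s, start, length):
    """awk-style substr: 1-based start index and a length."""
    return s[start - 1 : start - 1 + length]


def extract(line):
    """Mirror match($0, /.../) { print substr($0, RSTART + 1, RLENGTH - 2) }."""
    m = AWK_PATTERN.search(line)
    if m is None:
        return None
    rstart = m.start() + 1         # awk's RSTART counts from 1
    rlength = m.end() - m.start()  # awk's RLENGTH is the match length
    # Skip the opening bracket (+ 1) and drop both brackets from the length (- 2).
    return awk_substr(line, rstart + 1, rlength - 2)


print(extract("[581]((501:0.00024264,451:0.00024264)"))  # 581
print(extract("0.127506,((140:0.00253019"))              # None
```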

Solution 4

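Use this in Bash. grep -o prints only the bracketed number, anchored to the start of each line (-h suppresses file names), and sed then strips the brackets:

grep -oh '^\[[0-9][0-9]*\]' mytestfile | sed 's/^\[\(.*\)\]$/\1/' > myresultfile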

Solution 5

A Python solution using the re module, covering two possible readings of the question:

#!/usr/bin/env python3
import re

with open('/path/to/file.txt') as f:
    for line in f:
        # Case 1: digits inside [] at the start of the line
        digits_case_1 = re.search(r'(?<=^\[)\d+(?=\])', line)
        # Case 2: the same, but only if the line also ends with ");"
        digits_case_2 = re.search(r'(?<=^\[)\d+(?=\].*\);$)', line)
        if digits_case_1:
            print('Not considering ");" at end: ' + digits_case_1.group())
        if digits_case_2:
            print('Considering ");" at end: ' + digits_case_2.group())

Output :

Not considering ");" at end: 581
Not considering ");" at end: 50
Considering ");" at end: 50

I have considered two situations here because the question is not entirely clear to me.

  • digits_case_1 matches the digits between [] at the start of the line, regardless of whether the line ends with );.

  • digits_case_2 matches the digits between [] at the start of the line only if the line also ends with );.
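Since the question asks for the numbers to be written to a new file, the two cases could be folded into one function, roughly as sketched below. The function name extract_to_file, its require_terminator flag, and the sample input are assumptions for illustration. Unlike the digits_case_2 regex, this sketch tolerates trailing whitespace after ); which may or may not be what is wanted.

```python
import re
import tempfile
from pathlib import Path

PREFIX = re.compile(r"\[(\d+)\]")  # re.match anchors this at the start of the line


def extract_to_file(src, dst, require_terminator=False):
    """Write each leading [number] from src to dst, one per line; return the count."""
    count = 0
    with open(src) as fin, open(dst, "w") as fout:
        for line in fin:
            m = PREFIX.match(line)
            if not m:
                continue
            # Assumption: trailing whitespace after ");" is ignored.
            if require_terminator and not line.rstrip().endswith(");"):
                continue
            fout.write(m.group(1) + "\n")
            count += 1
    return count


# Small demonstration on made-up data in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "infile"
    dst = Path(tmp) / "outfile"
    src.write_text("[581]((1:0.1,\n2:0.2):0.3);\n[50]((1:0.1):0.2);\n")
    print(extract_to_file(src, dst), dst.read_text().split())
    print(extract_to_file(src, dst, require_terminator=True), dst.read_text().split())
```

Reading line by line keeps memory use flat, so the number of lines does not need to be known beforehand.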

Author: user3069326

Updated on September 18, 2022

Comments

  • user3069326
    user3069326 over 1 year

    My file looks like this:

    [581]((((((((501:0.00024264,451:0.00024264):0.000316197,310:0.000558837):0.00857295,((589:0.000409158,538:0.000409158):0.000658084,207:0.00106724
    ):0.00806454):0.0429702,(((198:0.00390205,91:0.00390205):0.016191,79:0.0200931):0.0147515,(187:0.00133008,50:0.00133008):0.0335145):0.0172574):0.
    127506,((140:0.00253019,117:0.00253019):0.0533693,(((533:0.00728707,(463:8.80494e-05,450:8.80494e-05):0.00719902):0.0217722,389:0.0290593):0.0253
    931,(((141:0.018004,107:0.018004):0.0143861,(111:0.00396127,(106:0.00161229,12:0.00161229):0.00234898):0.0284289):0.0145736,(129:0.0195982,((123:
    0.0105973,66:0.0105973):0.0084867,10:0.019084):0.000514243):0.0273656):0.00748854):0.00144709):0.123708):0.000944439,((181:0.00108761,71:0.00108761):0.0819772);  
    [50]((((((((501:0.00024264,451:0.00024264):0.000316197,310:0.000558837):0.00857295,((589:0.000409158,538:0.000409158):0.000658084,207:0.00106724):0.00806454):0.0429702,(((198:0.00390205,91:0.00390205):0.016191,79:0.0200931):0.0147515,(187:0.00133008,50:0.00133008):0.0335145):0.0172574):0.127506,((140:0.00253019,117:0.00253019):0.0533693,(((533:0.00728707,(463:8.80494e-05,450:8.80494e-05):0.00719902):0.0217722,389:0.0290593):0.0253931,(((141:0.018004,107:0.018004):0.0143861,(111:0.00396127,(106:0.00161229,12:0.00161229):0.00234898):0.0284289):0.0145736,(129:0.0195982,((123:0.0105973,66:0.0105973):0.0084867,10:0.019084):0.000514243):0.0273656):0.00748854):0.00144709):0.123708):0.000944439,((181:0.00108761,71:0.00108761):0.0819772);
    

    Every new line starts with the pattern [number]. Every line ends with the pattern );.

    I need to extract the numbers in the square brackets from the beginning of every line, and write them into a new file. I don't know how many lines the file has beforehand.

    • heemayl
      heemayl about 9 years
      In your example, not every line starts with [num] and not every line ends with ); (perhaps this was lost during formatting). Please fix it.
    • Eliah Kagan
      Eliah Kagan about 9 years
      @user3069326 Do you want to match only from lines that end in );? Out of the six lines you've shown us, only 2 end that way, only 2 start with a number in [ ] brackets, and only 1 line both starts with a number in [ ] brackets and ends in );. Or do you mean you actually want a method that ignores line breaks and instead treats ); like a line break? If you want to split on ); rather than newlines, you should either edit your question to clarify that, or post a new one--which might be better since there are already 5 answers posted based on your question as originally asked.
    • Tim
      Tim almost 9 years
      @User3069626 Don't damage posts please.
    • user3069326
      user3069326 almost 9 years
      The questions were off topic; it should be deleted.
    • Tim
      Tim almost 9 years
      @user no it shouldn't. We leave it around. Also, I've voted to reopen it. Just because it's closed doesn't mean it won't help someone.
    • user3069326
      user3069326 almost 9 years
      No, this is against the rules. This should be deleted, and I will need to report that behaviour.
    • Tim
      Tim almost 9 years
      No, it shouldn't be deleted, it's not off topic (it seems to have been closed incorrectly). Do not attempt to delete this. Even if it is off topic it should not be deleted. If you continue to remove it I will flag for moderator.
  • m. öztürk
    m. öztürk about 9 years
    Extra protection against multiple inputs (this is my habit in cases like this).
  • m. öztürk
    m. öztürk about 9 years
    Nice, indeed. :)
  • Eliah Kagan
    Eliah Kagan about 9 years
    You don't need to pipe from cat, since awk accepts an input filename as an argument: awk -F\] '{ print $1 }' $FILE | grep '\[' | tr -d '['
  • Eliah Kagan
    Eliah Kagan about 9 years
    @kos I think the grep/tr way is valuable and interesting. Unless there's some bug in it, I hope you put it (or something like it) back--of course it's your choice. While the grep-only way I posted may often be preferable, sometimes one wishes to write a command that is portable to systems whose grep doesn't support Perl regular expressions (as is the case for most implementations other than GNU grep). You could separate the grep/tr way out into a separate, second section if you don't want it to distract from your primary recommended method.
  • kos
    kos about 9 years
    @EliahKagan Yes, I removed it because your version was better. I also thought that every grep version built into Ubuntu supports PCRE, but overall you're right, it definitely can't hurt. I'm restoring it.
  • Byte Commander
    Byte Commander about 9 years
    Maybe you could improve your answer by explaining what each part of your command line does? Thank you!
  • Eliah Kagan
    Eliah Kagan about 9 years
    @user3069326 If R says Error: 'xyz' is an unrecognized escape in character string, you can escape the first character of the problematic text with a \. This fixes the problem fully or gives a new error about later text. In your case, put a \ before [. For me R does something else: paste("grep -oP '^[\K\d+(?=])' infile > outfile") produces Error: '\K' is an unrecognized escape in character string starting ""grep -oP '^[\K" and paste("grep -oP '^[\\K\\d+(?=])' infile > outfile") succeeds. I know only a little about R; I don't know why my R and yours seem to behave differently.