Use bash to extract number from between square brackets

command-line bash python

8,708

Solution 1

You can achieve this with a single grep command. GNU grep accepts Perl-compatible regular expressions (-P), which support \K (drop everything matched so far from the reported match) and the zero-width lookahead assertion (?= ); the -o option prints only the matched text:

grep -oP '^\[\K\d+(?=\])' infile

As written, that will send the output to your terminal. To redirect it to a file, use:

grep -oP '^\[\K\d+(?=\])' infile > outfile

This method has the advantage of brevity and simplicity. It matches text that:

- is preceded by (\K) a [ character (\[) that appears at the beginning of a line (^). The \ is needed because [ otherwise has a special meaning in regular expressions;
- consists of one or more (+) digits (\d);
- is followed by ((?= )) a ] character (\]). As with [, the \ forces ] to be matched literally.
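
For readers who want to try the same pattern outside grep, here is a minimal Python sketch. Python's re module does not support \K, so a capturing group is swapped in for it; the function name leading_numbers and the sample lines are made up for illustration.

```python
import re

# Python's re has no \K, so a capturing group takes its place.
# The lookahead (?=\]) is kept as in the grep pattern.
PATTERN = re.compile(r'^\[(\d+)(?=\])')

def leading_numbers(lines):
    """Return the digits from every line that starts with [digits]."""
    found = []
    for line in lines:
        m = PATTERN.match(line)
        if m:
            found.append(m.group(1))
    return found

# Invented sample: two lines start with [number], one does not.
sample = ['[581]((501:0.00024264);', '0.0105973,66:0.0105973);', '[50]((1:0.5);']
print(leading_numbers(sample))
```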

Solution 2

Using sed:

< inputfile sed -n 's/^\[\([0-9]*\)\].*$/\1/p' > out

Command breakdown:

< inputfile: redirects the content of inputfile to stdin
-n: suppresses sed's automatic printing of every line
> out: redirects the content of stdout to out

Regex breakdown:

s: performs a substitution
/: starts the regex
^: matches the start of the line
\[: matches a [ character
\(: starts the capturing group
[0-9]*: matches any number of digits
\): stops the capturing group
\]: matches a ] character
.*: matches any number of any character
$: matches the end of the line
/: stops the regex / starts the replacement
\1: replaces with the first capturing group
/: stops the replacement
p: prints the line only if a substitution was made
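
The same substitute-then-print logic can be sketched in Python, assuming re.subn as a stand-in for sed's s command and a nonzero replacement count as a stand-in for the p flag. The function name sed_like_extract is hypothetical.

```python
import re

# Same regex as the sed command, written in Python syntax
# (plain parentheses form the capturing group here).
SED_PATTERN = re.compile(r'^\[([0-9]*)\].*$')

def sed_like_extract(lines):
    """Mimic sed -n 's/.../\\1/p': keep only lines where a substitution happened."""
    out = []
    for line in lines:
        line = line.rstrip('\n')
        replaced, count = SED_PATTERN.subn(r'\1', line)
        if count:  # plays the role of the p flag
            out.append(replaced)
    return out

print(sed_like_extract(['[581]((501:0.1);\n', '0.0105973);\n', '[50]((1:0.5);\n']))
```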

Using grep+tr (if you need a method that runs both on Ubuntu and on another OS whose grep doesn't support PCRE--otherwise, refer to Eliah Kagan's grep-only version):

< inputfile grep -o '^\[[0-9]*\]' | tr -d '[]' > out

Command breakdown:

< inputfile in grep: redirects the content of inputfile to stdin
-o in grep: prints only the match
-d in tr: deletes the characters
> out in tr: redirects the content of stdout to out

Regex breakdown:

^: matches the start of the line
\[: matches a [ character
[0-9]*: matches any number of digits
\]: matches a ] character
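
As a rough Python analogue of this two-step pipeline, the sketch below uses re.match in place of grep -o with an anchored pattern and str.translate in place of tr -d. The helper name match_then_strip is hypothetical.

```python
import re

# Deletion table equivalent to tr -d '[]'.
STRIP_BRACKETS = str.maketrans('', '', '[]')

def match_then_strip(lines):
    """Step 1: keep only the leading [digits] match. Step 2: delete the brackets."""
    out = []
    for line in lines:
        m = re.match(r'\[[0-9]*\]', line)  # re.match anchors at the start, like ^
        if m:
            out.append(m.group(0).translate(STRIP_BRACKETS))
    return out

print(match_then_strip(['[581]((501:0.1);', '0.0105973);', '[50]((1:0.5);']))
```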

Solution 3

The Perl way:

perl -ne 'print "$1\n" if /^\[([0-9]*)\].*/' testdata > out

or with awk:

awk 'match($0, /^\[[0-9]*\]/) {print substr($0, RSTART + 1, RLENGTH - 2)}' testdata > out

Regex used in both cases:

^\[[0-9]*\]

Explanation

/^\[[0-9]*\]/
- ^ assert position at start of the string
- \[ matches the character [ literally
- [0-9]* match a single character present in the list below
  - Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
  - 0-9 a single character in the range between 0 and 9
- \] matches the character ] literally
(source: debuggex.com)


Solution 4

Use this in Bash:

 grep -oh '\[[0-9].*\]' mytestfile | sed 's/.*\[\([^]]*\)\].*/\1/g' > myresultfile

Solution 5

A Python solution using the re module that considers two situations:

#!/usr/bin/env python2
import re
with open('/path/to/file.txt') as f:
    for line in f:
        digits_case_1 = re.search(r'(?<=^\[)\d+(?=\])', line)
        digits_case_2 = re.search(r'(?<=^\[)\d+(?=\].*\);$)', line)
        if digits_case_1:
            print 'Not considering ");" at end: ' + digits_case_1.group()
        if digits_case_2:
            print 'Considering ");" at end: ' + digits_case_2.group()

Output:

Not considering ");" at end: 581
Not considering ");" at end: 50
Considering ");" at end: 50

I have considered two situations here because your question is not entirely clear to me.

digits_case_1 prints the digits between [ ] at the start of a line, regardless of whether the line ends with );.
digits_case_2 prints the digits between [ ] at the start of a line only if the line also ends with );.
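
A Python 3 sketch of the first case that also writes the numbers to a file, as the question asks. The function name extract_to_file and the temporary file names are assumptions for the demo, not part of the answer above.

```python
import os
import re
import tempfile

LEADING = re.compile(r'^\[(\d+)\]')

def extract_to_file(in_path, out_path):
    """Write each leading bracketed number on its own line; return how many were written."""
    count = 0
    with open(in_path) as src, open(out_path, 'w') as dst:
        for line in src:
            m = LEADING.match(line)
            if m:
                dst.write(m.group(1) + '\n')
                count += 1
    return count

# Demo on a throwaway directory so no real files are touched.
with tempfile.TemporaryDirectory() as tmp:
    in_path = os.path.join(tmp, 'infile')
    out_path = os.path.join(tmp, 'outfile')
    with open(in_path, 'w') as f:
        f.write('[581]((501:0.1);\n0.0105973,66:0.0105973);\n[50]((1:0.5);\n')
    written = extract_to_file(in_path, out_path)
    with open(out_path) as f:
        result = f.read()

print(written, result.split())
```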


user3069326

Updated on September 18, 2022

Comments

user3069326 over 1 year
My file looks like this:
```
[581]((((((((501:0.00024264,451:0.00024264):0.000316197,310:0.000558837):0.00857295,((589:0.000409158,538:0.000409158):0.000658084,207:0.00106724
):0.00806454):0.0429702,(((198:0.00390205,91:0.00390205):0.016191,79:0.0200931):0.0147515,(187:0.00133008,50:0.00133008):0.0335145):0.0172574):0.
127506,((140:0.00253019,117:0.00253019):0.0533693,(((533:0.00728707,(463:8.80494e-05,450:8.80494e-05):0.00719902):0.0217722,389:0.0290593):0.0253
931,(((141:0.018004,107:0.018004):0.0143861,(111:0.00396127,(106:0.00161229,12:0.00161229):0.00234898):0.0284289):0.0145736,(129:0.0195982,((123:
0.0105973,66:0.0105973):0.0084867,10:0.019084):0.000514243):0.0273656):0.00748854):0.00144709):0.123708):0.000944439,((181:0.00108761,71:0.00108761):0.0819772);  
[50]((((((((501:0.00024264,451:0.00024264):0.000316197,310:0.000558837):0.00857295,((589:0.000409158,538:0.000409158):0.000658084,207:0.00106724):0.00806454):0.0429702,(((198:0.00390205,91:0.00390205):0.016191,79:0.0200931):0.0147515,(187:0.00133008,50:0.00133008):0.0335145):0.0172574):0.127506,((140:0.00253019,117:0.00253019):0.0533693,(((533:0.00728707,(463:8.80494e-05,450:8.80494e-05):0.00719902):0.0217722,389:0.0290593):0.0253931,(((141:0.018004,107:0.018004):0.0143861,(111:0.00396127,(106:0.00161229,12:0.00161229):0.00234898):0.0284289):0.0145736,(129:0.0195982,((123:0.0105973,66:0.0105973):0.0084867,10:0.019084):0.000514243):0.0273656):0.00748854):0.00144709):0.123708):0.000944439,((181:0.00108761,71:0.00108761):0.0819772);
```
Every new line starts with the pattern [number]. Every line ends with the pattern );.

I need to extract the numbers in the square brackets from the beginning of every line, and write them into a new file. I don't know how many lines the file has beforehand.
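
If the records really do wrap across several physical lines, as the sample above appears to, one possible reading is to treat ); as the record terminator instead of the newline. The sketch below follows that assumption only; the function name numbers_per_record and the shortened sample text are invented for illustration.

```python
import re

RECORD_START = re.compile(r'\[(\d+)\]')

def numbers_per_record(text):
    """Split on ');' and return the leading [number] of each record."""
    numbers = []
    for record in text.split(');'):
        # Remove the line breaks that wrap a record, then surrounding blanks.
        record = record.replace('\n', '').strip()
        m = RECORD_START.match(record)
        if m:
            numbers.append(m.group(1))
    return numbers

# Shortened, invented sample: the first record wraps over three lines.
wrapped = '[581]((501:0.1,\n451:0.2):0.3,\n(71:0.4);  \n[50]((1:0.5);\n'
print(numbers_per_record(wrapped))
```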
- heemayl about 9 years
  
  In your example, not every line starts with [num] and not every line ends with ); (perhaps lost while formatting). Please fix it.
- Eliah Kagan about 9 years
  
  @user3069326 Do you want to match only from lines that end in );? Out of the six lines you've shown us, only 2 end that way, only 2 start with a number in [ ] brackets, and only 1 line both starts with a number in [ ] brackets and ends in );. Or do you mean you actually want a method that ignores line breaks and instead treats ); like a line break? If you want to split on ); rather than newlines, you should either edit your question to clarify that, or post a new one--which might be better since there are already 5 answers posted based on your question as originally asked.
- Tim almost 9 years
  
  @User3069626 Don't damage posts please.
- user3069326 almost 9 years
  
  The questions were off-topic; it should be deleted.
- Tim almost 9 years
  
  @user no it shouldn't. We leave it around. Also, I've voted to reopen it. Just because it's closed doesn't mean it won't help someone.
- user3069326 almost 9 years
  
  No, this is against the rules. This should be deleted, and I will need to report that behaviour.
- Tim almost 9 years
  
  No, it shouldn't be deleted, it's not off topic (it seems to have been closed incorrectly). Do not attempt to delete this. Even if it is off topic it should not be deleted. If you continue to remove it I will flag for moderator.
m. öztürk about 9 years

Extra protection against multiple inputs (this is my habit in cases like this).
m. öztürk about 9 years

Nice, indeed. :)
Eliah Kagan about 9 years

You don't need to pipe from cat, since awk accepts an input filename as an argument: awk -F\] '{ print $1 }' $FILE | grep '\[' | tr -d '['
Eliah Kagan about 9 years

@kos I think the grep/tr way is valuable and interesting. Unless there's some bug in it, I hope you put it (or something like it) back--of course it's your choice. While the grep-only way I posted may often be preferable, sometimes one wishes to write a command that is portable to systems whose grep doesn't support Perl regular expressions (as is the case for most implementations other than GNU grep). You could separate the grep/tr way out into a separate, second section if you don't want it to distract from your primary recommended method.
kos about 9 years

@EliahKagan Yes, I removed it because your version was better, and because I thought that any grep version shipped with Ubuntu supports PCREs. But overall you're right, it definitely can't hurt, so I'm restoring it.
Byte Commander about 9 years

Maybe you could improve your answer by explaining what each part of your command line does? Thank you!
Eliah Kagan about 9 years

@user3069326 If R says Error: 'xyz' is an unrecognized escape in character string, you can escape the first character of the problematic text with a \. This fixes the problem fully or gives a new error about later text. In your case, put a \ before [. For me R does something else: paste("grep -oP '^[\K\d+(?=])' infile > outfile") produces Error: '\K' is an unrecognized escape in character string starting ""grep -oP '^[\K" and paste("grep -oP '^[\\K\\d+(?=])' infile > outfile") succeeds. I know only a little about R; I don't know why my R and yours seem to behave differently.