Use bash to extract number from between square brackets
Solution 1
You can achieve this with a single grep command. GNU grep lets you use a Perl-compatible regular expression (-P), which supports \K (which discards everything matched so far from the reported match) and zero-width lookahead assertions ((?= )). The -o option prints only the matched text:
grep -oP '^\[\K\d+(?=\])' infile
As written, that will send the output to your terminal. To redirect it to a file, use:
grep -oP '^\[\K\d+(?=\])' infile > outfile
This method has the advantage of brevity and simplicity. It matches text that
- is preceded by (\K) a [ character (\[) that appears at the beginning of a line (^). The \ is needed because [ otherwise has a special meaning in regular expressions;
- consists of one or more (+) digits (\d);
- is followed by ((?= )) a ] character (\]). As with [, the \ forces ] to be matched literally.
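As a quick check of the pattern, here is a small sketch that runs the same command against a throwaway sample file. The file contents and the variable names sample and numbers are made up for illustration, and GNU grep with -P support is assumed.

```shell
# Hypothetical sample input: two lines that start with [number], one that does not.
sample=$(mktemp)
printf '%s\n' '[581]((501:0.1,451:0.2):0.3);' 'no brackets here [7]' '[50](198:0.4);' > "$sample"

# Same pattern as above: \K drops the leading "[" from the reported match,
# and (?=\]) requires a "]" right after the digits without including it.
numbers=$(grep -oP '^\[\K\d+(?=\])' "$sample")
printf '%s\n' "$numbers"

rm -f "$sample"
```

The middle line is skipped because its bracket is not at the start of the line, so only the numbers from the first and third lines are printed.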
Solution 2
Using sed:
< inputfile sed -n 's/^\[\([0-9]*\)\].*$/\1/p' > out
Command breakdown:
- < inputfile: redirects the content of inputfile to stdin
- -n: suppresses sed's default printing of every line
- > out: redirects the content of stdout to out
Regex breakdown:
- s: performs a substitution
- /: starts the regex
- ^: matches the start of the line
- \[: matches a [ character
- \(: starts the capturing group
- [0-9]*: matches any number of digits
- \): stops the capturing group
- \]: matches a ] character
- .*: matches any number of any character
- $: matches the end of the line
- /: stops the regex / starts the replacement
- \1: replaces with the first capturing group
- /: stops the replacement
- p: prints only the matching lines
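One detail of the breakdown is worth trying out: since [0-9]* matches any number of digits, it also matches none. The sketch below uses a made-up sample file (the names sample, loose and strict are only for illustration) to compare the expression above with a variant that requires at least one digit. The variant is an assumption about what may be wanted, not part of the original answer.

```shell
# Hypothetical sample input, including a line with empty brackets.
sample=$(mktemp)
printf '%s\n' '[581]((501:0.1):0.3);' '[](empty brackets);' '[50](198:0.4);' > "$sample"

# Expression from the answer: [0-9]* accepts zero digits, so "[]" yields an empty line.
loose=$(sed -n 's/^\[\([0-9]*\)\].*$/\1/p' < "$sample")

# Variant requiring at least one digit ([0-9][0-9]* is the portable BRE spelling of "one or more").
strict=$(sed -n 's/^\[\([0-9][0-9]*\)\].*$/\1/p' < "$sample")

printf 'loose:\n%s\nstrict:\n%s\n' "$loose" "$strict"
rm -f "$sample"
```

If every line in the real file is known to carry a number, both forms behave the same.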
Using grep + tr (if you need a method that runs both on Ubuntu and on another OS whose grep doesn't support PCRE; otherwise, refer to Eliah Kagan's grep-only version):
< inputfile grep -o '^\[[0-9]*\]' | tr -d '[]' > out
Command breakdown:
- < inputfile in grep: redirects the content of inputfile to stdin
- -o in grep: prints only the match
- -d in tr: deletes the characters
- > out in tr: redirects the content of stdout to out
Regex breakdown:
- ^: matches the start of the line
- \[: matches a [ character
- [0-9]*: matches any number of digits
- \]: matches a ] character
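To make the two stages of the pipeline visible, the following sketch captures the intermediate result of grep before tr is applied. The sample data and the variable names sample, bracketed and digits are assumptions for illustration only.

```shell
# Hypothetical sample input with two bracketed line prefixes.
sample=$(mktemp)
printf '%s\n' '[581]((501:0.1):0.3);' '[50](198:0.4);' > "$sample"

# Step 1: grep -o keeps only the bracketed prefix of each matching line.
bracketed=$(grep -o '^\[[0-9]*\]' < "$sample")

# Step 2: tr -d removes the bracket characters, leaving the digits.
digits=$(printf '%s\n' "$bracketed" | tr -d '[]')

printf '%s\n' "$bracketed" "$digits"
rm -f "$sample"
```

Because this uses only basic regular expressions, it should also work with grep implementations that lack -P.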
Solution 3
The perl way:
perl -ne 'print "$1\n" if /^\[([0-9]*)\].*/' testdata > out
or with awk:
awk 'match($0, /^\[[0-9]*\]/) {print substr($0, RSTART + 1, RLENGTH - 2)}' testdata > out
Regex used in both cases:
^\[[0-9]*\]
Explanation
- /^\[[0-9]*\]/
  - ^ asserts position at the start of the string
  - \[ matches the character [ literally
  - [0-9]* matches a single character present in the list below
    - Quantifier: * Between zero and unlimited times, as many times as possible, giving back as needed [greedy]
    - 0-9 a single character in the range between 0 and 9
  - \] matches the character ] literally
(source: debuggex.com)
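The awk version relies on match() setting RSTART (the 1-based position where the match begins) and RLENGTH (the length of the match). A minimal sketch, using a made-up input line and the hypothetical variable names line and result, prints both values next to the extracted number to show why the offsets are + 1 and - 2:

```shell
# Hypothetical single input line.
line='[581]((501:0.1):0.3);'

# For the match "[581]", RSTART is 1 and RLENGTH is 5. Starting one character
# later and taking two characters fewer skips "[" and "]" and leaves the digits.
result=$(printf '%s\n' "$line" |
  awk 'match($0, /^\[[0-9]*\]/) {print RSTART, RLENGTH, substr($0, RSTART + 1, RLENGTH - 2)}')

printf '%s\n' "$result"
```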
Solution 4
Use this in Bash:
grep -oh '\[[0-9].*\]' mytestfile | sed 's/.*\[\([^]]*\)\].*/\1/g' > myresultfile
Solution 5
Python solution using the re module, considering two situations:
#!/usr/bin/env python2
import re
with open('/path/to/file.txt') as f:
    for line in f:
        digits_case_1 = re.search(r'(?<=^\[)\d+(?=\])', line)
        digits_case_2 = re.search(r'(?<=^\[)\d+(?=\].*\);$)', line)
        if digits_case_1:
            print 'Not considering ");" at end: ' + digits_case_1.group()
        if digits_case_2:
            print 'Considering ");" at end: ' + digits_case_2.group()
Output:
Not considering ");" at end: 581
Not considering ");" at end: 50
Considering ");" at end: 50
Here I have considered two situations, as your question does not seem clear to me.
digits_case_1 will print the digits matched between [] at the start of the line; it does not consider whether the line ends with ); or not.
digits_case_2 will print the digits between [] at the start of the line only if the line ends with );.
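For comparison, the same two situations can be sketched in the shell with sed. This is an assumed equivalent, not part of the Python answer; the sample lines and the names sample, case_1 and case_2 are hypothetical.

```shell
# Hypothetical sample input: the first line does not end with ");", the second does.
sample=$(mktemp)
printf '%s\n' '[581]((501:0.1):0.3' '[50](198:0.4);' > "$sample"

# Case 1: number in brackets at the start of the line, however the line ends.
case_1=$(sed -n 's/^\[\([0-9][0-9]*\)\].*$/\1/p' "$sample")

# Case 2: same, but only when the line also ends with ");".
case_2=$(sed -n 's/^\[\([0-9][0-9]*\)\].*);$/\1/p' "$sample")

printf 'case 1:\n%s\ncase 2:\n%s\n' "$case_1" "$case_2"
rm -f "$sample"
```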
user3069326
Updated on September 18, 2022
Comments
-
user3069326 over 1 year
My file looks like this:
[581]((((((((501:0.00024264,451:0.00024264):0.000316197,310:0.000558837):0.00857295,((589:0.000409158,538:0.000409158):0.000658084,207:0.00106724 ):0.00806454):0.0429702,(((198:0.00390205,91:0.00390205):0.016191,79:0.0200931):0.0147515,(187:0.00133008,50:0.00133008):0.0335145):0.0172574):0. 127506,((140:0.00253019,117:0.00253019):0.0533693,(((533:0.00728707,(463:8.80494e-05,450:8.80494e-05):0.00719902):0.0217722,389:0.0290593):0.0253 931,(((141:0.018004,107:0.018004):0.0143861,(111:0.00396127,(106:0.00161229,12:0.00161229):0.00234898):0.0284289):0.0145736,(129:0.0195982,((123: 0.0105973,66:0.0105973):0.0084867,10:0.019084):0.000514243):0.0273656):0.00748854):0.00144709):0.123708):0.000944439,((181:0.00108761,71:0.00108761):0.0819772); [50]((((((((501:0.00024264,451:0.00024264):0.000316197,310:0.000558837):0.00857295,((589:0.000409158,538:0.000409158):0.000658084,207:0.00106724):0.00806454):0.0429702,(((198:0.00390205,91:0.00390205):0.016191,79:0.0200931):0.0147515,(187:0.00133008,50:0.00133008):0.0335145):0.0172574):0.127506,((140:0.00253019,117:0.00253019):0.0533693,(((533:0.00728707,(463:8.80494e-05,450:8.80494e-05):0.00719902):0.0217722,389:0.0290593):0.0253931,(((141:0.018004,107:0.018004):0.0143861,(111:0.00396127,(106:0.00161229,12:0.00161229):0.00234898):0.0284289):0.0145736,(129:0.0195982,((123:0.0105973,66:0.0105973):0.0084867,10:0.019084):0.000514243):0.0273656):0.00748854):0.00144709):0.123708):0.000944439,((181:0.00108761,71:0.00108761):0.0819772);
Every new line starts with the pattern [number]. Every line ends with the pattern );.
I need to extract the numbers in the square brackets from the beginning of every line, and write them into a new file. I don't know how many lines the file has beforehand.
-
heemayl about 9 years
In your example, not every line starts with [num] and not every line ends with );. Perhaps this was lost in formatting. Please correct it.
-
Eliah Kagan about 9 years
@user3069326 Do you want to match only from lines that end in );? Out of the six lines you've shown us, only 2 end that way, only 2 start with a number in [ ] brackets, and only 1 line both starts with a number in [ ] brackets and ends in );. Or do you mean you actually want a method that ignores line breaks and instead treats ); like a line break? If you want to split on ); rather than newlines, you should either edit your question to clarify that, or post a new one, which might be better since there are already 5 answers posted based on your question as originally asked.
-
Tim almost 9 years
@User3069626 Don't damage posts please.
-
user3069326 almost 9 years
The question was off topic; it should be deleted.
-
Tim almost 9 years
@user No, it shouldn't. We leave it around. Also, I've voted to reopen it. Just because it's closed doesn't mean it won't help someone.
-
user3069326 almost 9 years
No, this is against the rules. This should be deleted, and I will need to report that behaviour.
-
Tim almost 9 years
No, it shouldn't be deleted, it's not off topic (it seems to have been closed incorrectly). Do not attempt to delete this. Even if it is off topic it should not be deleted. If you continue to remove it I will flag for moderator.
-
m. öztürk about 9 years
Extra protection against multiple inputs (this is my habit in cases like this).
-
m. öztürk about 9 years
Nice, indeed. :)
-
Eliah Kagan about 9 years
You don't need to pipe from cat, since awk accepts an input filename as an argument: awk -F\] '{ print $1 }' $FILE | grep '\[' | tr -d '['
-
Eliah Kagan about 9 years
@kos I think the grep/tr way is valuable and interesting. Unless there's some bug in it, I hope you put it (or something like it) back; of course it's your choice. While the grep-only way I posted may often be preferable, sometimes one wishes to write a command that is portable to systems whose grep doesn't support Perl regular expressions (as is the case for most implementations other than GNU grep). You could separate the grep/tr way out into a separate, second section if you don't want it to distract from your primary recommended method.
-
kos about 9 years
@EliahKagan Yes, I removed it because your version was better; I also thought that any grep version built into Ubuntu supports PCREs. But overall you're right, it definitely can't hurt, so I'm restoring it.
Byte Commander about 9 years
Maybe you could improve your answer by explaining what each part of your command line does? Thank you!
-
Eliah Kagan about 9 years
@user3069326 If R says Error: 'xyz' is an unrecognized escape in character string, you can escape the first character of the problematic text with a \. This fixes the problem fully or gives a new error about later text. In your case, put a \ before [. For me R does something else: paste("grep -oP '^[\K\d+(?=])' infile > outfile") produces Error: '\K' is an unrecognized escape in character string starting ""grep -oP '^[\K" and paste("grep -oP '^[\\K\\d+(?=])' infile > outfile") succeeds. I know only a little about R; I don't know why my R and yours seem to behave differently.