Parse lines with specific pattern out of file

command-line bash perl

5,082

Solution 1

Using grep:

$ grep "^\[[0-9]\+\]:" file.txt 
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001
[29]:((962:0.000580339,930:0.000580339):0.00543993 ((758:0.000598847,726:0.000598847)

To save the output in a file (output.txt):

grep "^\[[0-9]\+\]:" file.txt > output.txt

Using python:

#!/usr/bin/env python2
import re
with open('/path/to/file.txt') as f:
  print '\n'.join([line.rstrip() for line in f if re.search(r'^\[\d+\]:', line)])
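
The script above targets Python 2 (note the `print` statement). A minimal Python 3 sketch of the same filter follows. The names `matching_lines` and `filter_file` are hypothetical, chosen only for illustration. It streams the input line by line instead of building one large string.

```python
#!/usr/bin/env python3
import re

# Same idea as the grep pattern: "[", one or more digits, "]" and ":" at line start.
PATTERN = re.compile(r"^\[[0-9]+\]:")


def matching_lines(lines):
    """Yield each line that starts with [digits]:, minus its trailing newline."""
    for line in lines:
        if PATTERN.search(line):
            yield line.rstrip("\n")


def filter_file(src_path, dst_path):
    """Copy only the matching lines from src_path to dst_path (hypothetical helper)."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in matching_lines(src):
            dst.write(line + "\n")


if __name__ == "__main__":
    sample = [
        "[25]:0.00843832,469:0.0109533)\n",
        "position:\n",
        "sites: 5 4 2 1\n",
        "[29]:((962:0.000580339\n",
    ]
    for line in matching_lines(sample):
        print(line)
```

Calling `filter_file('/path/to/file.txt', 'output.txt')` would write the matching lines to `output.txt`; both paths are placeholders.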

Solution 2

The perl way:

perl -ne 'print "$1\n" if /^(\[[0-9]+\]:.*)/' testdata > out

The awk way:

awk 'match($0, /^\[[0-9]+\]:/)' testdata > out

Output for both commands:

[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001
[29]:((962:0.000580339,930:0.000580339):0.00543993 ((758:0.000598847,726:0.000598847)

Solution 3

This task is perfectly suited for grep, because you're just checking which lines contain a match for a pattern and printing the lines that do.

heemayl's way is excellent. Here's another that's similar but uses Perl regular expression syntax (which GNU grep supports, with -P), for a shorter and slightly simpler pattern:

grep -P '^\[\d+\]:' infile

That just prints the output, but you can redirect it to outfile:

grep -P '^\[\d+\]:' infile > outfile

In Perl regular expressions, \d matches any single digit, same as [0-9] or [[:digit:]].
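
To see the equivalence of `\d` and `[0-9]` outside of grep, here is a small Python sketch. One assumption to be aware of: Python's `re` module treats `\d` as any Unicode decimal digit for `str` patterns by default, so `re.ASCII` is passed to restrict it to `0` through `9`. The sample lines are made up for the demonstration.

```python
import re

# Made-up sample lines, loosely modelled on the file in the question.
SAMPLE = [
    "[25]:0.00843832,469:0.0109533)",
    "position:",
    "sites: 5 4 2 1",
    "[abc]:not numeric",
    "0101001010",
]

# re.ASCII makes \d match only 0-9, mirroring the bracket expression.
with_d = re.compile(r"^\[\d+\]:", re.ASCII)
with_class = re.compile(r"^\[[0-9]+\]:")

matches_d = [s for s in SAMPLE if with_d.search(s)]
matches_class = [s for s in SAMPLE if with_class.search(s)]

# Both patterns select the same lines.
print(matches_d == matches_class)
print(matches_d)
```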

In case you're interested, here's a sed way:

sed -nr '/^\[[0-9]+\]:/p' infile

sed -nr '/^\[[0-9]+\]:/p' infile > outfile

That checks each line to see if it matches ^\[[0-9]+\]:. If it does, the sed command p is used to print the line. The -n flag prevents any lines from being printed except as provided for explicitly by the sed script.
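
The interaction between -n and p can be sketched in Python. This is only an emulation for illustration, not how sed is implemented, and the function name `emulate_sed_p` is hypothetical. Without -n, sed prints every line at the end of its cycle, so a matching line appears twice: once from p and once from the automatic print.

```python
import re


def emulate_sed_p(lines, pattern, quiet):
    """Roughly mimic `sed [-n] -r '/pattern/p'` on an iterable of lines."""
    regex = re.compile(pattern)
    out = []
    for line in lines:
        if regex.search(line):
            out.append(line)  # the explicit p command
        if not quiet:
            out.append(line)  # automatic print, suppressed by -n
    return out


sample = ["[25]:0.1", "position:", "[29]:0.2"]
pattern = r"^\[[0-9]+\]:"

# With -n: only the matching lines.
print(emulate_sed_p(sample, pattern, quiet=True))
# Without -n: every line once, matching lines twice.
print(emulate_sed_p(sample, pattern, quiet=False))
```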

user3069326

Updated on September 18, 2022

Comments

user3069326 almost 2 years
I have a file that looks roughly like this:
```
[25]:0.00843832,469:0.0109533):0.00657864,((((872:0.00120503,((980:0.0001
[29]:((962:0.000580339,930:0.000580339):0.00543993 ((758:0.000598847,726:0.000598847)
position:
sites: 5 4 2 1 3 4 543 5  67 657  78 67 8  5645 6 
01010010101010101010101010101011111100011
1111010010010101010101010111101000100000
00000000000000011001100101010010101011111
```
Now I would like to extract only those lines which start with [numeric]: from the file. It is not always just the first two; it could also be the first 7 or 8 or any other number. How would I read in this file and write an output file containing only the lines that start with [numeric]:?
- Tim about 9 years
  
  Please don't damage posts.
- user3069326 about 9 years
  
  this shoudl eb deleted
- Tim about 9 years
  
  No. It. Shouldn't. What is wrong with it? Do not attempt to remove valid posts, it's vandalism. It's also spelt "This should be deleted".
Eliah Kagan about 9 years

In addition to [non-numeric] (as you say), this will also print lines containing only [, or starting with [ with no matching ] to close it or no : afterwards. (That might or might not be considered desirable.)
boardrider about 9 years

@Eliah, as user3069326 knows the structure of the input file, he's in a good position to ascertain if my suggestion is valid.