Removing lines with more or fewer than 'N' fields?
Solution 1
You almost have it already:
awk -F'\t' 'NF==13 {print}' infile > newfile
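And, if you're on one of those systems where you're charged by the keystroke ( :) ), you can shorten that to the following, because printing the line is awk's default action when a pattern has no action block: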
awk -F'\t' 'NF==13' infile > newfile
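As a quick sanity check, the filter can be tried on a tiny throwaway file first. This is only a sketch: the sample data is made up, it uses 3 fields instead of 13 to keep the lines short, and it runs in a scratch directory.

```shell
# Work in a scratch directory so no real data is touched.
cd "$(mktemp -d "${TMPDIR:-/tmp}/nfdemo.XXXXXX")" || exit 1

# Hypothetical sample, 3 fields expected:
# line 2 has an extra tab, line 3 has too few fields.
printf 'a\tb\tc\n'    >  infile
printf 'a\tb\tc\td\n' >> infile
printf 'a\tb\n'       >> infile
printf 'x\ty\tz\n'    >> infile

# Same filter as above, with 3 in place of 13.
awk -F'\t' 'NF==3' infile > newfile

cat newfile
```

Only the first and last lines should survive; both the line with too many fields and the line with too few are dropped.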
To do multiple files in one sweep, and to actually change the files (and not just create new files), identify a filename that's not in use (for example, scharf), and perform a loop, like this:
for f in list
do
    awk -F'\t' 'NF==13 {print}' "$f" > scharf && mv -f -- scharf "$f"
done
The list can be one or more filenames and/or wildcard filename expansion patterns; for example,
for f in blue.data green.data *.dat orange.data red.data /ultra/violet.dat
The mv command overwrites the input file (e.g., blue.data) with the temporary scharf file (which has only the lines from the input file with 13 fields). (Be sure this is what you want to do, and be careful. To be safe, you should probably back up your data first.)
The -f tells mv to overwrite the input file, even though it already exists.
The -- protects you against weirdness if any of your files has a name beginning with -.
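If picking an unused filename by hand feels fragile, one possible variation is to let mktemp choose the temporary name. The following is a sketch under assumptions, not part of the original answer: the sample files are made up, 3 fields stand in for 13, and everything runs in a scratch directory.

```shell
# Work in a scratch directory so no real data is touched.
cd "$(mktemp -d "${TMPDIR:-/tmp}/nfdemo.XXXXXX")" || exit 1

# Hypothetical sample files (3 fields expected instead of 13).
printf 'a\tb\tc\na\tb\tc\td\n' > blue.data
printf 'x\ty\nx\ty\tz\n'       > green.data

for f in blue.data green.data
do
    # Let mktemp pick an unused name in the current directory
    # instead of hard-coding one.
    tmp=$(mktemp "./filtered.XXXXXX") || exit 1
    if awk -F'\t' 'NF==3' "$f" > "$tmp"
    then
        mv -f -- "$tmp" "$f"
    else
        rm -f -- "$tmp"   # awk failed: leave the original untouched
    fi
done

cat blue.data green.data
```

Note that mktemp creates its file with restrictive permissions, so the replaced file may end up with different permissions than the original had.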
Solution 2
Since this is a large file, it may be worth using a slightly more complex tool for a performance gain. Usually, specialized tools are faster than generalist tools. For example, solving the same problem with cut tends to be faster than grep, which tends to be faster than sed, which tends to be faster than awk (the flip side being that later tools can do things that earlier ones can't).
You want to remove lines with 13 tab characters or more, that is, lines with more than 13 fields (this does not catch lines with fewer than 13 fields), so:
LC_ALL=C grep -Ev '(␉.*){13}'
or maybe (I don't expect a measurable performance difference)
LC_ALL=C grep -Ev '(␉.*){12}␉'
where ␉ is a literal tab character. Setting the locale to C isn't necessary, but speeds up some versions of GNU grep compared with multibyte locales.
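Typing a literal tab inside quotes can be awkward in an interactive shell. One way around that, sketched here with made-up sample data and 3 expected fields instead of 13, is to put the tab into a variable with printf and build the pattern from it:

```shell
# Work in a scratch directory so no real data is touched.
cd "$(mktemp -d "${TMPDIR:-/tmp}/nfdemo.XXXXXX")" || exit 1

# Hypothetical sample, 3 fields expected: 2 tabs is valid, 3 or more is too many.
printf 'a\tb\tc\na\tb\tc\td\na\tb\n' > infile

tab=$(printf '\t')   # a literal tab character, kept in a variable

# Drop lines containing 3 or more tabs (the {13} above becomes {3} here).
LC_ALL=C grep -Ev "(${tab}.*){3}" infile > newfile

cat newfile
```

The short line `a␉b` is kept, since this pattern only rejects lines with too many tabs.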
Solution 3
With perl:
perl -F'\t' -anle 'print if @F == 13' file
To edit in place, add the -i option:
perl -i.bak -F'\t' -anle 'print if @F == 13' file
T. Scharf
Updated on September 18, 2022
Comments
-
T. Scharf over 1 year
I am working on mac with sed, perl, awk, bash..
I have a large-ish (10GB) text file which has 13 fields (columns) of TAB delimited data. Unfortunately some of these lines have extraneous TABs, so I want to delete the entire line where we have extra TABs, and thus unequal fields. (I don't mind discarding the lines in their entirety)
What I currently have writes the number of fields into another file.
awk -F'\t' '{print NF}' infile > fieldCount
head fieldCount
13
13
10
13
13
13
14
13
13
13
I would like to construct a short script that removes any line with more (or less) than 13 proper fields (from the original file).
- speed is helpful as I have to do this on multiple files
- doing it in one sweep would be cool
- I am currently porting the fieldCount file into Python, trying to load it line by line.
EDIT:
valid (13 columns)
a b c d e f g h i j k l m
invalid (14 columns)
a b c d e f g h i j k l m n
-
Admin over 9 years
Can you give an example of a valid line and an invalid line?
-
T. Scharf over 9 years
just a little push was all i needed -thx partner
-
cuonglm over 9 years
@T.Scharf: I think mine is better if you want to process multiple files. But feel free to choose whichever works best for you.