Remove lines from tab-delimited file with missing values
Solution 1
If your fields can never contain whitespace, an empty field means either a tab as a first character (^\t
), a tab as the last character (\t$
) or two consecutive tabs (\t\t
). You could therefore filter out lines containing any of those:
grep -Ev $'^\t|\t\t|\t$' file
If you can have whitespace, things get more complex. If your fields can begin with spaces, use this instead (it considers a field with only spaces to be empty):
grep -Pv '\t\s*(\t|$)|\t$|^\t' file
The change filters out lines matching a tab followed by 0 or more spaces and then either another tab or the end of the line.
That will also fail if the last field contains nothing but spaces. To avoid that too, use perl
with the -F
and -a
options to split input into the @F
array, telling it to print unless one of the fields is empty (/^$/
):
perl -F'\t' -lane 'print unless grep{/^$/} @F' file
Solution 2
With awk
:
awk -F"\t" '$1!=""&&$2!=""&&$3!=""' file
Actually it is that simple.
awk
splits the input at the field separator tab\t
specified with the-F
flag. This could also be omitted, when your content has no spaces in the fields.$1!=""&&...
is a condition. When this condition is true,awk
simply prints the line. You could also write'$1!=""&&$2!=""&&$3!=""{print}'
, but that's not necessary. Awks default behavior is to print the line, when no action is given. Here, that condition is true when the fields$1
,$2
and$3
all are not empty, hence when the first 3 fields have a value.
To write to another file use this:
awk -F"\t" '$1!=""&&$2!=""&&$3!=""' input_file >output_file
Edit: With an undefined number of columns you could use this awk
, it check every field in the line:
awk -F"\t" '{for(i=1;i<=NF;i++){if($i==""){next}}}1' file
Solution 3
...for any of the below to work you must first do...
t=$(printf \\t) ### because it's hard to demo CTRL+V TAB
...now, with a POSIX grep
...
grep -Ev "^$t+|$t($t|$)" <in >out
grep
will select lines which do not match the pattern - which uses the |
or metacharacter to denote a ^
head-of-line tab, or two consecutive tabs, or a $
tail-of-line tab - which are the only possible fail cases as near as I can tell.
without the -v
negation switch it might be:
grep -E "([^$t]+$t){2}[^$t]" <in >out
...which specifies an {
occurrence count}
for the (
pattern group)
of +
one or more chars in the [
class]
of characters which are ^
not tabs followed by a tab.
...or with a POSIX sed
...
sed -ne"s/[^$t][^$t]*/&/3p" <in >out
...or...
sed -ne"s/[^$t]\{1,\}/&/3p" <in >out
...or w/ GNU or BSD sed
s...
sed -Ene"s/[^$t]+/&/3p" <in >out
...where sed
does -n
ot print by default any line unless it can s///
ubstitute for &
itself the third occurrence on a line of the longest possible sequence of at least one [^
not tab]
character.
(using literal tabs should be preferred for portability. the original version of this answer used \
backslash escapes and it was not helpful. definitely using \
backslash escapes in a [
character class]
will limit the applicability of your code.)
Solution 4
awk 'NF==3' file
Prints a line if number of fields equals 3. Pretty straightforward to change number of columns according to your data.
However as pointed out, this doesn't work with OP's requirement of variable number of fields.
Related videos on Youtube
Gaius Augustus
Current Position: Graduate Student in Cancer Biology Graduate Interdisciplinary Program at University of Arizona Current Project: Genomics of colorectal cancer in African Americans Past Awesomeness: Filmmaker from 1998-2006 Digital Artist from 2001-current Freelance photographer from 2009-2014 Web designer and developer from 2001-current
Updated on September 18, 2022Comments
-
Gaius Augustus over 1 year
I have a large (~900MB) tab-delimited text file that I will be processing in a downstream program. I need to delete any line with a missing value. The correct number of columns are on each line (so a missing value would correspond to 2 tabs).
Note: My actual data has ~2 million lines and 80-300 columns. Possible characters are a-z A-Z 0-9 - (hyphen) _ (underscore) and tab (delimited). No spaces or special characters are in the file.
I am new to this kind of scripting, so an explanation of any code provided would be appreciated. I normally use R, but my filesizes have outgrown the data manipulation functionality of R.
How can I at the terminal (or within a shell script) delete lines with missing values from a file (e.g. using
sed
)?Example Input File:
Col1 Col2 Col3 A B C D F G H I J K
Example Output File:
Col1 Col2 Col3 A B C G H I
-
mikeserv over 8 years@terdon - it works... (you might try using literal tabs though, as recommended. i have an idea).
-
terdon over 8 yearsNope. Still fails if the 1st field is empty. Actually, it looks like
grep -E
can't understand\t
. Tryprintf 'foo\tbar\n' | grep -E '\t'
. You probably need to use$'\t'
instead. -
terdon over 8 yearsThis fails if the missing field is the last one because your
split
will also include the trailing\n
in the field. -
mikeserv over 8 years@terdon - it works. even if the first field is empty. you should try using an actual tab.
-
terdon over 8 yearsI was using actual tabs, I don't know what you were using. It does work now that you've used a variable for the tab.
-
terdon over 8 years+1 but why not just negate the
[[:space:]]
class instead of attempting to define what's allowed? What if a field contains non alphanumeric characters? Also, you might want to use an actual tab instead of just whitespace since fields could contains spaces. Something like this would be safer:grep '^[^[:space:]]\+'$'\t''[^[:space:]]\+'$'\t''[^[:space:]]\+$' file
. -
mikeserv over 8 years@terdon... that's really weird... it doesn't help that the data comes in spaces at an off-8 setting.
-
terdon over 8 yearsThe OP said the data was tab separated. Don't copy/paste from the question, run it through
sed 's/ */\t/g'
, You might also want to usegrep -Ev ^$'\t''+|'$'\t''('$'\t''|$)' file
or something, to avoid the variable. -
mikeserv over 8 years@terdon - i actually used
unexpand -a
- which is an expressly purposed command for compressing tabs at 8-char tab points - that's all it does (unless you specify-t[tab size/list]
). Anyway, usingsed
like that doesn't squeeze the spaces at tabs - it just squeezes spaces to tabs. as for the$''
thing... um, no thanks. using a variable is how i prefer to do it. still:grep -Ev $'^\t+|\t(\t|$)'
should be fine... -
Smiley over 8 years@terdon A valid observation and a nice alternative solution! Personally, I wanted to keep it simple. Moreover, the OP says he is "new to this kind of scripting". Nevertheless, could you please explain what the intermediate
$
signs in yourgrep
does? -
terdon over 8 yearsThey let
grep
understand tabs. Vanillagrep
doesn't understand\t
as a tab but does understand$'\t'
since that's expanded by the shell and notgrep
itself. I just learned, however, that you don't need to break the quotes each time as I did, you can just add the$
at the beginning:grep $'^[^[:space:]]\+\t[^[:space:]]\+\t[^[:space:]]\+$' file
. This will still break if any of the fields consists entirely of spaces though. -
mikeserv over 8 years@terdon -
grep
does understand tabs. somegrep
s don't understand backslash+t. -
Gilles 'SO- stop being evil' over 8 yearsThe fields are tab-delimited, not whitespace-delimited. Your approach can fail if a field contains a space.
-
Gilles 'SO- stop being evil' over 8 yearsWhat's all this about
[[:alnum:]]
and[[:space:]]
? The question specifies tab-delimited fields. -
Sobrique over 8 yearsBut none the less, works fine given the example data.
-
Gaius Augustus over 8 yearsIf there are more than 3 columns, it is necessary to list all columns, correct? My data may have anywhere from 75-500 columns, so while a great answer for my toy data, may not work as is for my actual data. Any thoughts?
-
Gaius Augustus over 8 yearsMy data has no spaces in it, so this may work. So, I just need to make
@F==
set to the number of columns in my file, correct? Can I then output this to a file?perl -lane 'print if @F==n' > output.txt
? -
Gaius Augustus over 8 yearsI like that the perl answer is simple to understand and works no matter how many columns I have.
-
Gaius Augustus over 8 years@terdon I just tried to run the perl on my data. I happen to know that about 700,000 lines do NOT contain missing data. I used
wc -l
to compare the number of lines, and this version of the perl did not remove any lines. The answer further downperl -lane 'print if @F==81' file
removed the correct number of lines. I'd like to be able to use this if possible though (I need to do this again and would like something that works with any number of columns)...any thoughts? -
mikeserv over 8 years@GaiusAugustus - the first
grep
in my answer should work regardless of column count - if there are two adjacent tabs anywhere on a line it will not print the line. im curious to know what it would do w/wc -l
. it should also be considerably faster thanawk
orperl
. -
Gaius Augustus over 8 yearsI had to reread the first grep a few times, but I now understand it. I tried it, and it outputs the correct number of rows. Thank you for the clarification.
-
Gaius Augustus over 8 yearsTried your first grep
grep -Ev "^$t+|$t($t|$)" <in.txt >out.txt
and the output ofwc -l
is 0. Then tried your commented:grep -Ev $'^\t+|\t(\t|$)' <in.txt >out.txt
and the output ofwc -l
is the original number of lines (~2 million). -
mikeserv over 8 years@GaiusAugustus - the output would be zero if you didn't first do
t=$(printf \\t)
. the second thing i cant explain and doesn't make much sense to me. it shouldnt work like that at all - especially if thegrep -P
thing did work. unless there is some problem w/$''
- which is possible - and which is why i recommendt=$(printf \\t)
first. so:t=$(printf \\t);grep -Ev "^$t+|$t($t|$)" <in.txt >out.txt
-
Gaius Augustus over 8 years@mikeserv Sorry, I did the
t=$(printf \\t)
and then your first grep. Output ofwc -l
is still the original number of rows (~2 million) -
mikeserv over 8 years@GaiusAugustus - in that case im wrong. and i dunno why. thats annoying. thanks for trying. its really strange in that the
grep -P
did work - which is essentially the same as this answer.... hey wait? where the fields are blank - are there spaces? because that would explain it - and it would explain why the input was at an off-8 setting. -
mikeserv over 8 years@GaiusAugustus - if I'm right then
t=$(printf \\t);grep -Ev " ($t|$)" <in.txt >out.txt
will work for you. if not im stumped. -
Sobrique over 8 yearsYes. Correct. You could probably auto set it based on the header row. Something like '$n //= @F; print if @F == $n'
-
chaos over 8 years@GaiusAugustus See my edit.
-
terdon over 8 yearsThis doesn't work. If a field is empty,
NF
will still be 3. It looks like it works because you're not using the correct delimiter. The OP has tab-separated data so there could be spaces within the fields. Your approach will skip such lines. Try, for example: ` printf 'a b\tc\td\n' | awk 'NF==3'. That prints a line with 3 fields but your
awk` will think there are 4. -
Joe over 8 yearsQuote: "Possible characters are a-z A-Z 0-9 - (hyphen) _ (underscore) and tab (delimited). No spaces or special characters are in the file."
-
terdon over 8 yearsAh, damn, good point. That wasn't in the original version of the question which is why I hadn't posted precisely this answer :).
-
terdon over 8 yearsI'm afraid this still fails if the last field contains spaces but the OP has now clarified that is not a possibility.
-
Joe over 8 yearsOh right, that's probably why every other answer has such a complex approach. :)
-
terdon over 8 yearsThere's also the point the OP made in this comment. His lines have a variable number of fields, so this won't work on his actual data.
-
terdon over 8 years@GaiusAugustus the perl won't change the original file, you need to redirect its output to a new file or use
-i
. However, now that you've specified that your data can never contain spaces, you can simply usetime grep -Ev $'^\t|\t\t|\t$' file1> out4
which is blindingly fast and works on an arbitrary number of fields. -
rotarydial over 3 yearsThis was helpful, but didn't work for me when checking whether the last column was null in a text file from Windows. In my case, the ninth column was last and was what I needed to skip when null. The condition
$9!="\r"
was the only thing that worked.