Remove lines from tab-delimited file with missing values

Solution 1

If your fields can never contain whitespace, an empty field means either a tab as the first character (^\t), a tab as the last character (\t$), or two consecutive tabs (\t\t). You could therefore filter out lines containing any of those:

grep -Ev $'^\t|\t\t|\t$' file
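
As a quick check, here is a hypothetical two-line sample piped through the same filter (the $'...' quoting needs a shell such as bash, ksh or zsh):

printf 'A\tB\tC\nD\t\tF\n' | grep -Ev $'^\t|\t\t|\t$'     ### only the complete A-B-C line prints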

If you can have whitespace, things get more complex. If your fields can begin with spaces, use this instead (it considers a field with only spaces to be empty):

grep -Pv '\t\s*(\t|$)|\t$|^\t' file

The added first alternative filters out lines in which a tab is followed by zero or more whitespace characters and then either another tab or the end of the line.
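
As a sanity check on a hypothetical sample whose middle field holds only a space (grep -P is a GNU extension):

printf 'A\tB\tC\nD\t \tF\n' | grep -Pv '\t\s*(\t|$)|\t$|^\t'     ### the D-line's space-only field matches \t\s*\t and is dropped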

That will also fail if the last field contains nothing but spaces. To avoid that too, use perl with the -F and -a options to split input into the @F array, telling it to print unless one of the fields is empty (/^$/):

perl -F'\t' -lane 'print unless grep{/^$/} @F' file
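
A minimal check with the same hypothetical sample:

printf 'A\tB\tC\nD\t\tF\n' | perl -F'\t' -lane 'print unless grep {/^$/} @F'     ### the empty middle field matches /^$/, so the D-line is skipped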

Solution 2

With awk:

awk -F"\t" '$1!=""&&$2!=""&&$3!=""' file

Actually it is that simple.

  • awk splits the input at the tab field separator (\t) specified with the -F flag. The flag could also be omitted when your content has no spaces in the fields.
  • $1!=""&&... is a condition. When this condition is true, awk simply prints the line. You could also write '$1!=""&&$2!=""&&$3!=""{print}', but that's not necessary: awk's default behavior is to print the line when no action is given. Here, the condition is true when fields $1, $2 and $3 are all non-empty, hence when the first 3 fields have a value (see the quick check after this list).
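
A quick check on a hypothetical two-line sample, where only the complete line survives:

printf 'A\tB\tC\nD\t\tF\n' | awk -F"\t" '$1!=""&&$2!=""&&$3!=""'     ### $2 is empty on the D-line, so the condition is false for it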

To write to another file use this:

awk -F"\t" '$1!=""&&$2!=""&&$3!=""' input_file >output_file

Edit: With an undefined number of columns you could use this awk, which checks every field in the line:

awk -F"\t" '{for(i=1;i<=NF;i++){if($i==""){next}}}1' file

Solution 3

...for any of the below to work you must first do...

t=$(printf \\t)          ### because it's hard to demo CTRL+V TAB 

...now, with a POSIX grep...

grep -Ev "^$t+|$t($t|$)"     <in >out

grep selects the lines which do not match the pattern. The pattern uses the | or metacharacter to denote a ^head-of-line tab, or two consecutive tabs, or a $tail-of-line tab - which are the only possible fail cases as near as I can tell.
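
...for example, with a hypothetical sample (and the t= assignment from above)...

printf 'A\tB\tC\nD\t\tF\nJ\tK\t\n' | grep -Ev "^$t+|$t($t|$)"     ### the double tab and the trailing tab both match, so only A-B-C prints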

Without the -v negation switch it might be:

grep -E "([^$t]+$t){2}[^$t]" <in >out

...which specifies an {occurrence count} for the (pattern group) of + one or more characters in the [class] of characters which are ^not tabs, followed by a tab.
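
...for example, on hypothetical 3-column data (this positive match assumes exactly three columns)...

printf 'A\tB\tC\nD\t\tF\n' | grep -E "([^$t]+$t){2}[^$t]"     ### only A-B-C contains two nonempty-field+tab groups followed by a nonempty field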


...or with a POSIX sed...

sed -ne"s/[^$t][^$t]*/&/3p"  <in >out

...or...

sed -ne"s/[^$t]\{1,\}/&/3p"  <in >out

...or w/ GNU or BSD seds...

sed -Ene"s/[^$t]+/&/3p"      <in >out

...where sed does -not print any line by default unless it can s///ubstitute for &itself the third occurrence on the line of the longest possible sequence of at least one [^not tab] character.
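
...for example, with the same hypothetical sample (and, like the positive grep above, this assumes exactly three columns)...

printf 'A\tB\tC\nD\t\tF\n' | sed -ne"s/[^$t][^$t]*/&/3p"     ### only A-B-C has a third run of non-tab characters to substitute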


(using literal tabs should be preferred for portability. the original version of this answer used \backslash escapes and it was not helpful. definitely using \backslash escapes in a [character class] will limit the applicability of your code.)

Solution 4

awk 'NF==3' file

Prints a line if the number of fields equals 3. It is straightforward to change the number of columns to match your data.
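
For instance, with a hypothetical sample: under awk's default whitespace splitting the double tab collapses into a single separator, which is why this works precisely when fields never contain spaces, as the OP specifies:

printf 'A\tB\tC\nD\t\tF\n' | awk 'NF==3'     ### the D-line has NF==2 under default splitting and is dropped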

However, as pointed out in the comments, this doesn't work with the OP's requirement of handling a variable number of fields.


Comments

  • Gaius Augustus
    Gaius Augustus over 1 year

    I have a large (~900MB) tab-delimited text file that I will be processing in a downstream program. I need to delete any line with a missing value. The correct number of columns are on each line (so a missing value would correspond to 2 tabs).

    Note: My actual data has ~2 million lines and 80-300 columns. Possible characters are a-z A-Z 0-9 - (hyphen) _ (underscore) and tab (delimited). No spaces or special characters are in the file.

    I am new to this kind of scripting, so an explanation of any code provided would be appreciated. I normally use R, but my file sizes have outgrown the data manipulation functionality of R.

    How can I at the terminal (or within a shell script) delete lines with missing values from a file (e.g. using sed)?

    Example Input File:

    Col1    Col2    Col3
    A        B        C
    D                 F
    G        H        I
    J        K        
    

    Example Output File:

    Col1    Col2    Col3
    A        B        C
    G        H        I 
    
  • mikeserv
    mikeserv over 8 years
    @terdon - it works... (you might try using literal tabs though, as recommended. i have an idea).
  • terdon
    terdon over 8 years
    Nope. Still fails if the 1st field is empty. Actually, it looks like grep -E can't understand \t. Try printf 'foo\tbar\n' | grep -E '\t'. You probably need to use $'\t' instead.
  • terdon
    terdon over 8 years
    This fails if the missing field is the last one because your split will also include the trailing \n in the field.
  • mikeserv
    mikeserv over 8 years
    @terdon - it works. even if the first field is empty. you should try using an actual tab.
  • terdon
    terdon over 8 years
    I was using actual tabs, I don't know what you were using. It does work now that you've used a variable for the tab.
  • terdon
    terdon over 8 years
    +1 but why not just negate the [[:space:]] class instead of attempting to define what's allowed? What if a field contains non alphanumeric characters? Also, you might want to use an actual tab instead of just whitespace since fields could contains spaces. Something like this would be safer: grep '^[^[:space:]]\+'$'\t''[^[:space:]]\+'$'\t''[^[:space:]]\+$' file.
  • mikeserv
    mikeserv over 8 years
    @terdon... that's really weird... it doesn't help that the data comes in spaces at an off-8 setting.
  • terdon
    terdon over 8 years
    The OP said the data was tab separated. Don't copy/paste from the question, run it through sed 's/ */\t/g', You might also want to use grep -Ev ^$'\t''+|'$'\t''('$'\t''|$)' file or something, to avoid the variable.
  • mikeserv
    mikeserv over 8 years
    @terdon - i actually used unexpand -a - which is an expressly purposed command for compressing tabs at 8-char tab points - that's all it does (unless you specify -t[tab size/list]). Anyway, using sed like that doesn't squeeze the spaces at tabs - it just squeezes spaces to tabs. as for the $'' thing... um, no thanks. using a variable is how i prefer to do it. still: grep -Ev $'^\t+|\t(\t|$)' should be fine...
  • Smiley
    Smiley over 8 years
    @terdon A valid observation and a nice alternative solution! Personally, I wanted to keep it simple. Moreover, the OP says he is "new to this kind of scripting". Nevertheless, could you please explain what the intermediate $ signs in your grep does?
  • terdon
    terdon over 8 years
    They let grep understand tabs. Vanilla grep doesn't understand \t as a tab but does understand $'\t' since that's expanded by the shell and not grep itself. I just learned, however, that you don't need to break the quotes each time as I did, you can just add the $ at the beginning: grep $'^[^[:space:]]\+\t[^[:space:]]\+\t[^[:space:]]\+$' file. This will still break if any of the fields consists entirely of spaces though.
  • mikeserv
    mikeserv over 8 years
    @terdon - grep does understand tabs. some greps don't understand backslash+t.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 8 years
    The fields are tab-delimited, not whitespace-delimited. Your approach can fail if a field contains a space.
  • Gilles 'SO- stop being evil'
    Gilles 'SO- stop being evil' over 8 years
    What's all this about [[:alnum:]] and [[:space:]]? The question specifies tab-delimited fields.
  • Sobrique
    Sobrique over 8 years
    But none the less, works fine given the example data.
  • Gaius Augustus
    Gaius Augustus over 8 years
    If there are more than 3 columns, it is necessary to list all columns, correct? My data may have anywhere from 75-500 columns, so while a great answer for my toy data, may not work as is for my actual data. Any thoughts?
  • Gaius Augustus
    Gaius Augustus over 8 years
    My data has no spaces in it, so this may work. So, I just need to make @F== set to the number of columns in my file, correct? Can I then output this to a file? perl -lane 'print if @F==n' > output.txt?
  • Gaius Augustus
    Gaius Augustus over 8 years
    I like that the perl answer is simple to understand and works no matter how many columns I have.
  • Gaius Augustus
    Gaius Augustus over 8 years
    @terdon I just tried to run the perl on my data. I happen to know that about 700,000 lines do NOT contain missing data. I used wc -l to compare the number of lines, and this version of the perl did not remove any lines. The answer further down perl -lane 'print if @F==81' file removed the correct number of lines. I'd like to be able to use this if possible though (I need to do this again and would like something that works with any number of columns)...any thoughts?
  • mikeserv
    mikeserv over 8 years
    @GaiusAugustus - the first grep in my answer should work regardless of column count - if there are two adjacent tabs anywhere on a line it will not print the line. im curious to know what it would do w/ wc -l. it should also be considerably faster than awk or perl.
  • Gaius Augustus
    Gaius Augustus over 8 years
    I had to reread the first grep a few times, but I now understand it. I tried it, and it outputs the correct number of rows. Thank you for the clarification.
  • Gaius Augustus
    Gaius Augustus over 8 years
    Tried your first grep grep -Ev "^$t+|$t($t|$)" <in.txt >out.txt and the output of wc -l is 0. Then tried your commented: grep -Ev $'^\t+|\t(\t|$)' <in.txt >out.txt and the output of wc -l is the original number of lines (~2 million).
  • mikeserv
    mikeserv over 8 years
    @GaiusAugustus - the output would be zero if you didn't first do t=$(printf \\t). the second thing i cant explain and doesn't make much sense to me. it shouldnt work like that at all - especially if the grep -P thing did work. unless there is some problem w/ $'' - which is possible - and which is why i recommend t=$(printf \\t) first. so: t=$(printf \\t);grep -Ev "^$t+|$t($t|$)" <in.txt >out.txt
  • Gaius Augustus
    Gaius Augustus over 8 years
    @mikeserv Sorry, I did the t=$(printf \\t) and then your first grep. Output of wc -l is still the original number of rows (~2 million)
  • mikeserv
    mikeserv over 8 years
    @GaiusAugustus - in that case im wrong. and i dunno why. thats annoying. thanks for trying. its really strange in that the grep -P did work - which is essentially the same as this answer.... hey wait? where the fields are blank - are there spaces? because that would explain it - and it would explain why the input was at an off-8 setting.
  • mikeserv
    mikeserv over 8 years
    @GaiusAugustus - if I'm right then t=$(printf \\t);grep -Ev " ($t|$)" <in.txt >out.txt will work for you. if not im stumped.
  • Sobrique
    Sobrique over 8 years
    Yes. Correct. You could probably auto set it based on the header row. Something like '$n //= @F; print if @F == $n'
  • chaos
    chaos over 8 years
    @GaiusAugustus See my edit.
  • terdon
    terdon over 8 years
    This doesn't work. If a field is empty, NF will still be 3. It looks like it works because you're not using the correct delimiter. The OP has tab-separated data so there could be spaces within the fields. Your approach will skip such lines. Try, for example: printf 'a b\tc\td\n' | awk 'NF==3'. That prints a line with 3 fields but your awk will think there are 4.
  • Joe
    Joe over 8 years
    Quote: "Possible characters are a-z A-Z 0-9 - (hyphen) _ (underscore) and tab (delimited). No spaces or special characters are in the file."
  • terdon
    terdon over 8 years
    Ah, damn, good point. That wasn't in the original version of the question which is why I hadn't posted precisely this answer :).
  • terdon
    terdon over 8 years
    I'm afraid this still fails if the last field contains spaces but the OP has now clarified that is not a possibility.
  • Joe
    Joe over 8 years
    Oh right, that's probably why every other answer has such a complex approach. :)
  • terdon
    terdon over 8 years
    There's also the point the OP made in this comment. His lines have a variable number of fields, so this won't work on his actual data.
  • terdon
    terdon over 8 years
    @GaiusAugustus the perl won't change the original file, you need to redirect its output to a new file or use -i. However, now that you've specified that your data can never contain spaces, you can simply use time grep -Ev $'^\t|\t\t|\t$' file1> out4 which is blindingly fast and works on an arbitrary number of fields.
  • rotarydial
    rotarydial over 3 years
    This was helpful, but didn't work for me when checking whether the last column was null in a text file from Windows. In my case, the ninth column was last and was what I needed to skip when null. The condition $9!="\r" was the only thing that worked.