Remove line breaks in a FASTA file

Solution 1

This awk program:

% awk '!/^>/ { printf "%s", $0; n = "\n" } 
/^>/ { print n $0; n = "" }
END { printf "%s", n }
' input.fasta

Will yield:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

On lines that don't start with a >, print the line without a line break and store a newline character (in variable n) for later.

On lines that do start with a >, print the stored newline character (if any) and the line. Reset n, in case this is the last line.

End with a newline, if required.

Note:

By default, variables are initialized to the empty string. There is no need to explicitly "initialize" a variable in awk, which is what you would do in C and in most other traditional languages.

--6.1.3.1 Using Variables in a Program, The GNU Awk User's Guide
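
A quick way to see this default, as a minimal sketch: an unset awk variable prints as an empty string and evaluates to 0 in arithmetic. The name n is borrowed from Solution 1, and empty_demo is a hypothetical shell variable used only for illustration.

```shell
# n is never assigned: it is "" in string context and 0 in numeric context.
empty_demo=$(awk 'BEGIN { printf "[%s] [%d]\n", n, n + 0 }')
echo "$empty_demo"
```

This is why Solution 1 can print n before the first header without emitting anything.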

Solution 2

The accepted solution is fine, but it's not particularly AWKish. Consider using this instead:

 awk '/^>/ { print (NR==1 ? "" : RS) $0; next } { printf "%s", $0 } END { printf RS }' file

Explanation:

For lines beginning with >, print the line. A ternary operator is used to print a leading newline character if the line is not the first in the file. For lines not beginning with >, print the line without a trailing newline character. Since the last line in the file won't begin with >, use the END block to print a final newline character.

Note that the above can also be written more briefly, by setting a null output record separator, enabling default printing and re-assigning lines beginning with >. Try:

awk -v ORS= '/^>/ { $0 = (NR==1 ? "" : RS) $0 RS } END { printf RS }1' file

Solution 3

Here is another awk one-liner that should work for your case:

awk '/^>/{print s? s"\n"$0:$0;s="";next}{s=s sprintf("%s",$0)}END{if(s)print s}' file
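
No explanation accompanies this one-liner, so here is an expanded sketch of the same idea with comments. The sample data is the input from the question. The temporary file and the shell variable joined are assumptions for illustration, and the test s != "" stands in for the truthiness check on s in the one-liner.

```shell
# Sample input from the question, written to a temporary file (hypothetical path).
tmp=$(mktemp)
printf '>accession1\nATGGCCCATG\nGGATCCTAGC\n>accession2\nGATATCCATG\nAAACGGCTTA\n' > "$tmp"

joined=$(awk '
/^>/ {
    # Header line: flush the sequence buffered for the previous record, if any.
    if (s != "") print s
    # Print the header itself and start a fresh buffer.
    print
    s = ""
    next
}
# Sequence line: append it to the buffer without a newline.
{ s = s $0 }
# After the last line, flush the final sequence.
END { if (s != "") print s }
' "$tmp")

echo "$joined"
rm -f "$tmp"
```

Unlike Solutions 1 and 2, this approach holds each whole sequence in memory before printing it, which may matter for very long sequences.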

Solution 4

I would use sed for this. Using GNU sed:

sed ':a; $!N; /^>/!s/\n\([^>]\)/\1/; ta; P; D' file

Results:

>accession1
ATGGCCCATGGGATCCTAGC
>accession2
GATATCCATGAAACGGCTTA

Explanation:

:a creates a label, a.

$!N appends the next input line to the pattern space, unless the current line is the last line in the file.

/^>/!s/\n\([^>]\)/\1/ applies only when the pattern space doesn't start with >. It removes an embedded newline that is followed by a character other than >, which joins two sequence lines.

ta branches back to label a if a substitution has succeeded since the last input line was read or the last branch was taken, so the next line can be joined as well.

P prints the pattern space up to the first embedded newline.

D deletes the pattern space up to the first newline and restarts the cycle with what remains, without reading a new line of input. If the pattern space contains no newline, it starts a normal new cycle, as if the d command had been issued.
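
The same script can be laid out one command per line with comments, which may be easier to follow. This is a sketch that assumes GNU sed, as in the answer. The temporary file and the shell variable joined are hypothetical names, and the sample data is the input from the question.

```shell
# Sample input from the question, written to a temporary file (hypothetical path).
tmp=$(mktemp)
printf '>accession1\nATGGCCCATG\nGGATCCTAGC\n>accession2\nGATATCCATG\nAAACGGCTTA\n' > "$tmp"

joined=$(sed '
# Define label a.
:a
# Unless this is the last line, append the next input line.
$!N
# If the pattern space is not a header, drop a newline followed by a non-header character.
/^>/!s/\n\([^>]\)/\1/
# If that substitution succeeded, loop to join the following line too.
ta
# Print up to the first newline.
P
# Delete up to the first newline and restart with the remainder.
D
' "$tmp")

echo "$joined"
rm -f "$tmp"
```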

Solution 5

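You might be interested in bioawk, an adapted version of awk that is tuned to process FASTA files. With -c fastx, it parses each record and exposes the sequence as $seq with the line breaks already removed, so no gsub is needed:

bioawk -c fastx '{ print ">"$name; print $seq }' file.fasta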

Note: BioAwk is based on Brian Kernighan's awk, which is documented in "The AWK Programming Language" by Al Aho, Brian Kernighan, and Peter Weinberger (Addison-Wesley, 1988, ISBN 0-201-07981-X). I'm not sure if this version is compatible with POSIX.

Author by

chimeric

Updated on July 09, 2022

Comments

  • chimeric, almost 2 years ago

    I have a fasta file where the sequences are broken up with newlines. I'd like to remove the newlines. Here's an example of my file:

    >accession1
    ATGGCCCATG
    GGATCCTAGC
    >accession2
    GATATCCATG
    AAACGGCTTA
    

    I'd like to convert it into this:

    >accession1
    ATGGCCCATGGGATCCTAGC
    >accession2
    GATATCCATGAAACGGCTTA
    

    I found a potential solution on this site, which looks like this:

    cat input.fasta | awk '{if (substr($0,1,1)==">"){if (p){print "\n";} print $0} else printf("%s",$0);p++;}END{print "\n"}' > joinedlineoutput.fasta
    

    However, this places an extra line break between each entry, so file looks like this:

    >accession1
    ATGGCCCATGGGATCCTAGC
    
    >accession2
    GATATCCATGAAACGGCTTA
    

    I'm an awk noob, but I took a shot at modifying the command. My guess was the if (p){print "\n";} was the culprit...potentially print "\n" is adding two line breaks. I couldn't figure out how to add just one newline...this is probably something easy, but like I said, I'm a noob. Here was my (unsuccessful) solution:

    awk '{if (substr($0,1,1)==">"){print "\n"$0} else printf("%s",$0);p++;}END{print "\n"}' input.fasta > joinedoutput.fasta
    

    However, this adds an empty line at the beginning of the file because it's always printing a new line before it prints the first accession number:

    {empty line} 
    >accession1
    ATGGCCCATGGGATCCTAGC
    >accession2
    GATATCCATGAAACGGCTTA
    

    Anyone have a solution to get my file in the correct format? Thanks!
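
The guess about print "\n" in the question can be checked directly. The sketch below counts the newlines each statement emits. The shell variables with_print and with_printf are hypothetical names for illustration.

```shell
# print appends the output record separator (a newline by default) after its
# argument, so print "\n" emits two newlines; printf emits exactly what it is given.
with_print=$(awk 'BEGIN { print "\n" }' | wc -l | tr -d ' ')
with_printf=$(awk 'BEGIN { printf "\n" }' | wc -l | tr -d ' ')

echo "print:  $with_print newline(s)"
echo "printf: $with_printf newline(s)"
```

This accounts for the extra blank line between entries, and it is why the answers use printf or a bare print for the record boundary.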