Split a FASTA file and rename on the basis of the first line

python linux split fasta

10,208

Solution 1

Since you indicate you're on a Linux box, awk seems to be the right tool for the job.

USAGE:
./foo.awk your_input_file

foo.awk:

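#!/usr/bin/awk -f

# On each header line (">chr..."), name the output file after the header text.
/^>chr/ {
    OUT=substr($0,2) ".fa"
}

# Once a header has been seen, write every line (header included) to that file.
OUT {
    print >OUT
}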

You can also do this in one line:

awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input

Solution 2

If you find yourself wanting to do anything more complicated with FASTA/FASTQ files, you should consider Biopython.

Here's a post about modifying and re-writing FASTQ files: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

And another about splitting up FASTA files: http://lists.open-bio.org/pipermail/biopython/2012-July/008102.html

Solution 3

A slightly messy script, but it should work on a large file because it only reads one line at a time.

To run it, use python thescript.py input.txt (or let it read from stdin, as in cat input.txt | python thescript.py).

import sys
import fileinput

in_file = False

for line in fileinput.input():
    if line.startswith(">"):
        # Close current file
        if in_file:
            f.close()

        # Make new filename
        fname = line.rstrip().partition(">")[2]
        fname = "%s.fa" % fname

        # Open new file
        f = open(fname, "w")
        in_file = True

        # Write current line
        f.write(line)

    elif in_file:
        # Write line to currently open file
        f.write(line)

    else:
        # Something went wrong, no ">chr1" found yet
        sys.stderr.write("Line %r encountered, but no preceding > line found\n" % line)

# Close the last output file
if in_file:
    f.close()
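
For Python 3, the same line-at-a-time idea can be sketched as a function. This is a minimal sketch rather than a drop-in replacement: split_fasta and its out_dir parameter are illustrative names, it assumes each header is a simple name that is safe to use as a file name, and it silently skips any lines that appear before the first header instead of warning about them.

```python
import os


def split_fasta(in_path, out_dir="."):
    """Write each record of in_path to <out_dir>/<header>.fa; return the paths written."""
    written = []
    out = None
    try:
        with open(in_path) as src:
            for line in src:  # one line at a time, so memory use stays small
                if line.startswith(">"):
                    if out is not None:
                        out.close()
                    # Assumption: the text after ">" is a safe file name.
                    name = line[1:].strip()
                    path = os.path.join(out_dir, name + ".fa")
                    out = open(path, "w")
                    written.append(path)
                if out is not None:
                    out.write(line)
    finally:
        # Make sure the last output file is closed even if an error occurs.
        if out is not None:
            out.close()
    return written
```

Calling split_fasta("input.txt") would then create chr1.fa, chr2.fa and so on in the current directory.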

Solution 4

Your best bet would be to use the fastaexplode program from the exonerate suite:

$ fastaexplode -h
fastaexplode from exonerate version 2.2.0
Using glib version 2.30.2
Built on Jan 12 2012
Branch: unnamed branch

fastaexplode: Split a fasta file up into individual sequences
Guy St.C. Slater. [email protected]. 2000-2003.

Synopsis:
--------
fastaexplode <path>

General Options:
---------------
-h --shorthelp [FALSE] <TRUE>
   --help [FALSE] 
-v --version [FALSE] 

Sequence Input Options:
----------------------
-f --fasta [mandatory]  <*** not set ***>
-d --directory [.] 

--


Author by

learner

Updated on November 22, 2022

Comments

learner over 1 year
I have a huge file with following content:

filename: input.txt
```
>chr1
jdlfnhl
dh,ndh
dnh.

dhjl

>chr2
dhfl
dhl
dh;l

>chr3

shgl
sgl

>chr2_random
dgld
```
I need to split this file in such a way that I get four separate files, as below:

file 1: chr1.fa
```
>chr1
jdlfnhl
dh,ndh
dnh.

dhjl
```
file 2: chr2.fa
```
>chr2
dhfl
dhl
dh;l
```
file 3: chr3.fa
```
>chr3

shgl
sgl
```
file 4: chr2_random.fa
```
>chr2_random
dgld
```
I tried csplit on Linux, but could not rename the pieces by the text immediately after ">".
```
csplit -z input.txt '/>/' '{*}'
```
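
One possible way to finish the csplit attempt is to rename its output afterwards. The following is a minimal sketch, assuming csplit's default output names (xx00, xx01, ...) and that each piece begins with a ">" header line. rename_csplit_pieces is a hypothetical helper name, not part of any tool.

```python
import glob
import os


def rename_csplit_pieces(directory=".", pattern="xx[0-9]*"):
    """Rename csplit output files to <header>.fa based on their first line."""
    renamed = {}
    for path in sorted(glob.glob(os.path.join(directory, pattern))):
        with open(path) as fh:
            first = fh.readline().strip()
        if not first.startswith(">"):
            # Not a FASTA piece (e.g. text before the first header); leave it alone.
            continue
        # Assumption: the header text is a safe, unique file name.
        new_path = os.path.join(directory, first[1:] + ".fa")
        os.rename(path, new_path)
        renamed[path] = new_path
    return renamed
```

Run it in the directory where csplit wrote its pieces, for example with rename_csplit_pieces(".").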
dbr over 11 years

The indentation is missing, as is the : on else
dbr over 11 years

The question mentions a "huge" file, so loading the entire thing at once might use too much memory... or maybe not, so +1
perilbrain over 11 years

Yeah, I tried editing but it keeps getting a request timeout. Don't know what's wrong with Opera :(
tzelleke over 11 years

I forgot that - fixed it now.
tzelleke over 11 years

There's another point: this solution retains the blank lines at the end of subfiles. Is that an issue?
Perlnika about 11 years

Thanks, this works great. Do you know how to modify it to skip the files that would contain the string "random" in their names?
tzelleke about 11 years

Use this: OUT && !(OUT ~ /random/) {print >OUT}
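
The same filter can also be sketched in Python. This is only an illustration under a few assumptions: split_fasta_skipping and its skip parameter are made-up names, headers are assumed to be simple names that are safe as file names, and any record whose name contains the skip string is dropped entirely.

```python
import os


def split_fasta_skipping(lines, out_dir=".", skip="random"):
    """Split FASTA lines into <header>.fa files, ignoring records whose name contains skip."""
    out = None
    try:
        for line in lines:
            if line.startswith(">"):
                if out is not None:
                    out.close()
                    out = None
                name = line[1:].strip()
                # Open a file only for wanted records; skipped records leave out as None.
                if skip not in name:
                    out = open(os.path.join(out_dir, name + ".fa"), "w")
            if out is not None:
                out.write(line)
    finally:
        # Close the last output file, if any.
        if out is not None:
            out.close()
```

Any iterable of lines works as input, so an open file handle can be passed directly, for example with open("input.txt") as fh: split_fasta_skipping(fh).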
Lenna over 10 years

str.format() was introduced in Python 2.6.