Split a FASTA file and name each part after its header line

Solution 1

Since you indicate you're on a Linux box, awk seems to be the right tool for the job.

USAGE:
./foo.awk your_input_file

foo.awk:

#!/usr/bin/awk -f

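# On a header line starting with ">chr", derive the output file name from it
/^>chr/ {
    OUT=substr($0,2) ".fa"
}

# Once a header has been seen, write every line to the current output file
OUT {
    print >OUT
}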

You can also do that in one line:

awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input

Solution 2

If you find yourself wanting to do anything more complicated with FASTA/FASTQ files, you should consider Biopython.

Here's a post about modifying and re-writing FASTQ files: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/

And another about splitting up FASTA files: http://lists.open-bio.org/pipermail/biopython/2012-July/008102.html
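
As a rough illustration of the Biopython route, the sketch below writes each record to its own file using SeqIO.parse and SeqIO.write. The function names split_fasta and output_name, and the rule for replacing awkward characters in file names, are assumptions made for this example and are not taken from the linked posts. Note that SeqIO.write re-wraps sequence lines, so blank lines inside a record are not preserved.

```python
import re


def output_name(record_id):
    """Build a file name such as 'chr1.fa' from a record ID (hypothetical helper)."""
    # Replacing unusual characters with "_" is an assumed policy, not a Biopython rule.
    return re.sub(r"[^A-Za-z0-9._-]", "_", record_id) + ".fa"


def split_fasta(path):
    """Write each record of the FASTA file at `path` to its own file."""
    # Imported here so output_name can be used even where Biopython is not installed.
    from Bio import SeqIO

    for record in SeqIO.parse(path, "fasta"):
        # One output file per record, named after the record ID.
        SeqIO.write(record, output_name(record.id), "fasta")
```

Calling split_fasta("input.txt") on the example from the question would be expected to produce chr1.fa, chr2.fa, chr3.fa and chr2_random.fa in the current directory.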

Solution 3

This is a slightly messy script, but it should work on a large file because it only reads one line at a time.

To run it, use python thescript.py input.txt (or pipe the data in on stdin, like cat input.txt | python thescript.py).

import sys
import fileinput

in_file = False

for line in fileinput.input():
    if line.startswith(">"):
        # Close current file
        if in_file:
            f.close()

        # Make new filename
        fname = line.rstrip().partition(">")[2]
        fname = "%s.fa" % fname

        # Open new file
        f = open(fname, "w")
        in_file = True

        # Write current line
        f.write(line)

    elif in_file:
        # Write line to currently open file
        f.write(line)

    else:
        # Something went wrong, no ">chr1" found yet
        sys.stderr.write("Line %r encountered, but no preceding > line found\n" % line)

# Close the last output file once the input is exhausted
if in_file:
    f.close()
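
The same line-by-line idea can be sketched for Python 3 as a reusable function. This is a minimal sketch, not a drop-in replacement: the name split_fasta_lines and the out_dir parameter are assumptions for illustration, and it assumes header text is safe to use as a file name (no "/" characters).

```python
import os
import sys


def split_fasta_lines(lines, out_dir="."):
    """Write each '>'-headed record from `lines` to <out_dir>/<header>.fa.

    Returns the list of file names written, in order of appearance.
    """
    written = []
    out = None
    try:
        for line in lines:
            if line.startswith(">"):
                # A new record begins: close the previous file, open the next.
                if out is not None:
                    out.close()
                name = line[1:].strip() + ".fa"
                out = open(os.path.join(out_dir, name), "w")
                written.append(name)
            if out is not None:
                # Header and body lines both go to the current file.
                out.write(line)
            else:
                sys.stderr.write("Line %r found before any '>' header\n" % line)
    finally:
        # Make sure the last file is closed even if an error occurs.
        if out is not None:
            out.close()
    return written
```

Any iterable of lines works as input, for example an open file object or sys.stdin, so only one line is held in memory at a time.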

Solution 4

Your best bet would be to use the fastaexplode program from the exonerate suite:

$ fastaexplode -h
fastaexplode from exonerate version 2.2.0
Using glib version 2.30.2
Built on Jan 12 2012
Branch: unnamed branch

fastaexplode: Split a fasta file up into individual sequences
Guy St.C. Slater. [email protected]. 2000-2003.

Synopsis:
--------
fastaexplode <path>

General Options:
---------------
-h --shorthelp [FALSE] <TRUE>
   --help [FALSE] 
-v --version [FALSE] 

Sequence Input Options:
----------------------
-f --fasta [mandatory]  <*** not set ***>
-d --directory [.] 

--
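
Going by the options listed in the help text above (-f for the input file, -d for the output directory), a call could be wrapped from Python roughly as follows. The function names are hypothetical, and the sketch assumes fastaexplode is on the PATH. The help text does not show how the output files are named, so check that before relying on it.

```python
import subprocess


def fastaexplode_command(fasta_path, out_dir="."):
    """Build the argument list from the -f and -d options shown in the help text."""
    return ["fastaexplode", "-f", fasta_path, "-d", out_dir]


def run_fastaexplode(fasta_path, out_dir="."):
    """Run fastaexplode; raises CalledProcessError on a non-zero exit status."""
    subprocess.run(fastaexplode_command(fasta_path, out_dir), check=True)
```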
Author: learner

Updated on November 22, 2022

Comments

  • learner
    learner over 1 year

    I have a huge file with the following content:

    filename: input.txt

    >chr1
    jdlfnhl
    dh,ndh
    dnh.
    
    dhjl
    
    >chr2
    dhfl
    dhl
    dh;l
    
    >chr3
    
    shgl
    sgl
    
    >chr2_random
    dgld
    

    I need to split this file in such a way that I get four separate files, as below:

    file 1: chr1.fa

    >chr1
    jdlfnhl
    dh,ndh
    dnh.
    
    dhjl
    

    file 2: chr2.fa

    >chr2
    dhfl
    dhl
    dh;l
    

    file 3: chr3.fa

    >chr3
    
    shgl
    sgl
    

    file 4: chr2_random.fa

    >chr2_random
    dgld
    

    I tried csplit on Linux, but could not name the resulting files after the text immediately following ">".

    csplit -z input.txt '/>/' '{*}'
    
  • dbr
    dbr over 11 years
    The indentation is missing, as is the : on else
  • dbr
    dbr over 11 years
    The question mentions a "huge" file, potentially loading the entire thing at once might use too much memory... or maybe not, so +1
  • perilbrain
    perilbrain over 11 years
    Yeah, I tried editing, but it's getting a request timeout. Don't know what's wrong with Opera :(
  • tzelleke
    tzelleke over 11 years
    I forgot that - fixed it now.
  • tzelleke
    tzelleke over 11 years
    There's another point: this solution retains the blank lines at the end of subfiles. Is that an issue?
  • Perlnika
    Perlnika about 11 years
    Thanks, this works great. Please, do you know how to modify it in order to skip those files that would contain string "random" in their names?
  • tzelleke
    tzelleke about 11 years
    Use this in place of the OUT rule: OUT && OUT !~ /random/ {print >OUT}
  • Lenna
    Lenna over 10 years
    str.format() was introduced in Python 2.6 (see the docs).
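
For the Python route, the request in the comments to skip records whose names contain "random" can be handled by filtering lines before they are split. The generator below is a minimal sketch; the name drop_records and its unwanted parameter are assumptions for illustration. Lines that appear before the first header are passed through unchanged.

```python
def drop_records(lines, unwanted="random"):
    """Yield lines, skipping every record whose '>' header contains `unwanted`."""
    keep = True
    for line in lines:
        if line.startswith(">"):
            # Decide once per record, based on its header line.
            keep = unwanted not in line
        if keep:
            yield line
```

The filtered lines can then be fed to whichever splitting routine is in use, so the "random" records never reach an output file.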