Split a FASTA file and rename the pieces on the basis of the first line
Solution 1
Since you indicate you're on a Linux box, awk seems to be the right tool for the job.
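USAGE:
    ./foo.awk your_input_file
foo.awk:
    #!/usr/bin/awk -f
    # A header line such as ">chr1" sets the output file name to "chr1.fa"
    /^>chr/ {
        OUT = substr($0, 2) ".fa"
    }
    # Once a header has been seen, every line (header included) goes to that file
    OUT {
        print > OUT
    }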
You can also do that in one line:
    awk '/^>chr/ {OUT=substr($0,2) ".fa"}; OUT {print >OUT}' your_input
Solution 2
If you find yourself wanting to do anything more complicated with FASTA/FASTQ files, you should consider Biopython.
Here's a post about modifying and re-writing FASTQ files: http://news.open-bio.org/news/2009/09/biopython-fast-fastq/
And another about splitting up FASTA files: http://lists.open-bio.org/pipermail/biopython/2012-July/008102.html
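For the splitting task in the question, a minimal Biopython sketch could look like the one below. It assumes Biopython is installed; split_fasta_with_biopython and its out_dir parameter are names made up for this illustration, while SeqIO.parse and SeqIO.write are the library's own functions. Note that SeqIO.write re-wraps the sequence lines, so the output is not a byte-for-byte copy of the input, and each record is held in memory while it is written.

```python
import os


def split_fasta_with_biopython(fasta_path, out_dir="."):
    """Write each record of fasta_path to <out_dir>/<record.id>.fa.

    Hypothetical helper for illustration; returns the paths it wrote.
    """
    # Imported here so the function can be defined even where the
    # third-party Biopython package is not installed.
    from Bio import SeqIO

    written = []
    # SeqIO.parse yields one record at a time instead of loading the file
    for record in SeqIO.parse(fasta_path, "fasta"):
        # record.id is the header text up to the first whitespace, e.g. "chr1"
        out_path = os.path.join(out_dir, record.id + ".fa")
        SeqIO.write(record, out_path, "fasta")
        written.append(out_path)
    return written
```

Calling split_fasta_with_biopython("input.txt") would then write chr1.fa, chr2.fa and so on into the current directory.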
Solution 3
Slightly messy script, but it should work on a large file, as it only reads one line at a time.
To run it: python thescript.py input.txt
(or it will read from stdin, e.g. cat input.txt | python thescript.py)
    import sys
    import fileinput

    in_file = False
    for line in fileinput.input():
        if line.startswith(">"):
            # Close current file
            if in_file:
                f.close()
            # Make new filename from the text after ">"
            fname = line.rstrip().partition(">")[2]
            fname = "%s.fa" % fname
            # Open new file
            f = open(fname, "w")
            in_file = True
            # Write current line
            f.write(line)
        elif in_file:
            # Write line to currently open file
            f.write(line)
        else:
            # Something went wrong, no ">chr1" found yet
            sys.stderr.write("Line %r encountered, but no preceding > line found\n" % line)
    # Close the last output file
    if in_file:
        f.close()
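The same line-at-a-time idea can also be wrapped in a function, which makes it easier to leave some records out. This is only a sketch under a few assumptions: split_fasta, skip_substring and out_dir are hypothetical names, records whose header contains skip_substring (for example "random") are simply not written, and lines that appear before the first header are dropped silently rather than reported.

```python
import os


def split_fasta(lines, skip_substring=None, out_dir="."):
    """Split an iterable of FASTA lines into <header>.fa files.

    Only one line is held in memory at a time. Returns the list of
    file paths that were written, in input order.
    """
    out = None
    written = []
    try:
        for line in lines:
            if line.startswith(">"):
                # A new record starts: finish the previous file first
                if out is not None:
                    out.close()
                    out = None
                name = line[1:].strip()
                if skip_substring and skip_substring in name:
                    # out stays None, so this whole record is skipped
                    continue
                path = os.path.join(out_dir, name + ".fa")
                out = open(path, "w")
                written.append(path)
            if out is not None:
                out.write(line)
    finally:
        # Make sure the last file is closed even if an error occurs
        if out is not None:
            out.close()
    return written
```

With an open file handle passed as lines, e.g. split_fasta(open("input.txt"), skip_substring="random"), the chr2_random record from the question would be left out.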
Solution 4
Your best bet would be to use the fastaexplode program from the exonerate suite:
$ fastaexplode -h
fastaexplode from exonerate version 2.2.0
Using glib version 2.30.2
Built on Jan 12 2012
Branch: unnamed branch
fastaexplode: Split a fasta file up into individual sequences
Guy St.C. Slater. [email protected]. 2000-2003.
Synopsis:
--------
fastaexplode <path>
General Options:
---------------
-h --shorthelp [FALSE] <TRUE>
--help [FALSE]
-v --version [FALSE]
Sequence Input Options:
----------------------
-f --fasta [mandatory] <*** not set ***>
-d --directory [.]
--
learner
Updated on November 22, 2022
Comments
- learner (over 1 year ago):
I have a huge file with the following content:
filename: input.txt
>chr1
jdlfnhl dh,ndh dnh. dhjl
>chr2
dhfl dhl dh;l
>chr3
shgl sgl
>chr2_random
dgld
I need to split this file in such a way that I get four separate files, as below:
file 1: chr1.fa
>chr1
jdlfnhl dh,ndh dnh. dhjl
file 2: chr2.fa
>chr2
dhfl dhl dh;l
file 3: chr3.fa
>chr3
shgl sgl
file 4: chr2_random.fa
>chr2_random
dgld
I tried csplit on Linux, but could not name the resulting files after the text immediately following ">".
    csplit -z input.txt '/>/' '{*}'
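For what it's worth, the csplit attempt can be finished off by renaming the pieces afterwards. A minimal sketch, assuming csplit was run with its default output names (xx00, xx01, ...) and that no unrelated files match the pattern; rename_csplit_pieces and its parameters are hypothetical names for this illustration:

```python
import glob
import os


def rename_csplit_pieces(pattern="xx*", suffix=".fa"):
    """Rename each csplit piece after the text following ">" on its first line.

    Pieces whose first line is not a header are left untouched.
    Returns a dict mapping old paths to new paths.
    """
    renamed = {}
    for piece in sorted(glob.glob(pattern)):
        # Only the first line is read, so large pieces are not loaded
        with open(piece) as handle:
            first_line = handle.readline()
        if not first_line.startswith(">"):
            continue
        name = first_line[1:].strip()
        if not name:
            continue
        # Keep the renamed file in the same directory as the piece
        target = os.path.join(os.path.dirname(piece), name + suffix)
        os.rename(piece, target)
        renamed[piece] = target
    return renamed
```

Run in the directory where csplit wrote its output, rename_csplit_pieces() would turn a piece starting with ">chr1" into chr1.fa.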
- dbr (over 11 years ago): The indentation is missing, as is the : on else
- dbr (over 11 years ago): The question mentions a "huge" file; loading the entire thing at once might use too much memory... or maybe not, so +1
- perilbrain (over 11 years ago): Ya, I tried editing but it's getting a request timeout. Don't know what's wrong with Opera :(
- tzelleke (over 11 years ago): I forgot that - fixed it now.
- tzelleke (over 11 years ago): There's another point: this solution retains the blank lines at the end of subfiles. Is that an issue?
- Perlnika (about 11 years ago): Thanks, this works great. Do you know how to modify it in order to skip those files that would contain the string "random" in their names?
- tzelleke (about 11 years ago): use this:
    !(OUT ~/random/) {print >OUT}
- Lenna (over 10 years ago): str.format() was introduced in 2.6 (docs)