Sort a file by first (or second, or any other) column in Python
Solution 1
The problem is that you're not turning each line into a list. When you read in the file, each line is a single string, so key=itemgetter(0) sorts by the first character of each line, and that is always the same character in your input: 'E'. Because every key is equal and Python's sort is stable, the lines come out in their original order.
To just sort by the first column, you need to split the first block off and just read that section. So your key should be this:
for line in sorted(lines, key=lambda line: line.split()[0]):
    output.write(line)
split() will turn your line into a list, and then the first column is taken from that list.
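As a minimal sketch of the idea above, the key can be wrapped in a small helper so the column is a parameter. The function name sort_lines_by_column and the sample list are hypothetical; the sample stands in for the result of readlines() on the real file.

```python
def sort_lines_by_column(lines, col=0):
    """Return the lines sorted by the whitespace-separated column `col`."""
    # split() with no argument splits on any run of whitespace,
    # so variable spacing between columns is handled.
    return sorted(lines, key=lambda line: line.split()[col])


# Hypothetical sample standing in for the lines read from the file.
sample = [
    "ENSMUSG00000098737 95734911 95734973 3 miRNA\n",
    "ENSMUSG00000077677 101186764 101186867 4 snRNA\n",
    "ENSMUSG00000028255 145003817 145032776 3 protein_coding\n",
]

sorted_sample = sort_lines_by_column(sample)
for line in sorted_sample:
    print(line, end="")
```

Note that a blank line in the input would make line.split()[col] raise an IndexError, so blank lines would need to be filtered out first.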
Solution 2
If your input file is tab-separated, you can also use the csv module.
import csv
from operator import itemgetter

with open("t.txt", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    for line in sorted(reader, key=itemgetter(0)):
        print(line)
This sorts by the first column. Change the number in key=itemgetter(0) to sort by a different column.
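One thing to watch when switching to a numeric column: csv.reader returns every field as a string, so the comparison is character by character rather than by value. The sketch below illustrates this with a hypothetical three-row, tab-separated sample held in memory (io.StringIO stands in for the opened file) and converts the field to int inside the key.

```python
import csv
import io

# In-memory stand-in for a tab-separated file (hypothetical data).
data = io.StringIO(
    "ENSMUSG00000098737\t95734911\t95734973\n"
    "ENSMUSG00000077677\t101186764\t101186867\n"
    "ENSMUSG00000092727\t68990574\t68990678\n"
)
rows = list(csv.reader(data, delimiter="\t"))

# String comparison: "101186764" sorts before "68990574" because '1' < '6'.
as_text = sorted(rows, key=lambda row: row[1])

# Numeric comparison: convert the field to int inside the key.
as_numbers = sorted(rows, key=lambda row: int(row[1]))

print([row[1] for row in as_text])
print([row[1] for row in as_numbers])
```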
Solution 3
Same idea as SuperBiasedMan, but I prefer this approach: if you want another way of sorting (for example, if the first column matches, sort by the second, then the third, etc.), it is more easily implemented.
with open(my_file) as f:
    lines = [line.split(' ') for line in f]

with open("result.txt", 'w') as output:
    # lists compare element by element, so ties in the first
    # column are broken by the second, then the third, etc.
    for line in sorted(lines):
        output.write(' '.join(line))
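To illustrate the tie-breaking behaviour, here is a small sketch on hypothetical rows that are already split into columns. It also shows that itemgetter accepts several indices, which is one way to choose exactly which columns take part in the comparison and in what order.

```python
from operator import itemgetter

# Hypothetical rows, already split into columns.
rows = [
    ["ENSMUSG00000028255", "145003817", "protein_coding"],
    ["ENSMUSG00000097075", "126971720", "lincRNA"],
    ["ENSMUSG00000028255", "145003817", "processed_transcript"],
]

# Plain sorted(): lists compare column by column, left to right.
by_all_columns = sorted(rows)

# itemgetter with several indices returns a tuple, so only the named
# columns take part in the comparison (here: column 2, then column 0).
by_type_then_id = sorted(rows, key=itemgetter(2, 0))

for row in by_all_columns:
    print(row)
```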
Solution 4
You can write a function that takes a filename, delimiter and column to sort by, using csv.reader to parse the file:
from operator import itemgetter
import csv

def sort_by(fle, col, delim):
    with open(fle, newline="") as f:
        # csv.reader takes the separator as `delimiter`
        r = csv.reader(f, delimiter=delim)
        for row in sorted(r, key=itemgetter(col)):
            yield row

for row in sort_by("your_file", 2, "\t"):
    print(row)
Solution 5
You can do this quickly with pandas as follows, with the data file set up exactly as you show it (i.e., with variable spaces as separators):
import pandas as pd

df = pd.read_csv('csvdata.csv', sep=' ', skipinitialspace=True, header=None)
# DataFrame.sort() was removed from pandas; sort_values() replaces it
df.sort_values(by=[0], inplace=True)
df.to_csv('sorted_csvdata.csv', header=False, index=False)
Just to check the result:
with open('sorted_csvdata.csv', 'r') as f:
    print(f.read())
ENSMUSG00000028255,145003817,145032776,3,protein_coding
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000077677,101186764,101186867,4,snRNA
ENSMUSG00000088009,83405631,83405764,14,snoRNA
ENSMUSG00000092727,68990574,68990678,11,miRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000098481,38086202,38086317,13,miRNA
ENSMUSG00000098737,95734911,95734973,3,miRNA
You can sort on multiple columns by adding more column labels to the list passed to the sort call (columns=[...] in the old df.sort, by=[...] in df.sort_values).
Tiana
Updated on July 05, 2022
This seems a very basic question, but I am new to python, and after spending a long time trying to find a solution on my own, I thought it's time to ask some more advanced people!
So, I have a file (sample):
ENSMUSG00000098737 95734911 95734973 3 miRNA
ENSMUSG00000077677 101186764 101186867 4 snRNA
ENSMUSG00000092727 68990574 68990678 11 miRNA
ENSMUSG00000088009 83405631 83405764 14 snoRNA
ENSMUSG00000028255 145003817 145032776 3 protein_coding
ENSMUSG00000028255 145003817 145032776 3 processed_transcript
ENSMUSG00000028255 145003817 145032776 3 processed_transcript
ENSMUSG00000098481 38086202 38086317 13 miRNA
ENSMUSG00000097075 126971720 126976098 7 lincRNA
ENSMUSG00000097075 126971720 126976098 7 lincRNA
and I need to write a new file with all the same information, but sorted by the first column.
What I use so far is :
lines = open(my_file, 'r').readlines()
output = open("intermediate_alphabetical_order.txt", 'w')
for line in sorted(lines, key=itemgetter(0)):
    output.write(line)
output.close()
It doesn't raise any error, but it just writes the output file exactly as the input file.
I know it is certainly a very basic mistake, but it would be amazing if some of you could tell me what I'm doing wrong!
Thanks a lot!
Edit
I am having trouble with the way I open the file, so the answers concerning already opened arrays don't really help.