Sort a file by the first (or second, or any other) column in Python


Solution 1

The problem you're having is that you're not turning each line into a list. When you read in the file, you're just getting the whole line as a string. You're then sorting by the first character of each line, and this is always the same character in your input, 'E'.
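
A quick way to see this, using two lines from the sample data (a minimal illustration only; the variable names are made up for this sketch):

```python
from operator import itemgetter

lines = [
    "ENSMUSG00000098737  95734911    95734973    3   miRNA\n",
    "ENSMUSG00000077677  101186764   101186867   4   snRNA\n",
]

# On a string, itemgetter(0) returns the first character, not the first column.
first_char_keys = [itemgetter(0)(line) for line in lines]

# Every key is equal and sorted() is stable, so the order does not change.
unchanged = sorted(lines, key=itemgetter(0))
```

Because all keys compare equal, the output file ends up in the same order as the input.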

To just sort by the first column, you need to split the first block off and just read that section. So your key should be this:

for line in sorted(lines, key=lambda line: line.split()[0]):

split will turn your line into a list, and then the first column is taken from that list.
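
Put together, the whole task might look like the sketch below. The helper names sort_lines_by_column and sort_file_by_column are hypothetical, and the code assumes every line has at least col + 1 whitespace-separated fields (a blank line would raise an IndexError).

```python
def sort_lines_by_column(lines, col=0):
    """Return the lines sorted by the whitespace-separated column `col`."""
    return sorted(lines, key=lambda line: line.split()[col])


def sort_file_by_column(in_path, out_path, col=0):
    """Read in_path, sort its lines by column `col`, and write them to out_path."""
    with open(in_path) as src:
        lines = src.readlines()
    with open(out_path, "w") as dst:
        dst.writelines(sort_lines_by_column(lines, col))


# In-memory demonstration with three lines from the sample data.
sample = [
    "ENSMUSG00000098737  95734911    95734973    3   miRNA\n",
    "ENSMUSG00000077677  101186764   101186867   4   snRNA\n",
    "ENSMUSG00000028255  145003817   145032776   3   protein_coding\n",
]
sorted_sample = sort_lines_by_column(sample)
```

Each line is written back unchanged; only the key used for ordering comes from the split.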

Solution 2

If your input file is tab-separated, you can also use the csv module.

import csv
from operator import itemgetter

with open("t.txt", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    for line in sorted(reader, key=itemgetter(0)):
        print(line)

This sorts by the first column.

Change the number in

key=itemgetter(0)

for sorting by a different column.
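
One thing to keep in mind when picking a numeric column: csv.reader returns every field as a string, so the sort is alphabetical unless the key converts the value. A small sketch, using io.StringIO as a stand-in for a tab-separated file so that it runs on its own:

```python
import csv
import io

# Stand-in for a tab-separated file (assumption: real data would come from open()).
data = io.StringIO(
    "ENSMUSG00000098737\t95734911\t95734973\t3\tmiRNA\n"
    "ENSMUSG00000077677\t101186764\t101186867\t4\tsnRNA\n"
)
rows = list(csv.reader(data, delimiter="\t"))

# Text comparison: "101186764" sorts before "95734911" because "1" < "9".
as_text = sorted(rows, key=lambda row: row[1])

# Numeric comparison: convert the field to int inside the key.
as_numbers = sorted(rows, key=lambda row: int(row[1]))
```

For the first column in the question this makes no difference, since the identifiers all have the same length and compare correctly as text.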

Solution 3

Same idea as SuperBiasedMan, but I prefer this approach: if you want a different sort order (for example, sort by the second column when the first matches, then by the third, and so on), it is easier to implement.

from operator import itemgetter

with open(my_file) as f:
    lines = [line.split(' ') for line in f]
output = open("result.txt", 'w')

# The key argument belongs to sorted(), not to write().
for line in sorted(lines, key=itemgetter(0)):
    output.write(' '.join(line))

output.close()
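
The "first column, then another column" ordering mentioned above can be sketched with a tuple key. The rows here are already split into lists, and the choice of the last column as tie-breaker is only an example:

```python
rows = [
    ["ENSMUSG00000097075", "126971720", "126976098", "7", "lincRNA"],
    ["ENSMUSG00000028255", "145003817", "145032776", "3", "protein_coding"],
    ["ENSMUSG00000028255", "145003817", "145032776", "3", "processed_transcript"],
]

# Tuples compare element by element: first column, then the last column as a tie-breaker.
by_id_then_type = sorted(rows, key=lambda row: (row[0], row[4]))
```

Adding more elements to the tuple adds further tie-breakers in the same way.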

Solution 4

You can write a function that takes a filename, delimiter and column to sort by using csv.reader to parse the file:

import csv
from operator import itemgetter

def sort_by(fle, col, delim):
    # Yield the rows of the file sorted by the zero-based column index col.
    with open(fle, newline="") as f:
        r = csv.reader(f, delimiter=delim)
        for row in sorted(r, key=itemgetter(col)):
            yield row

for row in sort_by("your_file", 2, "\t"):
    print(row)

Solution 5

You can do this quickly with pandas as follows, with the data file set up exactly as you show it (i.e., with variable spaces as separators):

import pandas as pd
df = pd.read_csv('csvdata.csv', sep=' ', skipinitialspace=True, header=None)
# sort_values replaces the older df.sort(columns=[0]), which current pandas no longer provides
df.sort_values(by=[0], inplace=True)
df.to_csv('sorted_csvdata.csv', header=None, index=None)

Just to check the result:

with open('sorted_csvdata.csv', 'r') as f:
    print(f.read())

ENSMUSG00000028255,145003817,145032776,3,protein_coding
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000028255,145003817,145032776,3,processed_transcript
ENSMUSG00000077677,101186764,101186867,4,snRNA
ENSMUSG00000088009,83405631,83405764,14,snoRNA
ENSMUSG00000092727,68990574,68990678,11,miRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000097075,126971720,126976098,7,lincRNA
ENSMUSG00000098481,38086202,38086317,13,miRNA
ENSMUSG00000098737,95734911,95734973,3,miRNA

You can sort by multiple columns by adding more column labels to the list passed to the sort call.

Author by

Tiana

Updated on July 05, 2022

Comments

  • Tiana
    Tiana almost 2 years

    This seems like a very basic question, but I am new to Python, and after spending a long time trying to find a solution on my own, I thought it was time to ask some more advanced people!

    So, I have a file (sample):

    ENSMUSG00000098737  95734911    95734973    3   miRNA
    ENSMUSG00000077677  101186764   101186867   4   snRNA
    ENSMUSG00000092727  68990574    68990678    11  miRNA
    ENSMUSG00000088009  83405631    83405764    14  snoRNA
    ENSMUSG00000028255  145003817   145032776   3   protein_coding
    ENSMUSG00000028255  145003817   145032776   3   processed_transcript
    ENSMUSG00000028255  145003817   145032776   3   processed_transcript
    ENSMUSG00000098481  38086202    38086317    13  miRNA
    ENSMUSG00000097075  126971720   126976098   7   lincRNA
    ENSMUSG00000097075  126971720   126976098   7   lincRNA
    

    and I need to write a new file with all the same information, but sorted by the first column.

    What I use so far is :

    lines = open(my_file, 'r').readlines()
    output = open("intermediate_alphabetical_order.txt", 'w')
    
    for line in sorted(lines, key=itemgetter(0)):
        output.write(line)
    
    output.close()
    

    It doesn't raise any error, but it writes the output file in exactly the same order as the input file.

    I know it is certainly a very basic mistake, but it would be amazing if some of you could tell me what I'm doing wrong!

    Thanks a lot!

    Edit

    I am having trouble with the way I open the file, so the answers concerning already opened arrays don't really help.