Python: How to read a file and store certain columns in an array

Solution 1

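Read the file line by line, split each line on whitespace, then append the last field to target and the remaining fields to data: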

data = []
target = []
with open('faban.txt') as fobj:
    for line in fobj:
        row = line.split()
        data.append(row[:-1])
        target.append(row[-1])

Now:

>>> data
[['faban', '1', '0', '0.288'],
 ['faban', '2', '0', '0.243'],
 ['simulated', '1', '0', '0.159'],
 ['faban', '1', '1', '0.189']]

>>> target
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']
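The question mentions that the columns have different data types, while the loop above keeps every field as a string. Here is a minimal sketch of one way to convert the fields while reading. It assumes the layout of the sample data (text, int, int, float, label). The names `CONVERTERS` and `parse_rows` are hypothetical and not part of the original answer.

```python
import io

# Assumed types for the four feature columns in the sample data.
CONVERTERS = (str, int, int, float)


def parse_rows(lines):
    """Split whitespace-separated lines into typed feature rows and labels."""
    data, target = [], []
    for line in lines:
        row = line.split()
        if not row:  # skip blank lines
            continue
        *features, label = row
        data.append([convert(value) for convert, value in zip(CONVERTERS, features)])
        target.append(label)
    return data, target


# In-memory stand-in for the file; with a real file, pass the open file object.
sample = io.StringIO(
    "faban       1   0   0.288   withspy\n"
    "faban       2   0   0.243   withoutspy\n"
)
data, target = parse_rows(sample)
print(data)    # [['faban', 1, 0, 0.288], ['faban', 2, 0, 0.243]]
print(target)  # ['withspy', 'withoutspy']
```

If the column layout differs from the sample, the converter tuple needs to be adjusted to match.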

Solution 2

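I think numpy has a clean, easy solution here: loadtxt reads the file into a 2-D array of strings, and array_split with the index [-1] along axis=1 splits off the last column.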

>>> import numpy as np
>>> data, target = np.array_split(np.loadtxt('file', dtype=str), [-1], axis=1)

results in:

>>> data.tolist()
[['faban', '1', '0', '0.288'], 
 ['faban', '2', '0', '0.243'], 
 ['simulated', '1', '0', '0.159'], 
 ['faban', '1', '1', '0.189']]
>>> target.flatten().tolist()
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']
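As an alternative to array_split, plain slicing should give the same split, with target coming out one-dimensional so no flatten() is needed. This is only a sketch: the in-memory `sample` stands in for the real file, and the variable name `table` is an assumption for illustration.

```python
import io

import numpy as np

# In-memory stand-in for the file; a path string works the same way.
sample = io.StringIO(
    "faban 1 0 0.288 withspy\n"
    "faban 2 0 0.243 withoutspy\n"
)
table = np.loadtxt(sample, dtype=str)  # 2-D array of strings

data = table[:, :-1]   # every column except the last
target = table[:, -1]  # last column as a 1-D array

print(data.shape, target.shape)  # (2, 4) (2,)
print(target.tolist())           # ['withspy', 'withoutspy']
```

Note that every value is still a string here, because the array has a single dtype.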

Solution 3

You can do this with pandas: use read_table to read the file, iloc to select the columns, values to get the underlying numpy array from the DataFrame, and tolist to convert that array to a list:

import pandas as pd
# sep=r'\s+' splits on any run of whitespace (delim_whitespace is deprecated in recent pandas)
df = pd.read_table('path_to_your_file', sep=r'\s+', header=None)
print(df)
           0  1  2      3           4
0      faban  1  0  0.288     withspy
1      faban  2  0  0.243  withoutspy
2  simulated  1  0  0.159  withoutspy
3      faban  1  1  0.189  withoutspy


data = df.iloc[:,:-1].values.tolist()
target = df.iloc[:,-1].tolist()

print(data)
[['faban', 1, 0, 0.28800000000000003],
 ['faban', 2, 0, 0.243],
 ['simulated', 1, 0, 0.159],
 ['faban', 1, 1, 0.18899999999999997]]

print(target)
['withspy', 'withoutspy', 'withoutspy', 'withoutspy']
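If the per-column types matter, one option is to skip the tolist conversion and keep the pandas objects, since each DataFrame column keeps its own inferred type. The sketch below assumes an in-memory `sample` in place of the real file and uses read_csv with `sep=r"\s+"` to split on whitespace.

```python
import io

import pandas as pd

# In-memory stand-in for the file; a path string works the same way.
sample = io.StringIO(
    "faban 1 0 0.288 withspy\n"
    "faban 2 0 0.243 withoutspy\n"
)
df = pd.read_csv(sample, sep=r"\s+", header=None)

data = df.iloc[:, :-1]   # DataFrame; each column keeps its inferred dtype
target = df.iloc[:, -1]  # Series holding the label column

print(data.dtypes.tolist())  # one dtype per feature column
print(target.tolist())       # ['withspy', 'withoutspy']
```

The integer and float columns can then be used numerically without any manual conversion.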
Author by

SaadH

Updated on June 16, 2022

Comments

  • SaadH
    SaadH almost 2 years

    I am reading a whitespace-separated dataset from a file. I need to store every column except the last one in the array data, and the last column in the array target.

    Can you guide me on how to proceed?

    This is what I have so far:

    with open(filename) as f:
        data = f.readlines()
    

    Or should I read the file line by line?

    PS: The columns also have different data types.

    Edit: Sample Data

    faban       1   0   0.288   withspy
    faban       2   0   0.243   withoutspy
    simulated   1   0   0.159   withoutspy
    faban       1   1   0.189   withoutspy