Python: Fast and efficient way of writing large text file

Solution 1

Seems like Pandas might be a good tool for this problem. It's pretty easy to get started with, it handles most of the ways you might need to get data into Python, and it copes well with mixed data (floats, ints, strings), usually detecting the column types on its own.

Once you have an (R-like) data frame in pandas, it's pretty straightforward to output the frame to csv.

DataFrame.to_csv(path_or_buf, sep='\t')

There are plenty of other options for tuning the output so your tab-separated file comes out just right; see the documentation:

http://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html
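
As a rough sketch (the column names, values, and output file name below are made up for illustration), writing a mixed-type frame as a tab-separated file might look something like this:

import pandas as pd

# Hypothetical mixed-type table: strings, ints, and floats in separate columns
df = pd.DataFrame({
    "name": ["alpha", "beta", "gamma"],
    "count": [12, 7, 31],
    "score": [0.91, 0.73, 0.88],
})

# Write a tab-separated file; index=False omits the row-index column
df.to_csv("output.tsv", sep="\t", index=False)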

Solution 2

Unless you are running into a performance issue, you can probably write to the file line by line. Python internally uses buffering and will likely give you a nice compromise between performance and memory efficiency.

Python buffering is different from OS buffering and you can specify how you want things buffered by setting the buffering argument to open.
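
For instance, a minimal sketch of the line-by-line approach (the file name, rows, and buffer size below are hypothetical; the explicit buffering argument is optional, and the default is usually fine):

rows = [(1, 0.5, "foo"), (2, 1.25, "bar")]  # hypothetical mixed-type rows

# buffering accepts an integer > 1 as a buffer size in bytes; omit it to keep the default
with open("output.tsv", "w", buffering=1024 * 1024) as fh:
    for row in rows:
        fh.write("\t".join(str(field) for field in row) + "\n")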


Author by Misconstruction (Bioinformatician)

Updated on June 04, 2022

Comments

  • Misconstruction almost 2 years

    I have a speed/efficiency-related question about Python:

    I need to write a large number of very large R dataframe-ish files, roughly 0.5-2 GB each. Each is basically a large tab-separated table, where every line can contain floats, integers and strings.

    Normally, I would just put all my data into a NumPy array and use np.savetxt to save it, but since the columns have different data types, it can't really be put into a single array.

    Therefore I have resorted to simply assembling the lines as strings manually, but this is a tad slow. So far I'm doing:

    1) Assemble each line as a string
    2) Concatenate all the lines into a single huge string
    3) Write the string to the file

    I have several problems with this:
    1) The large number of string concatenations ends up taking a lot of time
    2) I run out of RAM trying to keep all the strings in memory
    3) ...which in turn leads to more separate file.write calls, which are also very slow.

    So my question is: what is a good routine for this kind of problem? One that balances speed against memory consumption for the most efficient string concatenation and writing to disk.

    ... or maybe this strategy is simply just bad and I should do something completely different?

    Thanks in advance!

  • Misconstruction about 10 years
    I have never tried Pandas, so I will give it a go. However, in some cases the actual text file will be a bit messier than a dataframe.
  • Misconstruction about 10 years
    How can you set the buffering argument?
  • Clarus about 10 years
    open("foo", "wb", 1024) would set a 1k buffer. 0 and 1 have special meaning.
  • Michael Mior about 10 years
    You shouldn't really need to set the size of buffers manually. The defaults work well most of the time.
  • Joan Smith about 10 years
    @Misconstruction, if this answered your question, please mark as answered with the check mark.