Python writelines() and write() huge time difference


Solution 1

file.writelines() expects an iterable of strings. It then proceeds to loop and call file.write() for each string in the iterable. In Python, the method does this:

def writelines(self, lines):
    for line in lines:
        self.write(line)

You are passing in a single large string, and a string is an iterable of strings too. When iterating you get individual characters, strings of length 1. So in effect you are making len(data) separate calls to file.write(). And that is slow, because you are building up a write buffer a single character at a time.
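
You can see this for yourself in the interactive interpreter; iterating over a string yields one-character strings:

>>> list("one\ntwo\n")
['o', 'n', 'e', '\n', 't', 'w', 'o', '\n']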

Don't pass in a single string to file.writelines(). Pass in a list or tuple or other iterable instead.

You could send in individual lines, with a newline added, using a generator expression, for example:

myWrite.writelines(line + '\n' for line in new_my_list)

Now, if you could make clean_data() a generator, yielding cleaned lines, you could stream data from the input file, through your data cleaning generator, and out to the output file without using any more memory than is required for the read and write buffers and however much state is needed to clean your lines:

with open(inputPath, 'r+') as myRead, open(outPath, 'w+') as myWrite:
    myWrite.writelines(line + '\n' for line in clean_data(myRead))

In addition, I'd consider updating clean_data() to emit lines with newlines included.
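
A minimal sketch of what such a generator-based clean_data() could look like, with strip() standing in for whatever cleaning your script actually does:

def clean_data(lines):
    # 'lines' can be the open input file; iterating over a file object
    # yields one line at a time, so the whole file is never in memory at once
    for line in lines:
        yield line.strip() + '\n'  # placeholder cleaning; newline added here

with open(inputPath, 'r') as myRead, open(outPath, 'w') as myWrite:
    myWrite.writelines(clean_data(myRead))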

Solution 2

As a complement to Martijn's answer, the best approach would be to avoid building the joined string with join in the first place.

Just pass a generator expression to writelines, adding the newline at the end: no unnecessary memory allocation and no explicit loop (besides the generator itself).

myWrite.writelines("{}\n".format(x) for x in my_list)
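
Side by side, and assuming my_list holds the cleaned lines, the difference is that join builds the whole output as one large string in memory before writing, while the generator expression streams the lines straight into the write buffer:

# builds the entire output in memory first
myWrite.write('\n'.join(my_list) + '\n')

# streams one line at a time; only the write buffer is allocated
myWrite.writelines("{}\n".format(x) for x in my_list)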

Solution 3

The write(arg) method expects a string as its argument, so when it is called it writes that string out directly; that is why it is much faster. writelines(), on the other hand, expects an iterable of strings. Even if you pass it a single string, it treats the argument as an iterable and loops over it, writing one element at a time, which is why it takes so much longer.

Is that clear?

Author: Arjun Balgovind

Updated on June 15, 2022

Comments

  • Arjun Balgovind
    Arjun Balgovind almost 2 years

    I was working on a script which reads a folder of files (each ranging in size from 20 MB to 100 MB), modifies some data in each line, and writes the result back to a copy of the file.

    with open(inputPath, 'r+') as myRead:
         my_list = myRead.readlines()
         new_my_list = clean_data(my_list)
    with open(outPath, 'w+') as myWrite:
         tempT = time.time()
         myWrite.writelines('\n'.join(new_my_list) + '\n')
         print(time.time() - tempT)
    print(inputPath, 'Cleaning Complete.')
    

    On running this code with a 90 MB file (~900,000 lines), it printed 140 seconds as the time taken to write to the file. Here I used writelines(). So I searched for different ways to improve file writing speed, and most of the articles I read said that write() and writelines() should not show any difference since I am writing a single concatenated string. I also checked the time taken for only the following statement:

    new_string = '\n'.join(new_my_list) + '\n'
    

    And it took only 0.4 seconds, so the long time taken was not due to building the joined string. Just to try out write(), I tried this code:

    with open(inputPath, 'r+') as myRead:
         my_list = myRead.readlines()
         new_my_list = clean_data(my_list)
    with open(outPath, 'w+') as myWrite:
         tempT = time.time()
         myWrite.write('\n'.join(new_my_list) + '\n')
         print(time.time() - tempT)
    print(inputPath, 'Cleaning Complete.')
    

    And it printed 2.5 seconds. Why is there such a large difference in the file writing time for write() and writelines() even though it is the same data? Is this normal behaviour or is there something wrong in my code? The output file seems to be the same for both cases, so I know that there is no loss in data.

  • Arjun Balgovind
    Arjun Balgovind almost 7 years
    But it's still a single string, isn't it? It will iterate over 1 value? How will that affect write speed?
  • mgilson
    mgilson almost 7 years
    Yeah, you might want to suggest something like myWrite.writelines(['\n'.join(my_list) + '\n'])
  • Jean-François Fabre
    Jean-François Fabre almost 7 years
    myWrite.writelines('\n'.join(my_list) + '\n') could just be myWrite.writelines("{}\n".format(x) for x in my_list) so that would be even faster; no list to build.
  • Martijn Pieters
    Martijn Pieters almost 7 years
    @Jean-FrançoisFabre: which is why I state to pass in a list or tuple or other iterable. :-)
  • Martijn Pieters
    Martijn Pieters almost 7 years
    @ArjunBalgovind: a single string is an iterable of separate characters.
  • Martijn Pieters
    Martijn Pieters almost 7 years
    @Jean-FrançoisFabre: it may just be a memory-saving measure however, as the buffer still concatenates those lines until it is full. It would help if clean_data() was a generator.
  • Arjun Balgovind
    Arjun Balgovind almost 7 years
    @mgilson myWrite.writelines(['\n'.join(my_list) + '\n']) worked just as well as myWrite.write(). I understand now why writelines was so slow.
  • Arjun Balgovind
    Arjun Balgovind almost 7 years
    Thanks @MartijnPieters, I think I've got a much better understanding of what Python considers iterables now. As of now my clean_data takes a list of all the rows from the input file, makes changes to each row, and returns a list of modified rows. Would it be more efficient to clean each row and write it immediately, or to collect the rows into a list and write them all together as I am currently doing in my code?
  • Martijn Pieters
    Martijn Pieters almost 7 years
    @ArjunBalgovind: it'd be more memory efficient to clean each row as you read it, then use yield to pass on the result to the next step. Memory efficiency can translate into overall performance improvement if the file is large enough (as memory allocations take time too, and you want to avoid memory contention), and I/O slowness smoothes over the performance difference for small files.
  • Arjun Balgovind
    Arjun Balgovind almost 7 years
    So I shouldn't use readlines? So you suggest I change my script to read a line, clean it, and write it to the new file, and repeat this for each line in the input, is that correct?
  • Martijn Pieters
    Martijn Pieters almost 7 years
    @ArjunBalgovind: you can iterate directly over the file object, and efficiently read the file line by line. .readlines() reads the whole file into memory, but if you don't need random access to any given line to do your data cleaning job, that's entirely overkill and a waste of memory.
  • Martijn Pieters
    Martijn Pieters almost 7 years
    @ArjunBalgovind: and by iterating directly over the file, cleaning a single line at a time, then writing it out to the output file, you achieve the memory benefits I mentioned, yes. This is going to be efficient, because both reading and writing uses buffers (provided you don't process things one character at a time).
  • Arjun Balgovind
    Arjun Balgovind almost 7 years
    Thanks a lot for all this help :D
  • BlackJack
    BlackJack almost 7 years
    If the cleaning of a single line doesn't need knowledge from previous lines it's also possible to write a function that cleans a single line and use map(): out_file.writelines(map(clean_line, in_file)). (Assuming clean_line() includes the trailing '\n' in its result.)