Fastest way to write huge data to a file

Solution 1

Two major reasons for the observed slowness:

  • Your while loop is slow: it runs roughly 256,000 times (10,000,000 / 39 iterations).
  • You do not make proper use of I/O buffering. Do not make so many calls to write(): currently, you are calling it roughly 256,000 times, once per tiny chunk.

Create your data in a Python data structure first and call write() only once.

This is faster:

import random
import string
import time

t0 = time.time()
with open("bla.txt", "w") as f:
    f.write(''.join(random.choice(string.ascii_lowercase) for _ in range(10**7)))
d = time.time() - t0
print("duration: %.2f s." % d)

Output: duration: 7.30 s.

Now the program spends most of its time generating the data, i.e. in the random module. You can easily see that by replacing random.choice(string.ascii_lowercase) with e.g. "a". Then the measured time drops to below one second on my machine.

And if you want to get even closer to seeing how fast your machine really is when writing to disk, use Python's fastest (?) way to generate largish data before writing it to disk:

>>> t0 = time.time(); chunk = "a" * 10**7; n = open("bla.txt", "w").write(chunk); d = time.time() - t0; print("duration: %.2f s." % d)
duration: 0.02 s.

Solution 2

You create hundreds of millions of short-lived string objects which you then immediately throw away. In this case, it's probably better to write the strings directly into the file instead of first concatenating them with ''.join(), as sketched below.
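
A minimal sketch of that idea (the sizes here are only illustrative): let the buffered file object collect the pieces, so no large intermediate string is ever built:

import random
import string

with open("test.txt", "w") as f:
    for _ in range(250000):
        # Each 12-character token goes straight to the buffered
        # file object; nothing is accumulated in memory.
        f.write(''.join(random.choice(string.ascii_lowercase + string.digits)
                        for _ in range(12)))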

Solution 3

The while loop under main calls generate_alphanumeric, which picks each of its twelve characters from a freshly generated 24-character string (twelve random ASCII letters followed by twelve random digits). That is basically the same as choosing either a random letter or a random digit twelve times, except that it regenerates the source strings for every character. That's your main bottleneck. This version will make your code one order of magnitude faster:

def generate_alphanumeric(self):
    res = ''
    for _ in range(12):
        # Flip a coin: append either a random letter or a random digit.
        if random.randrange(2):
            res += random.choice(string.ascii_lowercase)
        else:
            res += random.choice(string.digits)
    return res

I'm sure it can be improved upon. I suggest you take your profiler for a spin.
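
For example, with the standard-library cProfile and pstats modules (a minimal sketch; Generator here is the class from the question and must be defined in the same script):

import cProfile
import pstats

# Profile 10,000 calls to the slow method and print the ten most
# expensive functions, sorted by cumulative time.
cProfile.run("g = Generator(); [g.generate_alphanumeric() for _ in range(10000)]", "prof.out")
pstats.Stats("prof.out").sort_stats("cumulative").print_stats(10)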


Comments

  • ajknzhol
    ajknzhol over 3 years

    I am trying to generate random real numbers, integers, alphanumeric and alphabetic strings, and then write them to a file until the file size reaches 10 MB.

    The code is as follows:

    import string
    import random
    import time
    import sys
    
    
    class Generator():
        def __init__(self):
            self.generate_alphabetical_strings()
            self.generate_integers()
            self.generate_alphanumeric()
            self.generate_real_numbers()
    
        def generate_alphabetical_strings(self):
            return ''.join(random.choice(string.ascii_lowercase) for i in range(12))
    
        def generate_integers(self):
            return ''.join(random.choice(string.digits) for i in range(12))
    
        def generate_alphanumeric(self):
            return ''.join(random.choice(self.generate_alphabetical_strings() +
                                         self.generate_integers()) for i in range(12))
    
        def _insert_dot(self, string, index):
            return string[:index].__add__('.').__add__(string[index:])
    
    
        def generate_real_numbers(self):
            rand_int_string = ''.join(random.choice(self.generate_integers()) for i in range(12))
            return self._insert_dot(rand_int_string, random.randint(0, 11))
    
    
    from time import process_time
    import os
    
    a = Generator()
    
    t = process_time()
    inp = open("test.txt", "w")
    lt = 10 * 1000 * 1000
    count = 0
    while count <= lt:
        inp.write(a.generate_alphanumeric())
        count += 39
    inp.close()
    
    elapsed_time = process_time() - t
    print(elapsed_time)
    

    It takes around 225.953125 seconds to complete. How can I improve the speed of this program? Could you provide some insights into the code?

    • dm03514
      dm03514 over 9 years
      Where is the time being spent in your program?
    • ajknzhol
      ajknzhol over 9 years
      @MartijnPieters I tried the same code in Java and it took ~0.93 seconds.
    • ajknzhol
      ajknzhol over 9 years
      The Java program wrote to the disk. I checked the size of the file manually after completion of the process.
  • ajknzhol
    ajknzhol over 9 years
    What do you mean by proper utilization of I/O buffering?
  • Dr. Jan-Philip Gehrcke
    Dr. Jan-Philip Gehrcke over 9 years
    You are writing to disk. Writing to disk is a complex physical and logical process. It involves a lot of mechanics and control. It is much faster to tell the disk "Here, this is 10 MB of data, write it!" than telling it millions of times "Here, this is 1 byte of data, write it!". Therefore, the operating system has a mechanism to "collect" data that a process wants to write to disk before actually saving it to disk. However, if you explicitly tell the operating system to write small portions, then it does it. You are doing so, and this is slow. See my edit.
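    A toy comparison illustrating the point (a sketch; absolute numbers vary by machine):

    import time

    data = "a" * 10**7

    # One write() call per character: large per-call overhead, even
    # though Python's file object still buffers internally.
    t0 = time.time()
    with open("small.txt", "w") as f:
        for ch in data:
            f.write(ch)
    print("per-character writes: %.2f s" % (time.time() - t0))

    # One write() call for the whole 10 MB: the fast path.
    t0 = time.time()
    with open("big.txt", "w") as f:
        f.write(data)
    print("single write: %.2f s" % (time.time() - t0))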
  • Dr. Jan-Philip Gehrcke
    Dr. Jan-Philip Gehrcke over 9 years
    No, this is not the main bottleneck. I agree that his way of generating the data is not optimal, but no no no, this is not the bottleneck of this program. His bottleneck is inefficient I/O.
  • debiatan
    debiatan over 9 years
    The original run time (on my machine) is 0m28.587s. My version takes 0m2.266s. Which other change would you make that had a greater impact?
  • Dr. Jan-Philip Gehrcke
    Dr. Jan-Philip Gehrcke over 9 years
    Remove the while loop, invoke write() only once.
  • debiatan
    debiatan over 9 years
    The majority of the time is already spent in unnecessary random generation. If I'm taking away 92% of the original run-time, how is that not the bottleneck? Once you solve that problem, I'm sure your suggestion will come in handy.
  • Dr. Jan-Philip Gehrcke
    Dr. Jan-Philip Gehrcke over 9 years
    Okay, let us agree on this: his data generation is really inefficient, and his I/O code is really inefficient. Which one of both is the major bottleneck depends on the system (on my SSD-powered system with a good CPU it is the I/O).
  • debiatan
    debiatan over 9 years
    My benchmark has been run on an SSD system, too. And I still don't see how that reduces the cost of an O(n^2) routine down to O(n).
  • Dr. Jan-Philip Gehrcke
    Dr. Jan-Philip Gehrcke over 9 years
    Sorry, I just found that I was all the time considering his generate_alphabetical_strings() method which is not as bad (see my answer below). Indeed, when he uses generate_alphanumeric(), this is his main bottleneck.
  • Aaron Digulla
    Aaron Digulla over 9 years
    @Jan-PhilipGehrcke: Is there a way to create a buffered file writer?
  • Dr. Jan-Philip Gehrcke
    Dr. Jan-Philip Gehrcke over 9 years
    @AaronDigulla: if you do not specify the buffering parameter upon calling Python's open(), a (small) buffer is usually applied, according to the "system default". The size of this buffer is not documented. For some versions of glibc, someone determined it to be 8 kB: stackoverflow.com/a/18194856/145400. For certain applications it makes sense to increase that buffer size with the buffering parameter. No general statement is possible, but benchmarks help. Sometimes it makes sense to explicitly collect data in memory first, via docs.python.org/2/library/stringio.html
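    For illustration, both of those knobs in one sketch (using Python 3's io.StringIO; the sizes are arbitrary):

    import io

    # Collect the output in memory first ...
    buf = io.StringIO()
    for _ in range(1000):
        buf.write("one more row of data\n")

    # ... then hand it to a file opened with an enlarged 1 MiB buffer.
    with open("out.txt", "w", buffering=1024 * 1024) as f:
        f.write(buf.getvalue())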
  • Rishi
    Rishi over 7 years
    Agree with the point in this answer. One important thing to note here is the buffering parameter. For a smaller data set (hundreds or thousands of items, with total data in the KB range) there is no significant difference in performance either way. In my analysis, the time taken by calling write() once per iteration came out the same (measured to millisecond precision) as the time taken by a single write() call. I had buffering set to -1 (the default OS block size).