Read large file in parallel?


Solution 1

There was a blog post series, the "Wide Finder Project", about exactly this several years ago on Tim Bray's site [1]. There you can find a solution [2] by Fredrik Lundh, of ElementTree [3] and PIL [4] fame. I know posting links is generally discouraged on this site, but I think these links give you a better answer than copy-pasting his code would.

[1] http://www.tbray.org/ongoing/When/200x/2007/10/30/WF-Results
[2] http://effbot.org/zone/wide-finder.htm
[3] http://docs.python.org/3/library/xml.etree.elementtree.html
[4] http://www.pythonware.com/products/pil/

Solution 2

It may be possible to parallelize this to speed it up, but doing multiple reads in parallel is unlikely to help.

Your OS is unlikely to usefully do multiple reads in parallel (the exception is with something like a striped RAID array, in which case you still need to know the stride to make optimal use of it).

What you can do is run the relatively expensive string/dictionary/list operations in parallel with the read.

So, one thread reads and pushes (large) chunks onto a synchronized queue, and one or more consumer threads pull chunks from the queue, split them into lines, and populate the dictionary.

(If you go for multiple consumer threads, as Pappnese says, build one dictionary per thread and then join them).
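
To make that concrete, here is a minimal sketch of that producer/consumer layout, under the same assumptions as the question (whitespace-separated key/value lines); the chunk size, the number of consumers, and the helper names are illustrative choices, not part of any particular library:

import sys
import threading
import queue
from collections import defaultdict

CHUNK_SIZE = 1 << 20        # read ~1 MB at a time (illustrative)
NUM_CONSUMERS = 2           # illustrative; tune for your machine
SENTINEL = None             # tells a consumer to stop

def producer(path, q):
    # Read large chunks, keep any trailing partial line, and push
    # newline-terminated chunks onto the synchronized queue.
    leftover = ''
    with open(path) as f:
        while True:
            block = f.read(CHUNK_SIZE)
            if not block:
                break
            block = leftover + block
            cut = block.rfind('\n')
            if cut == -1:
                leftover = block
                continue
            q.put(block[:cut + 1])
            leftover = block[cut + 1:]
    if leftover:
        q.put(leftover)
    for _ in range(NUM_CONSUMERS):
        q.put(SENTINEL)

def consumer(q, d):
    # Pull chunks, split them into lines and fields, and fill a
    # per-thread dictionary.
    while True:
        chunk = q.get()
        if chunk is SENTINEL:
            return
        for line in chunk.splitlines():
            parts = line.split()
            if len(parts) >= 2:
                d[parts[0]].append(parts[1])

q = queue.Queue(maxsize=8)
dicts = [defaultdict(list) for _ in range(NUM_CONSUMERS)]
threads = [threading.Thread(target=consumer, args=(q, dicts[i]))
           for i in range(NUM_CONSUMERS)]
for t in threads:
    t.start()
producer(sys.argv[1], q)
for t in threads:
    t.join()

# Join the per-thread dictionaries at the end.
merged = defaultdict(list)
for d in dicts:
    for key, values in d.items():
        merged[key].extend(values)

Keep in mind that on CPython the consumers still share the GIL, so this mainly overlaps parsing with the I/O; it is not a guaranteed speed-up proportional to the thread count.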


Re. bounty:

C obviously doesn't have the GIL to contend with, so multiple consumers are likely to scale better. The read behaviour doesn't change, though. The downside is that C lacks built-in support for hash maps (assuming you still want a Python-style dictionary) and synchronized queues, so you have to either find suitable components or write your own. The basic strategy of multiple consumers each building their own dictionary and then merging them at the end is still likely the best.

Using strtok_r instead of str.split may be faster, but remember you'll need to manage the memory for all your strings manually too. Oh, and you need logic to handle line fragments too. Honestly, C gives you so many options that I think you'll just need to profile it and see.

Solution 3

It does seem tempting to think that using a processing pool will solve problems like this, but it's going to end up being a good bit more complicated than that, at least in pure Python.

Because the OP mentioned that the lists on each input line would in practice have more than two elements, I made a slightly more realistic input file using:

paste <(seq 20000000) <(seq 2 20000001) <(seq 3 20000002) |
  head -1000000 > largefile.txt

After profiling the original code, I found the slowest portion of the process to be the line-splitting routine. (.split() took approximately 2x longer than .append() on my machine.)

1000000    0.333    0.000    0.333    0.000 {method 'split' of 'str' objects}
1000000    0.154    0.000    0.154    0.000 {method 'append' of 'list' objects}

So I factored the split out into another function and used a pool to distribute the work of splitting the fields:

import sys
import collections
import multiprocessing as mp

d = collections.defaultdict(list)

def split(l):
    # worker function: split one line into its whitespace-separated fields
    return l.split()

pool = mp.Pool(processes=4)
# each input line is pickled, sent to a worker, split there,
# and the resulting list is pickled and sent back
for keys in pool.map(split, open(sys.argv[1])):
    d[keys[0]].append(keys[1:])

Unfortunately, adding the pool slowed things down by more than 2x. The original version looked like this:

$ time python process.py smallfile.txt 
real    0m7.170s
user    0m6.884s
sys     0m0.260s

versus the parallel version :

$ time python process-mp.py smallfile.txt 
real    0m16.655s
user    0m24.688s
sys     0m1.380s

Because the .map() call basically has to serialize (pickle) each input, send it to the remote process, and then deserialize (unpickle) the return value from the remote process, using a pool in this way is much slower. You do get some improvement by adding more cores to the pool, but I'd argue that this is fundamentally the wrong way to distribute this work.

To really speed this up across cores, my guess is that you'd need to read in large chunks of the input using some sort of fixed block size. Then you could send the entire block to a worker process and get serialized lists back (though it's still unknown how much the deserialization here will cost you). Reading the input in fixed-size blocks sounds like it might be tricky with the anticipated input, however, since my guess is that each line isn't necessarily the same length.
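
For what it's worth, here is a rough sketch of that block-based idea, under the same assumptions as above (whitespace-separated fields, one key per line); the block size and pool size are arbitrary, and the cost of pickling the returned lists is exactly the unknown mentioned above:

import sys
import collections
import multiprocessing as mp

BLOCK_SIZE = 4 * 1024 * 1024    # illustrative block size

def read_blocks(path):
    # Yield large text blocks that always end on a line boundary,
    # so no line is split across two workers.
    leftover = ''
    with open(path) as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            block = leftover + block
            cut = block.rfind('\n')
            if cut == -1:
                leftover = block
                continue
            yield block[:cut + 1]
            leftover = block[cut + 1:]
    if leftover:
        yield leftover

def split_block(block):
    # Worker: split a whole block into field lists in one task, so the
    # pickling overhead is paid per block instead of per line.
    return [line.split() for line in block.splitlines()]

if __name__ == '__main__':
    d = collections.defaultdict(list)
    with mp.Pool(processes=4) as pool:
        for rows in pool.imap(split_block, read_blocks(sys.argv[1])):
            for parts in rows:
                d[parts[0]].append(parts[1:])

Note that the dictionary is still built serially in the parent process; only the splitting is farmed out, so whether this wins depends on how the split cost compares to the cost of deserializing the returned lists.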

Solution 4

One thing you could try is to get the line count from the file, then spawn 8 threads that each make a dictionary from 1/8th of the file, and finally join the dictionaries when all the threads are finished. This will probably speed things up if it is the appending that takes the time and not the reading of the lines.
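
Here is a sketch of that split-the-file-by-line-count idea, written with processes rather than threads since (as the comments note) CPython threads would contend for the GIL for CPU-bound work; the worker count and helper names are illustrative:

import sys
import itertools
import collections
import multiprocessing as mp

NUM_WORKERS = 8     # one per core, as suggested

def count_lines(path):
    with open(path) as f:
        return sum(1 for _ in f)

def build_dict(args):
    # Each worker builds its own dictionary from its slice of lines.
    # (islice still has to skim past the earlier lines, so this is a
    # simplification rather than a true seek.)
    path, start, stop = args
    d = collections.defaultdict(list)
    with open(path) as f:
        for line in itertools.islice(f, start, stop):
            parts = line.split()
            d[parts[0]].append(parts[1])
    return d

if __name__ == '__main__':
    path = sys.argv[1]
    n = count_lines(path)
    step = max(1, (n + NUM_WORKERS - 1) // NUM_WORKERS)
    jobs = [(path, i, min(i + step, n)) for i in range(0, n, step)]

    with mp.Pool(NUM_WORKERS) as pool:
        partials = pool.map(build_dict, jobs)

    # Join the dictionaries when all workers are finished.
    merged = collections.defaultdict(list)
    for d in partials:
        for key, values in d.items():
            merged[key].extend(values)

The per-worker dictionaries do have to be pickled back to the parent, which, as the comments point out, can eat a good part of the gain.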

Solution 5

A more radical solution for the slow dictionary appends: replace the dictionary with an array of string pairs. Fill it and then sort it.
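
A sketch of what that could look like in Python, collecting (key, value) pairs into a flat list, sorting once, and then grouping with itertools.groupby instead of appending into a dict as you go:

import sys
from itertools import groupby
from operator import itemgetter

# Fill a flat array of (key, value) string pairs.
pairs = []
with open(sys.argv[1]) as fin:
    for line in fin:
        parts = line.split()
        pairs.append((parts[0], parts[1]))

# Sort once, then group adjacent pairs that share a key.
pairs.sort(key=itemgetter(0))
d = {key: [value for _, value in group]
     for key, group in groupby(pairs, key=itemgetter(0))}

This trades many small list appends inside the dictionary for one O(n log n) sort; whether it is actually faster would need to be measured on the real data.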




Comments

  • graffe
    graffe almost 2 years

    I have a large file which I need to read in and make a dictionary from. I would like this to be as fast as possible. However, my code in Python is too slow. Here is a minimal example that shows the problem.

    First make some fake data

    paste <(seq 20000000) <(seq 2 20000001)  > largefile.txt
    

    Now here is a minimal piece of python code to read it in and make a dictionary.

    import sys
    from collections import defaultdict
    fin = open(sys.argv[1])
    
    dict = defaultdict(list)
    
    for line in fin:
        parts = line.split()
        dict[parts[0]].append(parts[1])
    

    Timings:

    time ./read.py largefile.txt
    real    0m55.746s
    

    However it is possible to read the whole file much faster as:

    time cut -f1 largefile.txt > /dev/null    
    real    0m1.702s
    

    My CPU has 8 cores; is it possible to parallelize this program in Python to speed it up?

    One possibility might be to read in a large chunk of the input, run 8 processes in parallel on different non-overlapping subchunks to build dictionaries in parallel from the data in memory, and then read in another large chunk. Is this possible in Python using multiprocessing somehow?

    Update: the fake data was not very good, as it had only one value per key. Better is:

    perl -E 'say int rand 1e7, $", int rand 1e4 for 1 .. 1e7' > largefile.txt
    

    (Related to Read in large file and make dictionary.)

    • sihrc
      sihrc almost 11 years
      the list is probably slowing you down.
    • graffe
      graffe almost 11 years
      @sihrc Most of the time is spent in dict[parts[0]].append(parts[1]) according to the profiling but append does take around 10% of the time. Note that in this fake case lists only have length 1 but in general I expect them to be longer.
    • sihrc
      sihrc almost 11 years
      That's what I mean. That .append part is slowing you down. Have you thought of alternatives to appending large lists? What kind of data are you appending?
    • Bakuriu
      Bakuriu almost 11 years
      You cannot parallelize I/O with only one disk. If you have much RAM you can read the whole file into memory with readlines() and process them directly in RAM.
    • sihrc
      sihrc almost 11 years
      Yes, but reading the file isn't the slow part. It's appending to the lists in the dictionary, which can be parallelized.
    • eri
      eri almost 11 years
      What if you replace split with a regular expression?
  • Lauritz V. Thaulow
    Lauritz V. Thaulow almost 11 years
    Since this is a CPU-bound problem, threading isn't going to help much on CPython. The normal solution is to use multiprocessing instead, but that does not allow sharing a python dict amongst processes. Therefore there must be a single process with a dict that has the task of adding to it, and other processes for reading and parsing that send along their results using a queue or pipe. I don't think you'll be able to achieve a speedup of more than 3x using this method, because of the overhead.
  • graffe
    graffe almost 11 years
    @LauritzV.Thaulow What about loading the whole thing into memory then building 8 dicts and then merging them? Does that sound sensible?
  • Useless
    Useless almost 11 years
    I've given a couple of hints to get you started. Write some code, and ask if you have trouble with it.
  • graffe
    graffe almost 11 years
    Are multiple threads going to work? I am worried about the GIL.
  • Useless
    Useless almost 11 years
    The producer thread is I/O bound, so if the GIL is released while the thread blocks on the read system call, one consumer thread can make progress computing hashes and updating its dictionary. Multiple consumer threads are very likely to contend for the GIL though.
  • Lauritz V. Thaulow
    Lauritz V. Thaulow almost 11 years
    @Anush Well, the dicts have to be serialized and rebuilt when moved from one process to another, and since the rebuilding will require almost as much work as the building, it defeats much of the purpose. Then the rebuilt dicts must be merged, and here my python knowledge is lacking, because I don't know if d1.update(d2) can do some shortcut that makes it much faster than looping over d2 and adding key-value pairs one by one. In any case I doubt it will help much. I understand IronPython does not have the CPU-bound threading problem, so if you can use that then Useless' answer should work.
  • graffe
    graffe almost 11 years
    I am surprised the dictionary creation wasn't the slowest part for you, as it is in my profiling. About the line length, I think I wasn't clear: each line will not be exactly the same length, but it will split into the same number of parts. The new fake data example I just added is more realistic.
  • eri
    eri almost 11 years
    pool.map(split, open(sys.argv[1]), 10000) will group the lines into chunks of 10000, but the execution time does not differ from the non-chunked version.
  • CpILL
    CpILL about 5 years
    This answer to a similar question might help you: stackoverflow.com/a/43079667/196732 I'm not sure about joining the results together at the other end, though...?
  • DanielTuzes
    DanielTuzes over 3 years
    2 links out of 4 are dead. Now it would make sense to have the content copied here too.
  • CodeNoob
    CodeNoob about 3 years