Read random lines from huge CSV file


Solution 1

import random

filesize = 1500                 #size of the really big file
offset = random.randrange(filesize)

f = open('really_big_file')
f.seek(offset)                  #go to random position
f.readline()                    # discard - bound to be partial line
random_line = f.readline()      # bingo!

# extra to handle last/first line edge cases
if len(random_line) == 0:       # we have hit the end
    f.seek(0)
    random_line = f.readline()  # so we'll grab the first line instead

As @AndreBoos pointed out, this approach will lead to biased selection. If you know the minimum and maximum line lengths, you can remove this bias by doing the following:

Let's assume (in this case) we have min=3 and max=15

1) Find the length (Lp) of the previous line.

Then if Lp = 3, the line is the most biased against, so we should take it 100% of the time. If Lp = 15, the line is the most biased towards, so we should take it only 20% of the time, as it is 5× more likely to be selected.

We accomplish this by randomly keeping the line X% of the time where:

X = min / Lp

If we don't keep the line, we do another random pick until our dice roll comes good. :-)
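Here is a minimal sketch of that rejection step (my own illustration, not code from the answer above). It assumes ASCII data (one byte per character), a file opened in binary mode, and that MIN_LEN and MAX_LEN are the known minimum and maximum line lengths in bytes, newline included. Lp is recovered by scanning back at most MAX_LEN bytes to find where the line containing the random offset starts.

import os
import random

MIN_LEN, MAX_LEN = 3, 15        # assumed known min/max line lengths (bytes, incl. newline)
fname = 'really_big_file'

def random_line_unbiased(fname):
    filesize = os.path.getsize(fname)
    with open(fname, 'rb') as f:
        while True:
            offset = random.randrange(filesize)
            f.seek(offset)
            f.readline()                      # discard - bound to be a partial line
            line = f.readline()               # candidate line
            if len(line) == 0:                # hit the end, wrap to the first line
                f.seek(0)
                line = f.readline()
            # Lp = full length of the line the offset landed in; find its start
            # by scanning back at most MAX_LEN bytes for the previous newline.
            back = max(0, offset - MAX_LEN)
            f.seek(back)
            chunk = f.read(offset - back)
            start = back + chunk.rfind(b'\n') + 1
            f.seek(start)
            lp = len(f.readline())
            # keep the candidate X% of the time where X = min / Lp, else re-roll
            if random.random() < MIN_LEN / float(lp):
                return line.decode()

print(random_line_unbiased(fname))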

Solution 2

I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it

Assuming you don't need exactly 1 million lines and know the number of lines in your CSV file beforehand, you can use reservoir sampling to retrieve your random subset. Simply iterate through your data and, for each line, determine the chances of the line being selected. That way you only need a single pass over your data.

This works well if you need to extract the random samples often but the actual dataset changes infrequently (since you'll only need to keep track of the number of entries each time the dataset changes).

from random import random
import csv

# desired_num_results, total_entries and the open file object are assumed
# to be defined as described above
chances_selected = desired_num_results / float(total_entries)
result = []
for line in csv.reader(file):
    if random() < chances_selected:
        result.append(line)
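The one piece the snippet leaves open is total_entries (georg asks about this in the comments). A minimal sketch, assuming a placeholder file name: pay a one-time counting pass and cache the result while the dataset is unchanged; if all lines are of similar length it can instead be estimated as filesize / length_of_first_line, as georg suggests below.

# one-time counting pass; re-run only when the dataset changes
with open('really_big_file.csv') as f:     # placeholder file name
    total_entries = sum(1 for _ in f)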

Solution 3

You can use a variation of the probabilistic method for choosing a random line in a file.

Instead of keeping just a single number that gets chosen, you can keep a buffer of size C. For each line number n in a file with N lines, you choose that line with probability C/n (rather than the original 1/n). If the line is selected, you then choose a random location in the C-length buffer to evict.

Here's how it works:

import random

C = 2
fpath = 'somelines.txt'
buffer = []

f = open(fpath, 'r')
for line_num, line in enumerate(f):
    n = line_num + 1.0
    r = random.random()
    if n <= C:
        buffer.append(line.strip())       # fill the buffer with the first C lines
    elif r < C/n:
        loc = random.randint(0, C-1)      # with probability C/n, overwrite a
        buffer[loc] = line.strip()        # randomly chosen buffer slot
f.close()
# buffer now holds exactly C randomly chosen lines

This requires a single pass through the file (so it's linear time) and returns exactly C lines from the file. Each line will have probability C/N of being selected.

To verify that the above works, I created a file with 5 lines containing a,b,c,d,e. I ran the code 10,000 times with C=2. This should produce about an even distribution of the 5 choose 2 (so 10) possible choices. The results:

a,b: 1046
b,c: 1018
b,e: 1014
a,c: 1003
c,d: 1002
d,e: 1000
c,e: 993
a,e: 992
a,d: 985
b,d: 947
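For reference, here is a sketch of the kind of verification harness described above (my reconstruction, not the author's original script): it wraps the reservoir loop in a function, writes the 5-line test file, runs 10,000 trials with C=2, and tallies which pair of lines ends up in the buffer.

import random
from collections import Counter

def sample_lines(fpath, C):
    buffer = []
    with open(fpath) as f:
        for line_num, line in enumerate(f):
            n = line_num + 1.0
            if n <= C:
                buffer.append(line.strip())
            elif random.random() < C / n:
                buffer[random.randint(0, C - 1)] = line.strip()
    return buffer

with open('somelines.txt', 'w') as f:      # the 5-line test file: a..e
    f.write('\n'.join('abcde') + '\n')

counts = Counter(tuple(sorted(sample_lines('somelines.txt', 2)))
                 for _ in range(10000))
for pair, count in counts.most_common():
    print(pair, count)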

Solution 4

If you want to grab random lines many times (e.g., mini-batches for machine learning), and you don't mind scanning through the huge file once (without loading it into memory), then you can create a list of line byte offsets and use seek to quickly grab the lines (based on Maria Zverina's answer).

import random

# Overhead:
# Scan the file once and record the byte offset at which each line starts.
# (If the lines are long, this list takes substantially less memory than the
# file itself.)  The one-liner keeps a running total of line lengths; it
# assumes 1 byte per character (e.g. ASCII), so len() matches byte offsets.
fname = 'big_file'
s = [0]
linelocs = [s.append(s[0]+len(n)) or s.pop(0) for n in open(fname)]
f = open(fname) # Reopen the file.

# Each subsequent iteration uses only the code below:
# Grab a 1,000,000 line sample
# I sorted these because I assume the seeks are faster that way.
chosen = sorted(random.sample(linelocs, 1000000))
sampleLines = []
for offset in chosen:
    f.seek(offset)
    sampleLines.append(f.readline())
# Now we can randomize the order of the sample if need be.
random.shuffle(sampleLines)

Solution 5

If the lines are truly in .csv format and NOT fixed-field, then no, there isn't. You can crawl through the file once, indexing the byte offsets of each line, and then later use only that index, but there is no way to predict a priori the exact location of the line-terminating \n character for an arbitrary csv file.
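For contrast, here is a sketch of the fixed-field case the answer alludes to (my own illustration, assuming a hypothetical rewritten copy of the data in which every line occupies exactly RECORD_LEN bytes, newline included): a random line then costs exactly one seek.

import os
import random

RECORD_LEN = 16                      # assumed fixed record width in bytes, newline included
fname = 'fixed_width_copy.csv'       # hypothetical fixed-width rewrite of the data

def random_record(fname, record_len):
    num_records = os.path.getsize(fname) // record_len
    with open(fname, 'rb') as f:
        f.seek(record_len * random.randrange(num_records))   # jump straight to one line
        return f.read(record_len).decode().rstrip('\n')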


Comments

  • jbssm
    jbssm over 2 years

    I have this quite big CSV file (15 Gb) and I need to read about 1 million random lines from it. As far as I can see - and implement - the CSV utility in Python only allows iterating sequentially through the file.

    It's very memory-consuming to read the whole file into memory in order to use some random choosing, and it's very time-consuming to go through the whole file and discard some values and choose others, so is there any way to choose some random line from the CSV file and read only that line?

    I tried without success:

    import csv
    
    with open('linear_e_LAN2A_F_0_435keV.csv') as file:
        reader = csv.reader(file)
        print reader[someRandomInteger]
    

    A sample of the CSV file:

    331.093,329.735
    251.188,249.994
    374.468,373.782
    295.643,295.159
    83.9058,0
    380.709,116.221
    352.238,351.891
    183.809,182.615
    257.277,201.302
    61.4598,40.7106
    
  • Andrew Buss
    Andrew Buss almost 12 years
    Clever, but this will provide biased results on files with variable-length lines.
  • parselmouth
    parselmouth almost 12 years
    because the first readline will almost always land in the middle of a line.
  • Andrew Buss
    Andrew Buss almost 12 years
    @thg435: That does not fix anything. A line after a long line will be disproportionately represented compared to one after a short line. Furthermore, the first line will never be read.
  • Maria Zverina
    Maria Zverina almost 12 years
    Actually AndreBoos is correct :) ... really long lines are more likely to get selected than really short lines ... but if you are sampling a 15GB file ... maybe you are willing to accept this bias?
  • Andrew Buss
    Andrew Buss almost 12 years
    I was afraid that this was the case, unless we know more about the possible values of each line.
  • jbssm
    jbssm almost 12 years
    @Maria thanks, this works, but like Andre says, it has some statistical bias towards big lines.
  • Stephen Paulger
    Stephen Paulger almost 12 years
    +1 for fixing the last line bug before I could post a comment ;)
  • Maria Zverina
    Maria Zverina almost 12 years
    I have added a few extra lines so that the first line will also be selected and you won't ever go over the end of the file. The critique of biased selection still holds - if you absolutely need unbiased selection then you'll need to rewrite the file in fixed record format.
  • georg
    georg almost 12 years
    Yes, reservoir sampling. But how do they find total_entries?
  • Shawn Chin
    Shawn Chin almost 12 years
    @thg435 hence the statement "... and know the number of lines in your CSV file". Such a scheme works if you're doing the sampling often and you need only count the dataset size once.
  • Shawn Chin
    Shawn Chin almost 12 years
    @thg435 And thanks for the proper term. I couldn't for the life of me recall it.
  • georg
    georg almost 12 years
    @AndreBoos: Looking at the sample OP provided, I don't think there will be issues. In any case, seek is the only way to achieve what they want in one pass.
  • georg
    georg almost 12 years
    aha, didn't notice that. BTW, if all lines are of similar length, you can estimate total_entries = filesize / length_of_first_line
  • Andrew Buss
    Andrew Buss almost 12 years
    thg435: The sampling will remain biased, although it may not be a problem, depending on jbssm's purposes. Two passes is not an unreasonable approach, however.
  • Andrew Buss
    Andrew Buss almost 12 years
    Using for line in CSV result will load the file into memory anyway, which should be avoided.
  • Maria Zverina
    Maria Zverina almost 12 years
    I believe that you can do unbiased selection if you know the minimum length of the line - as discussed above. But the code will be left as an exercise to the reader :-)
  • jbssm
    jbssm almost 12 years
    But that way I still have to read the whole file up to the line I actually want to read.
  • Thomas K
    Thomas K almost 12 years
    @jbssm: You have to run through the entire file, yes, but you don't have to load it all into memory.
  • Shawn Chin
    Shawn Chin almost 12 years
    Will work, but with the additional memory overheads of a million numbers. And you'll still need to make a pass through the data file.
  • Thomas K
    Thomas K almost 12 years
    Yep. That's about 30MB as a set on my computer, so it's probably not a deal breaker.
  • jbssm
    jbssm almost 12 years
    I understand your point and it looks valid, but the conversion routine you gave me doesn't work. Looking at my file, all values have 7 chars (counting the .), except for the 0 values. So I actually only need to transform the 0 into 0.00000 and it should work.
  • jbssm
    jbssm almost 12 years
    @MariaZverina I think that to make an unbiased selection you need to know not only the minimum length of the line, but also the frequency of each line length in the file. Anyway, in the file, the only values that have less than 7 characters are the 0. I will try to convert my file to have 0 replaced by 0.00000 and then your routine should work perfectly.
  • Andrew Buss
    Andrew Buss almost 12 years
    Oh, I see; that makes sense. Do you have any control over the program that outputs that data? It's probably possible to change the output format to something more regular.
  • jbssm
    jbssm almost 12 years
    I do, but it's written in C++ and after much searching I realized there is no way to do that with printf for every value. So I will have to do the conversion in python probably.
  • jbssm
    jbssm almost 12 years
    Hi, thanks. But that way you have to read the whole file first, and that takes a long time for this huge file.
  • jbssm
    jbssm almost 12 years
    You still have to iterate over each line this way to get to a line your random variable tells you to read. After thinking a bit more about it, the only way possible without iterating through the whole file is by using Maria Zverina's method and making sure all lines have an equal number of chars.
  • Mark Ransom
    Mark Ransom almost 12 years
    Congratulations for figuring out how to do it in one pass.
  • Mark Ransom
    Mark Ransom almost 12 years
    @jbssm, if each line has an equal number of characters this becomes trivial - just multiply the random line number times the line size, and seek to that point in the file.
  • jbssm
    jbssm almost 12 years
    Yes, I will write some file conversion program and then use that way then. Thanks.
  • Mark Ransom
    Mark Ransom almost 12 years
    Use random.sample to create lines_to_grab.
  • jterrace
    jterrace almost 12 years
    @jbssm You have to read through the entire file at least once to get an unbiased result.
  • parselmouth
    parselmouth almost 12 years
    @MariaZverina, agree with jbssm's comment. To correct for the bias, you would need to know the distribution of the line lengths, then normalize, accordingly. Your normalization technique works if the distribution of line lengths 3 ... 15 is uniform, but to KNOW that this distribution is in fact uniform, we would need to read the whole file. Which puts us back to the same starting square that if we're going to read the file at least once, why not write it back out into something more amenable to fast manipulation in the future (e.g. in fixed field or as an sqlite3 table, etc.)
  • Thomas K
    Thomas K almost 12 years
    @MarkRansom: Thanks, I've added that detail.
  • Maria Zverina
    Maria Zverina almost 12 years
    @parselmouth I beg to differ ... it will work regardless of uniformity. The length of the preceding line gives the probability that the line itself will be selected, and it is Lp/F (F = file length). Hence if you normalise the chance of each line getting selected to min/F you're done. Consider the extreme case of 1 line with 3 chars while all remaining lines have 15. Then with my scheme each line has a 3/F chance of getting selected. Line lengths are not uniformly distributed, yet each line has an equal chance of being selected, as only 20% of the advantaged lines will be selected.
  • Maria Zverina
    Maria Zverina almost 12 years
    It should be doable; try something like: "%09.4f" % (1/3.0) .... That will give you four fixed decimal places ... and will not wrap for any number under 10000.0
  • Maria Zverina
    Maria Zverina almost 12 years
    @jbssm also consider exponential format: "%.4e" % (1/3.0) this will give you five significant digits and a fixed width of 10 chars. Unless you need to format negative numbers as well .... in which case use the + modifier for printf. And reading exponentially formatted floats is just as easy as reading basic floats in python.
  • parselmouth
    parselmouth almost 12 years
    @MariaZverina ...Ok...I "did the math", and the probabilities do work out to be uniform using your second step filter. I stand corrected. (takes his hat off and bows to you)
  • jbssm
    jbssm almost 12 years
    Thank you Drew, it seems the most advanced solution, but unfortunately I won't be the only one using these numbers to do science and I'm really certain most of the other people will have no idea what an SQL database is, much less how to use one.
  • Causality
    Causality over 11 years
    @ShawnChin: there is no "additional memory overheads of a million numbers". You never need to store it in memory. It is I/O the file twice, which should not be a problem in many cases.
  • Causality
    Causality over 11 years
    @ThomasK, the code should be random.sample(xrange(N), K), to sample K numbers (no replicates) in range 0-(N-1).
  • Shawn Chin
    Shawn Chin over 11 years
    @causality you'd need to store lines_to_grab in memory. How else would you do if i in lines_to_grab?
  • Thomas K
    Thomas K over 11 years
    @Causality: Well spotted, I've fixed the brackets.
  • Maria Zverina
    Maria Zverina almost 11 years
    Stanislav Yaglo (@stanislav-yaglo) came up with an elegant solution that will work when the number of lines to be selected is relatively small. You can achieve a random unbiased line by repeatedly seeking to a random point in the file until you hit a carriage return. Then simply read off the following line. :)
  • Walker Hale IV
    Walker Hale IV over 8 years
    Note that seeking the file in random order will thrash the disk (not as much of a problem with SSD storage, but still unfriendly to some storage backends). When collecting many sample lines, it makes sense to collect batches of random offsets and sort the batches before fetching the lines. That way the backend storage sees a more linear progression of access. For a million lines, most machines will easily have enough memory to do this with just one batch of sorted random offsets.
  • Marctar
    Marctar about 8 years
    Just noticed that this is the code for what @parselmouth already described.
  • ygesher
    ygesher about 7 years
    Based on your answer, I built a little package for reading arbitrary lines in a file. Check it out here
  • Marctar
    Marctar about 7 years
    Cool. I actually needed this again, so I'll use your package. Thanks!
  • ygesher
    ygesher about 7 years
    since my last comment I made a pip package out of it: random-access-file-reader
  • Jonathan Leffler
    Jonathan Leffler about 6 years
    Python depends on indentation; your code in the innermost loop (for line in reader) is a Python indentation error. It isn't clear whether Num.remove(Num[0]) should be indented or not — the a.writerow(line) must be indented, as must the break.
  • iouvxz
    iouvxz about 3 years
    @AndrewBuss There is, check out my answer stackoverflow.com/questions/10819911/…