How to obtain the total number of rows from a CSV file in Python?

292,420

Solution 1

You need to count the number of rows:

row_count = sum(1 for row in fileObject)  # fileObject is your csv.reader

Using sum() with a generator expression makes for an efficient counter, avoiding storing the whole file in memory.

If you already read 2 rows to start with, then you need to add those 2 rows to your total; rows that have already been read are not being counted.
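As a minimal sketch of that last point, the example below reads two rows first and then counts the rest. The in-memory sample data and the names already_read and remaining are assumptions made for illustration; they are not from the question.

```python
import csv
import io

# Hypothetical in-memory CSV standing in for the real file.
sample = io.StringIO("id,name\n1,a\n2,b\n3,c\n4,d\n")

reader = csv.reader(sample)

# Read the first two rows up front, as in the question.
already_read = [next(reader) for _ in range(2)]

# The generator only sees rows that have not been consumed yet.
remaining = sum(1 for row in reader)

# Add back the rows that were read before counting started.
row_count = len(already_read) + remaining
print(row_count)
```

Here the counter alone reports 3, and adding the 2 rows read earlier gives the full total of 5.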

Solution 2

2018-10-29 EDIT

Thank you for the comments.

I benchmarked several ways of counting the lines in a CSV file. The fastest method was:

with open(filename) as f:
    row_count = sum(1 for line in f)

Here is the code tested.

import timeit
import csv
import pandas as pd

filename = './sample_submission.csv'

def talktime(filename, funcname, func):
    print(f"# {funcname}")
    t = timeit.timeit(f'{funcname}("{filename}")', setup=f'from __main__ import {funcname}', number = 100) / 100
    print('Elapsed time : ', t)
    print('n = ', func(filename))
    print('\n')

def sum1forline(filename):
    with open(filename) as f:
        return sum(1 for line in f)
talktime(filename, 'sum1forline', sum1forline)

def lenopenreadlines(filename):
    with open(filename) as f:
        return len(f.readlines())
talktime(filename, 'lenopenreadlines', lenopenreadlines)

def lenpd(filename):
    # pandas treats the first line as the header, so add 1 to count it too
    return len(pd.read_csv(filename)) + 1
talktime(filename, 'lenpd', lenpd)

def csvreaderfor(filename):
    cnt = 0
    with open(filename) as f:
        cr = csv.reader(f)
        for row in cr:
            cnt += 1
    return cnt
talktime(filename, 'csvreaderfor', csvreaderfor)

def openenum(filename):
    cnt = 0
    with open(filename) as f:
        for i, line in enumerate(f,1):
            cnt += 1
    return cnt
talktime(filename, 'openenum', openenum)

The results were:

# sum1forline
Elapsed time :  0.6327946722068599
n =  2528244


# lenopenreadlines
Elapsed time :  0.655304473598555
n =  2528244


# lenpd
Elapsed time :  0.7561274056295324
n =  2528244


# csvreaderfor
Elapsed time :  1.5571560935772661
n =  2528244


# openenum
Elapsed time :  0.773000013928679
n =  2528244

In conclusion, sum(1 for line in f) was the fastest, although the difference from len(f.readlines()) may not be significant.

sample_submission.csv is 30.2MB and has 31 million characters.

Solution 3

You can do it with a few lines of code like this:

with open("Task1.csv") as file:
    numline = len(file.readlines())
print(numline)

I hope this helps everyone.

Solution 4

Several of the above suggestions count the number of LINES in the csv file. But some CSV files will contain quoted strings which themselves contain newline characters. MS CSV files usually delimit records with \r\n, but use \n alone within quoted strings.

For a file like this, counting lines of text (as delimited by newline) in the file will give too large a result. So for an accurate count you need to use csv.reader to read the records.
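The difference can be seen with a small sketch. The data below is hypothetical: three records, one of which has a quoted field containing a newline.

```python
import csv
import io

# Hypothetical data: the second record has a quoted field with an embedded "\n".
data = 'id,comment\r\n1,"first line\nsecond line"\r\n2,plain\r\n'

# Counting physical lines splits the quoted field in two.
line_count = sum(1 for line in io.StringIO(data))

# csv.reader parses the quoting, so the embedded newline stays inside one record.
record_count = sum(1 for row in csv.reader(io.StringIO(data)))

print(line_count, record_count)
```

The line count comes out as 4, while csv.reader reports the 3 records the file actually contains.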

Solution 5

First, open the file with open:

input_file = open("nameOfFile.csv", newline="")

Then pass the file object to csv.reader:

reader_file = csv.reader(input_file)

Finally, get the number of rows with len:

value = len(list(reader_file))

The complete code is:

import csv

input_file = open("nameOfFile.csv", newline="")
reader_file = csv.reader(input_file)
value = len(list(reader_file))

Remember that if you want to reuse the file afterwards, you have to call input_file.seek(0) and create a new csv.reader, because list(reader_file) reads the whole file and leaves the file pointer at the end.
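A minimal sketch of rewinding after the count, using a hypothetical in-memory file in place of nameOfFile.csv (the file method for this is seek):

```python
import csv
import io

# Hypothetical in-memory file; a real file object from open() behaves the same way.
input_file = io.StringIO("a,b\n1,2\n3,4\n")

# Counting this way consumes the whole file.
row_count = len(list(csv.reader(input_file)))

# A second pass without rewinding finds nothing left to read.
rows_without_rewind = list(csv.reader(input_file))

# Rewind to the start and build a fresh reader to process the rows.
input_file.seek(0)
rows = list(csv.reader(input_file))

print(row_count, len(rows_without_rewind), len(rows))
```

Creating a new reader after the seek keeps the reader's internal state, such as its line counter, consistent with the rewound file.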

Author: GrantU

Updated on March 21, 2022

Comments

  • GrantU
    GrantU about 2 years

    I'm using Python (Django framework) to read a CSV file. I pull just 2 lines out of this CSV, as you can see. What I have been trying to do is also store the total number of rows in the CSV in a variable.

    How can I get the total number of rows?

    file = object.myfilePath
    fileObject = csv.reader(file)
    for i in range(2):
        data.append(fileObject.next()) 
    

    I have tried:

    len(fileObject)
    fileObject.length
    
    • David Robinson
      David Robinson about 11 years
      What is file_read? Is it a file handle (as in file_read = open("myfile.txt"))?
    • GrantU
      GrantU about 11 years
      file_read = csv.reader(file) updated question should make sense now.
    • shredding
      shredding about 11 years
      Have a look at this question for thoughts on that topic: stackoverflow.com/questions/845058/…
    • dancow
      dancow over 3 years
      The accepted answer by @martjin-pieters is correct, but this question is worded poorly. In your pseudocode, you almost certainly want to count the number of rows i.e. records – as opposed to "Count how many lines are in a CSV". Because some CSV datasets may include fields which may be multiline.
    • trpt4him
      trpt4him over 3 years
      Also the algorithm you use to count the number of records is going to depend on whether or not you are also parsing every record and doing something with each one. If so, just simply count while you're iterating instead of performing an entire "table scan" separately.
  • GrantU
    GrantU about 11 years
    Thanks. That works, but do I have to read the lines first? That seems like a bit of a hit?
  • Martijn Pieters
    Martijn Pieters about 11 years
    file_read is apparently a csv.reader() object, so it does not have a readlines() method. .readlines() has to create a potentially large list, which you then discard again.
  • Martijn Pieters
    Martijn Pieters about 11 years
    You have to read the lines; the lines are not guaranteed to be a fixed size, so the only way to count them is to read them all.
  • Alex Troush
    Alex Troush about 11 years
    When I wrote this answer, the question did not yet say that the object is a csv.reader.
  • Escachator
    Escachator about 9 years
    it's weird, because I have a file with more than 4.5 million rows, and this method only counts 53 rows...
  • Martijn Pieters
    Martijn Pieters about 9 years
    @Escachator: what platform are you on? Are there EOF (CTRL-Z, \x1A) characters in the file? How did you open the file?
  • Escachator
    Escachator about 9 years
    I am doing the following: file_read = csv.reader('filename') row_count = sum(1 for row in file_read) Don't think there are EOF in the file, just pure "," figures, and \n
  • Martijn Pieters
    Martijn Pieters about 9 years
    @Escachator: Your filename has 53 characters then. The reader takes an iterable or an open file object but not a filename.
  • Escachator
    Escachator about 9 years
    I see... let me fix that
  • Escachator
    Escachator about 9 years
    now it works, and it is super fast! I did: file_read = open('filename') row_count = sum(1 for row in file_read) big thanks!
  • lesolorzanov
    lesolorzanov over 6 years
    Should you also close the file? to save space?
  • gosuto
    gosuto over 6 years
    Why do you prefer sum() over len() in your conclusion? Len() is faster in your results!
  • Simon Lang
    Simon Lang over 6 years
    Nice answer. One addition. Although slower, one should prefer the for row in csv_reader: solution when the CSV is supposed to contain valid quoted newlines according to rfc4180. @dixhom how large was the file you've tested?
  • Pengju Zhao
    Pengju Zhao about 6 years
    I like this short answer, but it is slower than Martijn Pieters's. For 10M lines, %time sum(1 for row in open("df_data_raw.csv")) cost 4.91s while %time len(open("df_data_raw.csv").readlines()) cost 14.6s.
  • KevinTydlacka
    KevinTydlacka almost 6 years
    Note that if you want to then iterate through the reader again (to process the rows, say) then you'll need to reset the iterator, and recreate the reader object: file.seek(0) then fileObject = csv.reader(file)
  • Danilo Souza Morães
    Danilo Souza Morães over 5 years
    The first one counts the number of lines in a file. If your CSV has line breaks inside strings, it won't give accurate results.
  • Danilo Souza Morães
    Danilo Souza Morães over 5 years
    What if you have line breaks inside double quotes? That should still be considered part of the same record. This answer is wrong
  • dedricF
    dedricF about 4 years
    Just stumbling across stuff, seems this shape comment isn't so bad and actually comparatively very fast: stackoverflow.com/questions/15943769/…
  • dedricF
    dedricF about 4 years
    Oh but you'll want to do a data.shape[0]
  • Vitalis
    Vitalis almost 4 years
    This is very handy for integrating into a python script. +1
  • dancow
    dancow over 3 years
    But is it comparatively fast compared to @martijnpieters's answer, which uses a standard file handle/iterator, and doesn't require installing and importing the pandas library?
  • dancow
    dancow over 3 years
    The original title to the question ("Count how many lines are in a CSV Python") was worded confusingly/misleadingly, since the questioner seems to want the number of rows/records. Your answer would give a wrong number of rows in any dataset in which there are fields with newline characters
  • S3DEV
    S3DEV over 3 years
    Certainly the fastest solution. I'd recommend renaming the len variable as it's overwriting the built-in function.
  • S3DEV
    S3DEV over 3 years
    Nice one. sum1forline could be even faster if the file is opened as 'rb'.
  • pyjamas
    pyjamas over 3 years
    If you're reading it as a DataFrame, you don't need a loop; you can just do len(df).