Open a csv.gz file in Python and print first 100 rows

Solution 1

This is pretty much what you've already done, except that read_csv also takes an nrows argument, which sets how many rows to read from the data set.

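Additionally, to avoid the errors you were getting, you can set error_bad_lines to False. You'll still get a warning for each skipped line (if that bothers you, set warn_bad_lines to False as well). The warnings indicate that rows in your dataset have inconsistent numbers of fields. Note that both flags were deprecated in pandas 1.3 and removed in 2.0; on those versions use on_bad_lines='skip' (or 'warn') instead.

import pandas as pd

# nrows=100 stops parsing after the first 100 data rows.
# error_bad_lines=False skips malformed rows (pandas < 2.0 only;
# newer versions take on_bad_lines='skip' instead).
data = pd.read_csv('google-us-data.csv.gz', nrows=100, compression='gzip',
                   error_bad_lines=False)
print(data)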

You can do something similar with the built-in csv library, but it requires a for loop to iterate over the data, as shown in other examples.
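A rough sketch of that csv-module route, assuming the file is comma-delimited with the column names on its first line. The helper name first_records and the throwaway demo file are made up for illustration; they are not from the question.

```python
import csv
import gzip
import os
import tempfile


def first_records(path, limit=100):
    """Return (columns, records) with at most `limit` rows from a gzipped CSV."""
    records = []
    # 'rt' opens the compressed stream as text; newline='' is what csv expects.
    with gzip.open(path, "rt", newline="") as f:
        reader = csv.DictReader(f)  # takes column names from the first line
        for i, record in enumerate(reader):
            if i >= limit:
                break  # stop early instead of walking the whole file
            records.append(record)
        columns = reader.fieldnames
    return columns, records


# Small throwaway file (hypothetical data) so the sketch runs on its own.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "name", "score"])
    writer.writerows([i, f"item{i}", i * 2] for i in range(250))

columns, records = first_records(demo_path)
print(len(columns), columns)
print(len(records), records[0])
```

Each record is a dict keyed by column name, so the number of columns is just len(columns).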

Solution 2

I think you could do something like this (adapted from the gzip module examples):

import gzip
with gzip.open('/home/joe/file.txt.gz', 'rb') as f:
    header = f.readline()
    # Read lines any way you want now. 
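To make the "read lines any way you want" part concrete, here is one possible way to stop after the first N lines, using itertools.islice. It is only a sketch: the helper name head_lines, the UTF-8 encoding, and the demo file are assumptions, not something the question confirms.

```python
import gzip
import os
import tempfile
from itertools import islice


def head_lines(path, n=100):
    """Return the first `n` lines of a gzipped text file, newlines stripped."""
    # 'rt' decodes bytes to str; islice stops pulling lines after n of them.
    with gzip.open(path, "rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n)]


# Throwaway demo file (hypothetical contents) so the sketch is runnable.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.txt.gz")
with gzip.open(demo_path, "wt", encoding="utf-8") as f:
    f.write("header line\n")
    for i in range(500):
        f.write(f"row {i}\n")

lines = head_lines(demo_path, 5)
print(lines)
```

Because the file object is consumed lazily, nothing is written to disk uncompressed and the rest of the file is not read.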

Solution 3

The first answer you linked suggests using gzip.GzipFile - this gives you a file-like object that decompresses for you on the fly.

Now you just need some way to parse csv data out of a file-like object ... like csv.reader.

The first row that csv.reader yields is the header, a list of field names, so you know the columns, their names, and how many there are.

Then you need to get the first 100 csv row objects, which will work exactly like in the second question you linked, and each of those 100 objects will be a list of fields.

So far this is all covered in your linked questions, apart from knowing about the existence of the csv module, which is listed in the library index.
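Put together, those steps might look roughly like this. It assumes a comma-delimited, UTF-8 file whose first line is the header; the function name preview and the demo data are invented for the example.

```python
import csv
import gzip
import io
import os
import tempfile
from itertools import islice


def preview(path, n=100):
    """Return (fieldnames, rows): the header plus up to `n` data rows."""
    # GzipFile yields bytes; TextIOWrapper decodes them so csv.reader sees str.
    with io.TextIOWrapper(gzip.GzipFile(path, "rb"),
                          encoding="utf-8", newline="") as text:
        reader = csv.reader(text)
        fieldnames = next(reader)        # first row: the column names
        rows = list(islice(reader, n))   # next n rows, each a list of fields
    return fieldnames, rows


# Throwaway demo file (hypothetical data) so the sketch runs by itself.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.csv.gz")
with gzip.open(demo_path, "wt", encoding="utf-8", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["date", "query", "clicks", "country"])
    writer.writerows(["2015-01-01", f"q{i}", i, "US"] for i in range(300))

fieldnames, rows = preview(demo_path)
print("columns:", len(fieldnames), fieldnames)
print("rows read:", len(rows))
```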

Solution 4

Your code is OK; the "Skipping line" messages are warnings, not errors. From the pandas read_csv documentation:

warn_bad_lines : boolean, default True

If error_bad_lines is False, and warn_bad_lines is True, 
a warning for each “bad line” will be output. (Only valid with C parser).

Author: SizzyNini

Updated on July 09, 2022

Comments

  • SizzyNini
    SizzyNini almost 2 years

    I'm trying to get only the first 100 rows of a csv.gz file that has over 4 million rows in Python. I also want information on the # of columns and the headers of each. How can I do this?

    I looked at "python: read lines from compressed text files" to figure out how to open the file, but I'm struggling to figure out how to actually print the first 100 rows and get some metadata on the information in the columns.

    I found "Read first N lines of a file in python", but I'm not sure how to combine it with opening the csv.gz file and reading it without saving an uncompressed csv file.

    I have written this code:

    import gzip
    import csv
    import json
    import pandas as pd
    
    
    df = pd.read_csv('google-us-data.csv.gz', compression='gzip', header=0,    sep=' ', quotechar='"', error_bad_lines=False)
    for i in range (100):
    print df.next() 
    

    I'm new to Python and I don't understand the results. I'm sure my code is wrong and I've been trying to debug it but I don't know which documentation to look at.

    I get these results (and it keeps going down the console - this is an excerpt):

    Skipping line 63: expected 3 fields, saw 7
    Skipping line 64: expected 3 fields, saw 7
    Skipping line 65: expected 3 fields, saw 7
    Skipping line 66: expected 3 fields, saw 7
    Skipping line 67: expected 3 fields, saw 7
    Skipping line 68: expected 3 fields, saw 7
    Skipping line 69: expected 3 fields, saw 7
    Skipping line 70: expected 3 fields, saw 7
    Skipping line 71: expected 3 fields, saw 7
    Skipping line 72: expected 3 fields, saw 7
    
    • CAB
      CAB over 7 years
      You will get help much faster if you know how to ask. What code have you written and how has it failed you?
    • SizzyNini
      SizzyNini over 7 years
      Ok I updated my post. Ideas?
    • moustachio
      moustachio over 7 years
      Can you post a sample of what the raw file looks like? (e.g. try head filename in a terminal)
    • Padraic Cunningham
      Padraic Cunningham over 7 years
      Pandas is using the metadata as the columns. You need to ignore lines up to the line that contains the column names