Reading multiple files contained in a zip file with pandas


Solution 1

You can pass the file object returned by ZipFile.open() to pandas.read_csv() to build a pandas.DataFrame from a CSV file packed inside a multi-file zip archive.

Code:

pd.read_csv(zip_file.open('file3.txt'))

Example that reads every .csv member into a dict:

import pandas as pd
from zipfile import ZipFile

# Map each .csv member name to a DataFrame read straight from the archive
zip_file = ZipFile('textfile.zip')
dfs = {text_file.filename: pd.read_csv(zip_file.open(text_file.filename))
       for text_file in zip_file.infolist()
       if text_file.filename.endswith('.csv')}
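
To try the dict approach without any files on disk, the sketch below first builds a small archive in memory and then reads it back. The member names and their contents are made up for illustration; only ZipFile, ZipFile.open() and pd.read_csv() come from the answer above.

```python
import io
from zipfile import ZipFile

import pandas as pd

# Build a small archive in memory so the example runs without files on disk.
# The member names and contents are hypothetical.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("file1.csv", "a,b\n1,2\n3,4\n")
    archive.writestr("file2.csv", "a,b\n5,6\n")
    archive.writestr("notes.txt", "not a table\n")

# Reopen the archive for reading and load every .csv member into a dict.
buffer.seek(0)
with ZipFile(buffer) as zip_file:
    dfs = {}
    for name in zip_file.namelist():
        if name.endswith(".csv"):
            with zip_file.open(name) as member:
                dfs[name] = pd.read_csv(member)

print(sorted(dfs))
print(dfs["file1.csv"])
```

Using the archive and each member as context managers closes the handles as soon as the data has been read, which may matter when looping over many zip files.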

Solution 2

The simplest way to handle this, if one large CSV file has been split into parts and compressed into a single zip file:

import pandas as pd
from zipfile import ZipFile

# Open the archive once and concatenate every member into one DataFrame
with ZipFile('some.zip') as z:
    df = pd.concat(
        [pd.read_csv(z.open(name)) for name in z.namelist()],
        ignore_index=True
    )

Solution 3

I had a similar problem with XML files a while ago. The zipfile module can get you there.

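from zipfile import ZipFile

# yourfile is the path to the zip archive
z = ZipFile(yourfile)

text_files = z.infolist()

# Keep each member's raw contents, keyed by its name in the archive
contents = {}
for text_file in text_files:
    contents[text_file.filename] = z.read(text_file.filename)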

If you want to concatenate them into a pandas object then it might get a bit more complex, but that should get you started. Note that the read method returns bytes, so you may have to handle that as well.
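
One possible way to handle the bytes is to wrap them in io.BytesIO, which gives pd.read_csv() a file-like object, and then concatenate the resulting frames. This is a minimal sketch, assuming every member is a CSV part with the same columns; the in-memory archive and its member names are hypothetical.

```python
import io
from zipfile import ZipFile

import pandas as pd

# Hypothetical in-memory archive holding two parts of one table.
buffer = io.BytesIO()
with ZipFile(buffer, "w") as archive:
    archive.writestr("part1.csv", "id,value\n1,10\n2,20\n")
    archive.writestr("part2.csv", "id,value\n3,30\n")

frames = []
with ZipFile(buffer) as z:
    for info in z.infolist():
        raw = z.read(info.filename)  # bytes, not str
        # Wrap the bytes in a file-like object so read_csv can parse them.
        frames.append(pd.read_csv(io.BytesIO(raw)))

# Stack the parts into a single DataFrame with a fresh index.
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

For formats that need text rather than bytes, calling raw.decode() with the file's encoding is an alternative to io.BytesIO.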


Author: johnnyb

Updated on March 17, 2021

Comments

  • johnnyb about 3 years

    I have multiple zip files containing different types of txt files, like below:

    zip1 
      - file1.txt
      - file2.txt
      - file3.txt
    

    How can I use pandas to read in each of those files without extracting them?

    I know that if there were only one file per zip, I could use the compression parameter of read_csv, like below:

    df = pd.read_csv('textfile.zip', compression='zip')
    

    Any help on how to do this would be great.

    • MaxU - stop genocide of UA almost 7 years
      AFAIK it's not possible without extracting them...