Reading UTF-8 characters from a gzip file in Python

Solution 1

I don't see why this should be so hard.

What are you doing exactly? Please explain "eventually it reads an invalid character".

It should be as simple as:

import gzip

# gzip.open() defaults to binary mode, so read() returns raw bytes
with gzip.open('foo.gz', 'rb') as fp:
    contents = fp.read()  # contents now has the uncompressed bytes of foo.gz
u_str = contents.decode('utf-8')  # u_str is now a unicode string

EDITED

This answer works for Python 2. For Python 3, please see @SeppoEnarvi's answer at https://stackoverflow.com/a/19794943/610569 (it uses the rt mode for gzip.open).

Solution 2

This is possible since Python 3.3:

import gzip
with gzip.open('file.gz', 'rt', encoding='utf-8') as f:
    contents = f.read()  # contents is a str, already decoded from UTF-8

Notice that gzip.open() defaults to binary mode ('rb'), so you must explicitly specify text mode ('t') to get decoded strings back.
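To make the difference between the two modes concrete, here is a minimal sketch for Python 3.3 or later. The sample text and the temporary file name are made up for illustration; only gzip.open() and its mode and encoding arguments come from the answer.

```python
import gzip
import os
import tempfile

# Hypothetical sample data; any UTF-8 text would do.
sample = "naïve café ☕\nsecond line\n"

# Write a small gzip file so the example is self-contained.
path = os.path.join(tempfile.mkdtemp(), "sample.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.write(sample)

# Default (binary) mode: read() returns bytes that still need decoding.
with gzip.open(path) as f:
    raw = f.read()

# Text mode: read() returns str, decoded with the given encoding.
with gzip.open(path, "rt", encoding="utf-8") as f:
    text = f.read()

print(type(raw).__name__, type(text).__name__)
```

In binary mode the caller has to call .decode('utf-8') on the result, as in Solution 1. In text mode the decoding happens while reading.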

Solution 3

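Maybe wrap the gzip file object in a UTF-8 stream reader from the codecs module:

import codecs
import gzip

zf = gzip.open(fname, 'rb')  # fname is the path to the .gz file
reader = codecs.getreader("utf-8")  # factory for UTF-8 stream readers
contents = reader(zf)  # decodes the byte stream as it is read
for line in contents:
    pass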

Solution 4

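The above produced tons of decoding errors. I used this instead, which silently drops any bytes that are not valid UTF-8 (errors='ignore'):

import gzip
import io
for line in io.TextIOWrapper(io.BufferedReader(gzip.open(filePath)), encoding='utf8', errors='ignore'):
    ...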
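Since errors='ignore' hides the problem rather than fixing it, it can help to compare error handlers on a small input first. The sketch below is for Python 3 and uses a made-up payload with one stray byte. The read_text helper is hypothetical. It relies on gzip.open() accepting a file object and an errors argument in text mode, where it wraps the stream in an io.TextIOWrapper much like the loop above.

```python
import gzip
import io

# Hypothetical input: valid UTF-8 with one stray byte (0xFF) in the middle.
raw = "good ".encode("utf-8") + b"\xff" + "line\n".encode("utf-8")
payload = gzip.compress(raw)


def read_text(errors):
    """Decompress the payload and decode it as UTF-8 with the given error handler."""
    with gzip.open(io.BytesIO(payload), "rt", encoding="utf-8", errors=errors) as f:
        return f.read()


ignored = read_text("ignore")    # "good line\n": the bad byte is dropped
replaced = read_text("replace")  # the bad byte becomes U+FFFD

print(repr(ignored))
print(repr(replaced))
```

With errors='replace' the position of each bad byte stays visible in the output, which makes it easier to tell whether the file is really UTF-8 or was written in some other encoding.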
Author: Juan Besa

Updated on June 30, 2021

Comments

  • Juan Besa almost 3 years

    I am trying to read a gzipped file (.gz) in Python and am having some trouble.

    I used the gzip module to read it, but the file is encoded as a UTF-8 text file, so eventually it reads an invalid character and crashes.

    Does anyone know how to read gzip files encoded as UTF-8? I know that there's a codecs module that can help, but I can't understand how to use it.

    Thanks!

    import string
    import gzip
    import codecs
    
    f = gzip.open('file.gz','r')
    
    engines = {}
    line = f.readline()
    while line:
        parsed = string.split(line, u'\u0001')
    
        #do some things...
    
        line = f.readline()
    for en in engines:
        print(en)
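One possible reading of the crash, assuming the code above ran on Python 2: the lines come back as byte strings, and splitting a byte string on a unicode separator makes Python decode the line as ASCII first, which fails on the first non-ASCII byte. A hedged Python 3 sketch of the same loop follows. The sample records, the temporary file, and the counting step are made up for illustration. The U+0001 separator and the engines dict come from the question.

```python
import gzip
import os
import tempfile

# Hypothetical sample records, fields separated by U+0001 as in the question.
records = ["google\u0001search\n", "bing\u0001búsqueda\n"]
path = os.path.join(tempfile.mkdtemp(), "file.gz")
with gzip.open(path, "wt", encoding="utf-8") as f:
    f.writelines(records)

engines = {}
# Text mode decodes each line to str, so the split works on text, not bytes.
with gzip.open(path, "rt", encoding="utf-8") as f:
    for line in f:
        parsed = line.rstrip("\n").split("\u0001")
        # Hypothetical stand-in for "do some things": count rows per first field.
        engines[parsed[0]] = engines.get(parsed[0], 0) + 1

for en in engines:
    print(en)
```

Iterating over the file object replaces the manual readline() loop, and the with block closes the file even if parsing raises.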