How to write a check in Python to see if a file is valid UTF-8?

Solution 1

You could do something like this, which decodes the file strictly and treats any decoding error as invalid UTF-8:

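import codecs

try:
    # errors='strict' makes the decoder raise on the first invalid byte sequence
    with codecs.open(filename, encoding='utf-8', errors='strict') as f:
        # Iterating forces the whole file to be decoded
        for line in f:
            pass
    print("Valid UTF-8")
except UnicodeDecodeError:
    print("Invalid UTF-8")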
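
For reuse, the same idea can be wrapped in a function. This is only a sketch, assuming Python 3, where the built-in open() accepts the encoding and errors arguments directly; the name is_valid_utf8_file is made up for illustration and does not come from the answer.

```python
def is_valid_utf8_file(path):
    """Return True if the file at `path` decodes cleanly as UTF-8.

    Hypothetical helper (the name is not from the original answer).
    """
    try:
        # errors='strict' raises UnicodeDecodeError on the first bad byte sequence
        with open(path, encoding='utf-8', errors='strict') as f:
            # Iterate line by line so the whole file gets decoded
            # without holding all of it in memory at once
            for _ in f:
                pass
    except UnicodeDecodeError:
        return False
    return True
```

Returning a boolean instead of printing leaves it to the caller to decide what to do with an invalid file.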

Solution 2

def try_utf8(data):
    """Return the decoded text on success, or None on failure."""
    try:
        return data.decode('utf-8')
    except UnicodeDecodeError:
        return None

# f is a file object opened in binary mode, so read() returns the raw bytes
data = f.read()
udata = try_utf8(data)
if udata is None:
    # Not UTF-8.  Do something else
    pass
else:
    # Handle unicode data
    pass
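
If the file is too large to read in one go, the same check can probably be done chunk by chunk on the binary file object. The following is a rough sketch, assuming Python 3 and using the incremental decoder from the codecs module; the function name stream_is_utf8 and the chunk_size default are assumptions made for illustration, not part of the answer above.

```python
import codecs
import io


def stream_is_utf8(stream, chunk_size=64 * 1024):
    """Return True if the binary stream contains only valid UTF-8.

    Hypothetical helper: the name and chunk_size default are illustrative.
    """
    # The incremental decoder buffers a multi-byte character that is
    # split across two chunks instead of reporting it as an error
    decoder = codecs.getincrementaldecoder('utf-8')(errors='strict')
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            decoder.decode(chunk)
        # final=True raises if the input ended in the middle of a character
        decoder.decode(b'', final=True)
    except UnicodeDecodeError:
        return False
    return True


print(stream_is_utf8(io.BytesIO('héllo'.encode('utf-8'))))  # True
print(stream_is_utf8(io.BytesIO(b'\xff\xfe')))              # False
```

Unlike try_utf8, this does not return the decoded text, so it only fits when a yes/no answer is enough.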
Author by

Jox

I just love writing code...

Updated on June 03, 2022

Comments

  • Jox
    Jox over 1 year

    As stated in the title, I would like to check whether a given file object (opened as a binary stream) is a valid UTF-8 file.

    Anyone?

    Thanks

  • Jox
    Jox over 13 years
    Obviously I didn't do my homework well enough, given that there is more than one solution as simple as this :( Thanks!
  • colidyre
    colidyre over 4 years
    This could be simpler, using one line instead of three: codecs.open("path/to/file", encoding="utf-8", errors="strict").readlines()