'utf-8' codec can't decode byte 0xe2: invalid continuation byte error


Solution 1

PDF files are stored as bytes, so to read or write a PDF file you need to open it in binary mode (rb or wb).

with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode())

'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte might occur because of your editor, or because the PDF is not UTF-8 encoded (PDFs generally aren't).

Therefore use:

with open(file, 'rb') as fopen:
    q = fopen.read()
    print(q.decode('latin-1'))  # or any other encoding that suits the file
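
If you are not sure which encoding is suitable, one option is to let a detection library guess. A minimal sketch, assuming the third-party chardet package is installed:

import chardet

with open(file, 'rb') as fopen:
    q = fopen.read()

guess = chardet.detect(q)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73, ...}
print(q.decode(guess['encoding'] or 'latin-1', errors='replace'))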

If your editor's console is incompatible with the decoded characters, you won't be able to see any output either.

A note: you can't pass the encoding parameter while using rb, so you have to decode after reading the file.
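
Here is a minimal sketch of that restriction (the file name is just a placeholder):

# open() rejects an encoding argument in binary mode:
# ValueError: binary mode doesn't take an encoding argument
try:
    open('file.pdf', 'rb', encoding='utf-8')
except ValueError as error:
    print(error)

# So read the bytes first, then decode them explicitly.
with open('file.pdf', 'rb') as fopen:
    text = fopen.read().decode('latin-1')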

Solution 2

When you open a file with open(..., 'r', encoding='utf-8') you are basically guaranteeing that it is a text file containing no bytes which are not valid UTF-8. But of course this guarantee cannot hold for a PDF file: it is a binary format which may or may not contain strings in UTF-8, and that is not how you should read it.

If you have access to a library which reads PDF and extracts text strings, you could do

# Dunno if such a library exists, but bear with ...
instance = myFantasyPDFlibrary('file.pdf')
for text_snippet in instance.enumerate_texts_in_PDF():
    if 'API No.:\n' in text_snippet:
        api = text_snippet.split('API No.:\n')[1].split('\n')[0].split('"')[0].strip()
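
In practice, libraries along those lines do exist. A minimal sketch using the third-party pypdf package, assuming it is installed and that the 'API No.:' label survives text extraction:

from pypdf import PdfReader

reader = PdfReader('file.pdf')
for page in reader.pages:
    text = page.extract_text() or ''
    if 'API No.:' in text:
        api = text.split('API No.:')[1].split('\n')[0].split('"')[0].strip()
        print(api)
        break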

More realistically, but in a more pedestrian fashion, you could read the PDF file as a binary file, and look for the encoded text.

with open('file.pdf', 'rb') as pdf:
    pdfbytes = pdf.read()
if b'API No.:\n' in pdfbytes:
    api_text = pdfbytes.split(b'API No.:\n')[1].split(b'\n')[0].decode('utf-8')
    api = api_text.split('"')[0].strip()
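
The same lookup can be written with a bytes regular expression, which stays close to the regex approach in the question:

import re

with open('file.pdf', 'rb') as pdf:
    pdfbytes = pdf.read()

# re also works on bytes, as long as the pattern is a bytes pattern too
match = re.search(rb'API No\.:\n(.*)', pdfbytes)
if match:
    api = match.group(1).decode('utf-8', errors='replace').split('"')[0].strip()
    print(api)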

A crude workaround is to lie to Python about the encoding and claim that it's actually Latin-1. This particular encoding has the attractive feature that every byte maps exactly to its own Unicode code point, so you can read binary data as text and get away with it. But then, of course, any actual UTF-8 will be converted to mojibake (so "hëlló" will render as "hÃ«llÃ³", for example). You can extract actual UTF-8 text by converting the text back to bytes and then decoding it with the correct encoding (latintext.encode('latin-1').decode('utf-8')).
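
A short demonstration of that round trip:

raw = 'hëlló'.encode('utf-8')        # the bytes actually stored in the file
latintext = raw.decode('latin-1')    # decodes without error, but shows the mojibake 'hÃ«llÃ³'
fixed = latintext.encode('latin-1').decode('utf-8')
print(fixed)                         # 'hëlló' again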

Solution 3

Just switch to a different codec: encoding = 'unicode_escape'
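
For example, applied to the open() call from the question below:

with open(download_file_path + file_name, 'r', encoding='unicode_escape') as f:
    s = f.read()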


Comments

  • Prat over 2 years ago

    I am trying to read all PDF files from a folder to look for a number using a regular expression. On inspection, the charset of the PDFs is 'UTF-8'.

    Throws this error:

    'utf-8' codec can't decode byte 0xe2 in position 10: invalid continuation byte

    I tried reading in binary mode and tried Latin-1 encoding, but it shows all special characters, so nothing shows up in the search.

    import os
    import re
    import pandas as pd
    download_file_path = "C:\\Users\\...\\..\\"
    for file_name in os.listdir(download_file_path):
        try:
            with open(download_file_path + file_name, 'r',encoding="UTF-8") as f:
              s = f.read()
              re_api = re.compile("API No\.\:\n(.*)")
              api = re_api.search(s).group(1).split('"')[0].strip()
              print(api)
        except Exception as e:
            print(e)
    

    I am expecting to find the API number in the PDF files.

  • UVphoton almost 4 years ago
    Great tip about Latin-1.