Whitespace gone from PDF extraction, and strange word interpretation

Solution 1

Your PDF file doesn't contain printable space characters; it simply positions the words where they need to go. You'll have to do extra work to figure out the spaces, perhaps by assuming that multi-character runs are words and putting spaces between them.

If you can select text in the PDF reader, and have spaces appear properly, then at least you know there is enough information to reconstruct the text.
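
As a rough illustration of reconstructing spaces from positions, here is a minimal sketch. It assumes you can already obtain each text run on a line together with its horizontal start and end coordinates; the `(x_start, x_end, text)` tuple layout, the `join_runs` name, and the `gap_threshold` value are all hypothetical and not part of any PDF library's API.

```python
# Hypothetical input: one line of text as (x_start, x_end, text) runs.
# The tuple layout and the threshold are assumptions for illustration only.
def join_runs(runs, gap_threshold=1.0):
    """Join positioned text runs, inserting a space wherever the horizontal
    gap between two consecutive runs is larger than gap_threshold."""
    pieces = []
    prev_end = None
    # Sort by starting x position so the runs are read left to right.
    for x_start, x_end, text in sorted(runs, key=lambda r: r[0]):
        if prev_end is not None and x_start - prev_end > gap_threshold:
            pieces.append(" ")
        pieces.append(text)
        prev_end = x_end
    return "".join(pieces)


runs = [
    (0.0, 52.0, "Spontaneous"),
    (55.0, 80.0, "finger"),
    (83.0, 95.0, "mo"),        # one word split into two adjacent runs
    (95.2, 130.0, "vements"),
]
line = join_runs(runs)
print(line)
```

A sensible threshold depends on the font size and on the units the parser reports, so it would need tuning per document.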

"fi" is a typographic ligature, shown as a single character. You may find this is also happening with "fl", "ffi", and "ffl". You can use string replacement to substitute "fi" for the fi ligature.

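The string replacement could look something like the sketch below. It assumes the extractor emits the standard Unicode ligature code points (U+FB00 to U+FB04); if your extractor produces some other character for the ligature, the mapping would have to be adjusted. The `LIGATURES` table and the `expand_ligatures` name are made up for this example.

```python
import unicodedata

# Explicit mapping for the common Latin ligatures (U+FB00..U+FB04).
LIGATURES = {
    "\ufb00": "ff",
    "\ufb01": "fi",
    "\ufb02": "fl",
    "\ufb03": "ffi",
    "\ufb04": "ffl",
}


def expand_ligatures(text):
    """Replace each single-character ligature with its plain letters."""
    for ligature, letters in LIGATURES.items():
        text = text.replace(ligature, letters)
    return text


sample = "spontaneous \ufb01nger movements"
expanded = expand_ligatures(sample)
print(expanded)

# NFKC normalization expands these ligatures as well, but it also rewrites
# other compatibility characters, so it is a broader change than the mapping.
normalized = unicodedata.normalize("NFKC", sample)
```

The explicit mapping touches only the ligatures, which is usually the safer choice when the rest of the text should stay byte-for-byte the same.
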
Solution 2

Instead of PyPDF2, use the PDFMiner library, which offers the same functionality, as shown below. I got the code from this and edited it to suit my needs; it gives me a text file with whitespace between the words. I work with Anaconda and Python 3.6. To install PDFMiner for Python 3.6 you can use this link.

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from io import StringIO

class PdfConverter:

    def __init__(self, file_path):
        self.file_path = file_path

    # Convert the PDF file to a string with spaces between words
    def convert_pdf_to_txt(self):
        rsrcmgr = PDFResourceManager()
        retstr = StringIO()
        codec = 'utf-8'  # 'utf16','utf-8'
        laparams = LAParams()
        device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
        interpreter = PDFPageInterpreter(rsrcmgr, device)
        password = ""
        maxpages = 0  # 0 means no page limit
        caching = True
        pagenos = set()  # an empty set means all pages
        with open(self.file_path, 'rb') as fp:
            for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password, caching=caching, check_extractable=True):
                interpreter.process_page(page)
        device.close()
        text = retstr.getvalue()  # named "text" to avoid shadowing the built-in str
        retstr.close()
        return text

    # Convert the PDF text to a string and save it as text_pdf.txt
    def save_convert_pdf_to_txt(self):
        content = self.convert_pdf_to_txt()
        with open('text_pdf.txt', 'wb') as txt_pdf:
            txt_pdf.write(content.encode('utf-8'))


if __name__ == '__main__':
    pdfConverter = PdfConverter(file_path='sample.pdf')
    print(pdfConverter.convert_pdf_to_txt())

Solution 3

As an alternative to PyPDF2, I suggest pdftotext:

#!/usr/bin/env python

"""Use pdftotext to extract text from PDFs."""

import pdftotext

with open("foobar.pdf", "rb") as f:  # pdftotext needs the file opened in binary mode
    pdf = pdftotext.PDF(f)

# Iterate over all the pages
for page in pdf:
    print(page)

Solution 4

PyPDF doesn't read newline characters.

So use PyPDF4

Install it using

pip install PyPDF4

and use this code as an example

import PyPDF4
import re

with open(r'3134.pdf', 'rb') as pdfFileObj:
    pdfReader = PyPDF4.PdfFileReader(pdfFileObj)
    pageObj = pdfReader.getPage(1)  # pages are zero-indexed, so this is the second page
    pages_text = pageObj.extractText()

for line in pages_text.split('\n'):
    #if re.match(r"^PDF", line):
    print(line)

Solution 5

I tried the answers given here, but they did not work for me. The following works in my case:

import os

from pdf2image import convert_from_path
import pytesseract

os.makedirs("./images", exist_ok=True)  # image.save() fails if the folder is missing
images = convert_from_path("sample.pdf")
for i, image in enumerate(images, start=1):
    image.save(f"./images/page_{i}.jpg", "JPEG")

print(pytesseract.image_to_string("./images/page_1.jpg"))

The idea here is to first convert the PDF to an image and then read the text from it. This approach preserves the whitespace.

Dependencies:

  • conda install -c conda-forge tesseract
  • conda install pdf2image
  • conda install pytesseract

Author: Louis Thibault

Updated on June 14, 2022

Comments

  • Louis Thibault almost 2 years

    Using the snippet below, I've attempted to extract the text data from this PDF file.

    import pyPdf
    
    def get_text(path):
        # Load PDF into pyPDF
        pdf = pyPdf.PdfFileReader(file(path, "rb"))
        # Iterate pages
        content = ""
        for i in range(0, pdf.getNumPages()):
            content += pdf.getPage(i).extractText() + "\n"  # Extract text from page and add to content
        # Collapse whitespace
        content = " ".join(content.replace(u"\xa0", " ").strip().split())
        return content
    

    The output I obtain, however, is devoid of whitespace between most of the words. This makes it difficult to perform natural language processing on the text (my ultimate goal here).

    Also, the 'fi' in the word 'finger' is consistently interpreted as something else. This is rather problematic since this paper is about spontaneous finger movements...

    Does anybody know why this might be happening? I don't even know where to start!