PyPDF2 write doesn't work on some PDF files (Python 3.5.1)

python python-3.x pdf reportlab pypdf2

17,619

Solution 1

Make the following two changes in PyPDF2's pdf.py.

Around line 1633 of pdf.py, uncomment the if self.strict check so that the error is raised only in strict mode:

    if self.strict:
        raise utils.PdfReadError("Could not find object.")

Around line 501 of pdf.py, wrap the object write in a try/except block so that an object that cannot be written is skipped instead of aborting the whole write:

    try:
        obj.writeToStream(stream, key)
        stream.write(b_("\nendobj\n"))
    except Exception:
        # skip objects that cannot be serialized
        pass

Cheers.

Solution 2

Passing strict=False to PdfFileMerger got things working for me.

from PyPDF2 import PdfFileMerger

pdfs = [r'file 1.pdf', r'file 2.pdf']

merger = PdfFileMerger(strict=False)

for pdf in pdfs:
    merger.append(pdf)

merger.write(r"thanks mate.pdf")

Author by

Max Eisert

Updated on June 17, 2022

Comments

Max Eisert almost 2 years

First of all, I am using Python 3.5.1 (32-bit version). I wrote the following program to add a page number to every page of my PDF files using PyPDF2 and reportlab:

#import modules
from os import listdir
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4
#initial values of variable declarations
PDFlist=[]
X_value=460
Y_value=820
#Make a list of all pdf files in the directory
for filename in listdir():
    if filename.endswith('.pdf'):
        PDFlist.append(filename)
# Give the horizontal position for the page number (Enter = use default value of 460)
User = input('Give horizontal position page number (ENTER = default 460): ')
if User != "":
    X_value=int(User)
# Give the vertical position for the page number (Enter = use default value of 820)
User = input('Give vertical position page number (ENTER = default 820): ')
if User != "":
    Y_value=int(User)

for i in range(0,len(PDFlist)):
    filename=PDFlist[i]

    # read the PDF
    existing_pdf = PdfFileReader(open(filename, "rb"))
    print("File: "+filename)
    # count the number of pages
    number_of_pages = existing_pdf.getNumPages()
    print("Number of pages detected:"+str(number_of_pages))
    output = PdfFileWriter()

    for k in range(0,number_of_pages):
        packet = io.BytesIO()

        # create a new PDF with Reportlab
        can = canvas.Canvas(packet, pagesize=A4)
        Pagenumber=" Page "+str(k+1)+"/"+str(number_of_pages)
        # we first make a white rectangle to cover any existing text in the pdf
        can.setFillColorRGB(1,1,1)
        can.setStrokeColorRGB(1,1,1)
        can.rect(X_value-10,Y_value-5,120,20,fill=1)
        # set the font and size
        can.setFont("Helvetica",14)
        # choose color of page numbers (red)
        can.setFillColorRGB(1,0,0)
        can.drawString(X_value, Y_value, Pagenumber)
        can.save()
        print(Pagenumber)

        #move to the beginning of the BytesIO buffer
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(k)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
    # finally, write "output" to a real file

    ResultPDF="Output/"+filename
    with open(ResultPDF, "wb") as outputStream:
        output.write(outputStream)

This program works fine for quite a number of PDF files. Warnings are sometimes generated, such as 'PdfReadWarning: Superfluous whitespace found in object header b'16' b'0' [pdf.py:1666]', but the resulting output file is fine for my purposes. However, the program just doesn't work on some PDF files, although these files are perfectly readable and editable in Adobe Acrobat. I have the impression that the error occurs mostly with scanned PDF files, but not with all of them (I have also numbered scanned PDF files without any error). I get the following error message (the first 8 lines are the output of my own print commands):

File: Scanned file.pdf
Number of pages detected:6
 Page 1/6
 Page 2/6
 Page 3/6
 Page 4/6
 Page 5/6
 Page 6/6
PdfReadWarning: Object 25 1 not defined. [pdf.py:1629]
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\Sourcecode\PDFPager.py", line 83, in <module>
    output.write(outputStream)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 482, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 577, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 1631, in getObject
    raise utils.PdfReadError("Could not find object.")
PyPDF2.utils.PdfReadError: Could not find object.

Apparently the pages are merged with the PDF created by reportlab (see the lines up to Page 6/6), but in the end PyPDF2 cannot generate an output PDF file (I get an unreadable output file of 0 bytes). Can somebody shed some light on how to resolve this? I searched the internet but couldn't find an answer.
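A side note on the 0-byte file: the output file is created as soon as it is opened with "wb", so if output.write then raises, an empty file is left behind. A minimal sketch of one way around that, assuming it is acceptable to hold the whole PDF in memory, is to write into a BytesIO buffer first and only touch the disk once the write has succeeded. The helper name write_via_buffer is made up for this illustration and is not part of PyPDF2.

```python
import io


def write_via_buffer(path, write_func):
    """Run write_func on an in-memory buffer; create `path` only on success.

    write_func is any callable that takes a binary stream, for example the
    bound method output.write of a PdfFileWriter (assumed usage).
    """
    buffer = io.BytesIO()
    # If this call raises, `path` has not been opened yet,
    # so no empty or truncated file is left on disk.
    write_func(buffer)
    with open(path, "wb") as handle:
        handle.write(buffer.getvalue())
```

In the script above this would be called as write_via_buffer(ResultPDF, output.write). It does not fix the PdfReadError itself; it only keeps a failed run from leaving an empty file in the Output directory.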

Ninga almost 5 years

Hey, yes, I just reran with it set to True and the doc was still created, just with a bunch of warnings. I thought it fixed an issue with the new doc not being created; however, my issue must have been different.

GoingMyWay almost 5 years

I think you should first check whether the files are broken before merging them. If files are broken or not fully downloaded, merging will not succeed.
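A rough sketch of such a pre-check, using only the standard library: it looks for the %PDF- header near the start of the file and the %%EOF marker near the end, which catches many truncated downloads. The function name and the 1024-byte windows are assumptions chosen for this illustration. Passing the check does not prove the file is well formed; a file can have both markers and still reference an undefined object, as in the question.

```python
import os


def looks_like_complete_pdf(path, window=1024):
    """Cheap heuristic: %PDF- near the start and %%EOF near the end."""
    with open(path, "rb") as handle:
        head = handle.read(window)
        # Jump to the end to learn the file size, then read the last bytes.
        handle.seek(0, os.SEEK_END)
        size = handle.tell()
        handle.seek(max(0, size - window))
        tail = handle.read()
    return b"%PDF-" in head and b"%%EOF" in tail
```

Files that fail the check can be skipped or downloaded again before they are passed to the merger.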

Shaohua Li over 4 years

Cool. This fix should definitely be pushed into master. However, it seems PyPDF2 is unmaintained now :(

Watusimoto over 4 years

This same fix fixes the same problem on pypdf4; I posted a link to this topic on the thread for the relevant bug there. pypdf4 seems less inactive than pypdf2.

bmg over 4 years

@Watusimoto thanks for letting me know! I added a comment below it. Let's hope the repo owner notices it.

mwakerman almost 4 years

@bmg - I also posted this question on the associated GitHub issue; feel free to respond here or there and I'll cross-post. We're looking to incorporate your workaround to get around this issue but are not sure about the consequences. It looks like an error is simply being ignored and a conditional that was, one would assume, intentionally commented out is being uncommented. Do you have an understanding of why this fixes the issue and whether it will result in content being removed from a document?

Sudhik over 3 years

I'm trying to merge only some pages rather than whole PDF files. I still get the error with strict=False. Modifying pdf.py with the changes described above works. So pdf.py never got corrected?

bmg over 2 years

@mwakerman looks like someone deleted the PyPDF3 repo... If you have the content of the issue, can you open that issue in PyPDF4 and put it here too?

SahFra98 about 2 years

Does this impose any limitation when the PDF file is later read with PdfFileReader? I filled in a form automatically, and then I have to download it and fill in other fields manually. When I then read it with PdfFileReader I run into problems, because it no longer seems to recognize the fields.

bmg about 2 years

@SahFra98 I have stopped using PyPDF versions altogether because of this problem and the codebase is completely abandoned by its developer. I have switched over to PikePDF and would recommend doing so. If you just need to merge a few things you can check out my repository here for reference: github.com/gonultasbu/pdf_merge.