PyPDF2 write doesn't work on some PDF files (Python 3.5.1)

Solution 1

In pdf.py, make the following changes.

Around line 1633 of pdf.py, uncomment the if self.strict check so that the error is raised only in strict mode (line numbers may differ between PyPDF2 versions):

    if self.strict:
        raise utils.PdfReadError("Could not find object.")

Then, around line 501 of pdf.py, wrap the object write in a try/except block:

    try:
        obj.writeToStream(stream, key)
        stream.write(b_("\nendobj\n"))
    except Exception:
        # Skip objects that cannot be written instead of aborting the whole write
        pass

Cheers.

Solution 2

Passing strict=False to PdfFileMerger got things working for me:

    from PyPDF2 import PdfFileMerger

    pdfs = [r'file 1.pdf', r'file 2.pdf']

    # strict=False makes PyPDF2 tolerate problems that would otherwise be fatal
    merger = PdfFileMerger(strict=False)

    for pdf in pdfs:
        merger.append(pdf)

    merger.write(r"thanks mate.pdf")
Author by

Max Eisert

Updated on June 17, 2022

Comments

  • Max Eisert
    Max Eisert almost 2 years

    First of all, I am using Python 3.5.1 (32-bit version). I wrote the following program to add a page number to every page of my PDF files using PyPDF2 and reportlab:

    #import modules
    from os import listdir
    from PyPDF2 import PdfFileWriter, PdfFileReader
    import io
    from reportlab.pdfgen import canvas
    from reportlab.lib.pagesizes import A4
    #initial values of variable declarations
    PDFlist=[]
    X_value=460
    Y_value=820
    # Make a list of all PDF files in the current directory
    for filename in listdir():
        if filename.endswith('.pdf'):
            PDFlist.append(filename)
    # Ask for the horizontal position of the page number (ENTER = use the default of 460)
    User = input('Give horizontal position page number (ENTER = default 460): ')
    if User != "":
        X_value=int(User)
    # Ask for the vertical position of the page number (ENTER = use the default of 820)
    User = input('Give vertical position page number (ENTER = default 820): ')
    if User != "":
        Y_value=int(User)
    
    for i in range(0,len(PDFlist)):
        filename=PDFlist[i]
    
        # read the PDF
        existing_pdf = PdfFileReader(open(filename, "rb"))
        print("File: "+filename)
        # count the number of pages
        number_of_pages = existing_pdf.getNumPages()
        print("Number of pages detected:"+str(number_of_pages))
        output = PdfFileWriter()
    
        for k in range(0,number_of_pages):
            packet = io.BytesIO()
    
            # create a new PDF with Reportlab
            can = canvas.Canvas(packet, pagesize=A4)
            Pagenumber=" Page "+str(k+1)+"/"+str(number_of_pages)
            # we first make a white rectangle to cover any existing text in the pdf
            can.setFillColorRGB(1,1,1)
            can.setStrokeColorRGB(1,1,1)
            can.rect(X_value-10,Y_value-5,120,20,fill=1)
            # set the font and size
            can.setFont("Helvetica",14)
            # choose color of page numbers (red)
            can.setFillColorRGB(1,0,0)
            can.drawString(X_value, Y_value, Pagenumber)
            can.save()
            print(Pagenumber)
    
            # move to the beginning of the BytesIO buffer
            packet.seek(0)
            new_pdf = PdfFileReader(packet)
            # add the "watermark" (which is the new pdf) on the existing page
            page = existing_pdf.getPage(k)
            page.mergePage(new_pdf.getPage(0))
            output.addPage(page)
            k=k+1
        # finally, write "output" to a real file
    
        ResultPDF="Output/"+filename
        outputStream = open(ResultPDF, "wb")
        output.write(outputStream)
        outputStream.close()
        i=i+1
    

    This program works fine for quite a number of PDF files, although warnings are sometimes generated, such as 'PdfReadWarning: Superfluous whitespace found in object header b'16' b'0' [pdf.py:1666]' (the resulting output file is still fine for my purposes). However, the program just doesn't work on some PDF files, even though these files are perfectly readable and editable in Adobe Acrobat. I have the impression that the error occurs mostly with scanned PDF files, but not with all of them (I have also numbered scanned PDF files without any error). I am getting the following error message (the first 8 lines are the output of my own print commands):

    File: Scanned file.pdf
    Number of pages detected:6
     Page 1/6
     Page 2/6
     Page 3/6
     Page 4/6
     Page 5/6
     Page 6/6
    PdfReadWarning: Object 25 1 not defined. [pdf.py:1629]
    Traceback (most recent call last):
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\Sourcecode\PDFPager.py", line 83, in <module>
        output.write(outputStream)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 482, in write
        self._sweepIndirectReferences(externalReferenceMap, self._root)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, data[i])
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
        self._sweepIndirectReferences(externMap, realdata)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, value)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
        value = self._sweepIndirectReferences(externMap, data[i])
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 577, in _sweepIndirectReferences
        newobj = data.pdf.getObject(data)
      File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 1631, in getObject
        raise utils.PdfReadError("Could not find object.")
    PyPDF2.utils.PdfReadError: Could not find object.
    

    Apparently the pages are merged with the PDF created by reportlab (see the lines up to Page 6/6), but in the end PyPDF2 cannot generate an output PDF file (I get an unreadable output file of 0 bytes). Can somebody shed some light on how to resolve this? I searched the internet but couldn't really find an answer.

  • Ninga
    Ninga almost 5 years
    Hey, yes, I just reran with it set to True and the doc was still created, just with a bunch of warnings. I thought it fixed an issue with the new doc not being created; however, my issue must have been different.
  • GoingMyWay
    GoingMyWay almost 5 years
    I think that before merging files, you should first check whether the files are broken, and only then merge them. If files are broken or not fully downloaded, merging will not succeed.
  • Shaohua Li
    Shaohua Li over 4 years
    Cool. This fix should definitely be pushed into master. However, it seems PyPDF2 is unmaintained now :(
  • Watusimoto
    Watusimoto over 4 years
    This same fix fixes the same problem on pypdf4; I posted a link to this topic on the thread for the relevant bug there. pypdf4 seems less inactive than pypdf2.
  • bmg
    bmg over 4 years
    @Watusimoto thanks for letting me know! I added a comment below it. Let's hope the repo owner notices it.
  • mwakerman
    mwakerman almost 4 years
    @bmg - I also posted this question on the associated GitHub issue; feel free to respond here or there and I'll cross-post. We're looking to incorporate your workaround to get around this issue but are not sure about the consequences. It looks like an error is simply being ignored and a conditional that was, one would assume, intentionally commented out is being uncommented. Do you have an understanding of why this fixes the issue, and whether it will result in content being removed from a document?
  • Sudhik
    Sudhik over 3 years
    I'm trying to merge not whole PDF files but only some pages. I still get the error with strict=False. Modifying pdf.py with the changes described works. So pdf.py never got corrected?
  • bmg
    bmg over 2 years
    @mwakerman looks like someone deleted the PyPDF3 repo... If you have the content of the issue, can you open that issue in PyPDF4 and put it here too?
  • SahFra98
    SahFra98 about 2 years
    Does this impose any limitation when the PDF file is later read with PdfFileReader? I filled in a form automatically, and then I have to download it and fill in other fields manually. When I then read it with PdfFileReader I have problems, because it no longer seems to recognize the fields.
  • bmg
    bmg about 2 years
    @SahFra98 I have stopped using PyPDF versions altogether because of this problem and the codebase is completely abandoned by its developer. I have switched over to PikePDF and would recommend doing so. If you just need to merge a few things you can check out my repository here for reference: github.com/gonultasbu/pdf_merge.
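
GoingMyWay's suggestion of checking files before merging can be roughly sketched with the standard library alone. This is only a heuristic, not something PyPDF2 provides: the function name looks_like_complete_pdf and the tail_bytes parameter are made up for illustration. It tests whether a file starts with the %PDF- header and has an %%EOF marker near its end, which catches empty or truncated downloads. A file that passes may still have missing internal objects, as in the question, so it does not replace the fixes above.

```python
from pathlib import Path


def looks_like_complete_pdf(path, tail_bytes=1024):
    """Heuristic check for empty or truncated PDF files.

    Hypothetical helper: returns True when the file starts with the
    "%PDF-" header and contains an "%%EOF" marker within its last
    `tail_bytes` bytes. It does not validate the PDF's internal objects.
    """
    pdf_path = Path(path)
    size = pdf_path.stat().st_size
    with pdf_path.open("rb") as handle:
        header = handle.read(5)
        # Jump to the last `tail_bytes` bytes (or the start of a small file)
        handle.seek(max(0, size - tail_bytes))
        tail = handle.read()
    return header == b"%PDF-" and b"%%EOF" in tail
```

It could be used to filter the input list before merging, for example pdfs = [p for p in pdfs if looks_like_complete_pdf(p)], while printing the names of the files that were skipped.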