PyPDF2 write doesn't work on some PDF files (Python 3.5.1)
Solution 1
Make the following changes in pdf.py.
At line 1633 of pdf.py, uncomment the if self.strict check so that the code reads:
if self.strict:
    raise utils.PdfReadError("Could not find object.")
Then, at line 501 of pdf.py, wrap the existing write calls in a try/except block:
try:
    obj.writeToStream(stream, key)
    stream.write(b_("\nendobj\n"))
except:
    pass
Cheers.
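The line numbers above belong to one particular PyPDF2 release, so they may not match the copy installed on your machine. A minimal sketch for finding the file to edit, assuming the package is installed in the current environment; the helper name locate_module_file is made up for illustration:

```python
import importlib.util
from pathlib import Path


def locate_module_file(package, filename):
    """Return the path of `filename` inside an installed package, or None."""
    spec = importlib.util.find_spec(package)
    if spec is None or spec.origin is None:
        return None  # package not installed (or a namespace package)
    candidate = Path(spec.origin).parent / filename
    return candidate if candidate.is_file() else None


if __name__ == "__main__":
    # Prints the path to pdf.py if PyPDF2 is installed, otherwise None.
    print(locate_module_file("PyPDF2", "pdf.py"))
```

Once the file is open, searching for the text "Could not find object." is probably more reliable than relying on the line numbers.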
Solution 2
Passing strict=False got things working for me.
from PyPDF2 import PdfFileMerger

pdfs = [r'file 1.pdf', r'file 2.pdf']

merger = PdfFileMerger(strict=False)
for pdf in pdfs:
    merger.append(pdf)

merger.write(r"thanks mate.pdf")
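PdfFileReader accepts the same strict flag, which may matter when pages are written with PdfFileWriter, as in the question, rather than merged. A sketch under that assumption, using the PyPDF2 1.x method names; copy_pages_lenient is a hypothetical helper, and comments on this thread report that strict=False does not cure every file:

```python
def copy_pages_lenient(src_path, dst_path):
    """Copy every page of src_path to dst_path using a non-strict reader."""
    # Imported here so the sketch can be loaded without PyPDF2 installed.
    from PyPDF2 import PdfFileReader, PdfFileWriter

    with open(src_path, "rb") as src:
        reader = PdfFileReader(src, strict=False)
        writer = PdfFileWriter()
        for index in range(reader.getNumPages()):
            writer.addPage(reader.getPage(index))
        # Write while the source file is still open: page data is read lazily.
        with open(dst_path, "wb") as dst:
            writer.write(dst)
```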
Max Eisert
Updated on June 17, 2022
Comments
-
Max Eisert almost 2 years
First of all, I am using Python 3.5.1 (32-bit version). I wrote the following program to add a page number to every page of my PDF files using PyPDF2 and reportlab:
# import modules
from os import listdir
from PyPDF2 import PdfFileWriter, PdfFileReader
import io
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import A4

# initial values of variable declarations
PDFlist=[]
X_value=460
Y_value=820

# Make a list of all files in the directory
filelist = listdir()

# Make a list of all pdf files in the directory
for i in range(0,len(filelist)):
    filename=filelist[i]
    for j in range(0,len(filename)):
        char=filename[j]
        if char=='.':
            extension=filename[j+1:j+4]
            if extension=='pdf':
                PDFlist.append(filename)
        j=j+1
    i=i+1

# Give the horizontal position for the page number (Enter = use default value of 460)
User = input('Give horizontal position page number (ENTER = default 460): ')
if User != "":
    X_value=int(User)

# Give the vertical position for the page number (Enter = use default value of 820)
User = input('Give vertical position page number (ENTER = default 820): ')
if User != "":
    Y_value=int(User)

for i in range(0,len(PDFlist)):
    filename=PDFlist[i]
    # read the PDF
    existing_pdf = PdfFileReader(open(filename, "rb"))
    print("File: "+filename)
    # count the number of pages
    number_of_pages = existing_pdf.getNumPages()
    print("Number of pages detected:"+str(number_of_pages))
    output = PdfFileWriter()
    for k in range(0,number_of_pages):
        packet = io.BytesIO()
        # create a new PDF with Reportlab
        can = canvas.Canvas(packet, pagesize=A4)
        Pagenumber=" Page "+str(k+1)+"/"+str(number_of_pages)
        # we first make a white rectangle to cover any existing text in the pdf
        can.setFillColorRGB(1,1,1)
        can.setStrokeColorRGB(1,1,1)
        can.rect(X_value-10,Y_value-5,120,20,fill=1)
        # set the font and size
        can.setFont("Helvetica",14)
        # choose color of page numbers (red)
        can.setFillColorRGB(1,0,0)
        can.drawString(X_value, Y_value, Pagenumber)
        can.save()
        print(Pagenumber)
        # move to the beginning of the BytesIO buffer
        packet.seek(0)
        new_pdf = PdfFileReader(packet)
        # add the "watermark" (which is the new pdf) on the existing page
        page = existing_pdf.getPage(k)
        page.mergePage(new_pdf.getPage(0))
        output.addPage(page)
        k=k+1
    # finally, write "output" to a real file
    ResultPDF="Output/"+filename
    outputStream = open(ResultPDF, "wb")
    output.write(outputStream)
    outputStream.close()
    i=i+1
This program works fine for quite a number of PDF files, although it sometimes generates warnings such as
PdfReadWarning: Superfluous whitespace found in object header b'16' b'0' [pdf.py:1666]
(the resulting output file is still okay for me). However, the program just doesn't work on some PDF files, even though these files are perfectly readable and editable with my Adobe Acrobat. I have the impression that the error pops up mostly on PDF files that were scanned, but not on all of them (I also numbered scanned PDF files that didn't generate any error). I am getting the following error message (the first 8 lines are the result of my own print commands):
File: Scanned file.pdf
Number of pages detected:6
Page 1/6
Page 2/6
Page 3/6
Page 4/6
Page 5/6
Page 6/6
PdfReadWarning: Object 25 1 not defined. [pdf.py:1629]
Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\Sourcecode\PDFPager.py", line 83, in <module>
    output.write(outputStream)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 482, in write
    self._sweepIndirectReferences(externalReferenceMap, self._root)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 571, in _sweepIndirectReferences
    self._sweepIndirectReferences(externMap, realdata)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 547, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, value)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 556, in _sweepIndirectReferences
    value = self._sweepIndirectReferences(externMap, data[i])
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 577, in _sweepIndirectReferences
    newobj = data.pdf.getObject(data)
  File "C:\Users\User\AppData\Local\Programs\Python\Python35-32\lib\site-packages\PyPDF2\pdf.py", line 1631, in getObject
    raise utils.PdfReadError("Could not find object.")
PyPDF2.utils.PdfReadError: Could not find object.
Apparently the pages are merged with the PDF created by reportlab (see the lines up to Page 6/6), but in the end PyPDF2 cannot generate an output PDF file (I get an unreadable output file of 0 bytes). Can somebody shed some light on how to resolve this? I searched the internet but couldn't really find an answer.
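One way to avoid being left with a 0-byte file is to let the writer fill an in-memory buffer first and only create the output file once that has succeeded. A minimal sketch, assuming only that the writer object has a write(stream) method, as PdfFileWriter does in the program above; write_if_complete is a hypothetical helper name. It does not fix the underlying error, it only keeps a failed run from leaving an empty file behind:

```python
import io


def write_if_complete(writer, path):
    """Serialize via writer.write() into memory, then save to `path`.

    If writer.write() raises, nothing is created at `path`.
    """
    buffer = io.BytesIO()
    writer.write(buffer)  # may raise, e.g. PdfReadError
    with open(path, "wb") as out:
        out.write(buffer.getvalue())
```

In the program above this would stand in for the open / output.write / close sequence, at the cost of holding the whole output in memory.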
-
Ninga almost 5 years
Hey, yes, I just reran with it set to True and the doc was still created, just with a bunch of warnings. I thought it fixed an issue with the new doc not being created; however, my issue must have been different.
-
GoingMyWay almost 5 years
I think you should first check whether the files are broken before merging them. If files are broken or not fully downloaded, merging will not succeed.
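A cheap pre-check along these lines, as a sketch: a complete PDF normally starts with %PDF- and has an %%EOF marker near its end, so a truncated download usually fails one of the two tests. The helper name looks_complete and the 1024-byte tail window are assumptions chosen for illustration, and this check would not detect the missing-object problem from the question:

```python
import os


def looks_complete(path, tail_bytes=1024):
    """Heuristic: True if the file has a PDF header and an %%EOF marker."""
    with open(path, "rb") as handle:
        head = handle.read(8)
        size = handle.seek(0, os.SEEK_END)      # seek returns the new offset
        handle.seek(max(0, size - tail_bytes))  # jump to the last few bytes
        tail = handle.read()
    return head.startswith(b"%PDF-") and b"%%EOF" in tail
```

Files that fail the check can be skipped or downloaded again before they are handed to the merger.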
-
Shaohua Li over 4 years
Cool. This fix should definitely be pushed into master. However, it seems PyPDF2 is unmaintained now :(
-
Watusimoto over 4 years
The same fix resolves the same problem in PyPDF4; I posted a link to this topic on the thread for the relevant bug there. PyPDF4 seems less inactive than PyPDF2.
-
bmg over 4 years
@Watusimoto thanks for letting me know! I added a comment below it. Let's hope the repo owner notices it.
-
mwakerman almost 4 years
@bmg - I also posted this question on the associated GitHub issue; feel free to respond here or there and I'll cross-post. We're looking to incorporate your workaround to get around this issue but are not sure about the consequences. It looks like an error is simply being ignored and a conditional that was, one would assume, intentionally commented out is being uncommented. Do you have an understanding of why this fixes the issue and whether it will result in content being removed from a document?
-
Sudhik over 3 years
I'm trying to merge only some pages, not whole PDF files. I still get the error with strict=False. Modifying pdf.py with the changes described above works. So pdf.py never got corrected?
-
bmg over 2 years
@mwakerman looks like someone deleted the PyPDF3 repo... If you have the content of the issue, can you open that issue in PyPDF4 and put it here too?
-
SahFra98 about 2 years
Does this impose any limitation when the PDF file is later read with PdfFileReader? I filled in a form automatically, and then I have to download it and fill in other fields manually. When I then read it with PdfFileReader I have some problems, because it no longer seems to recognize the fields.
-
bmg about 2 years
@SahFra98 I have stopped using the PyPDF versions altogether because of this problem, and the codebase is completely abandoned by its developer. I have switched over to PikePDF and would recommend doing so. If you just need to merge a few things, you can check out my repository here for reference: github.com/gonultasbu/pdf_merge.
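For readers who want to try that route, a minimal merge sketch with pikepdf might look like the following. It is not taken from the repository linked above; merge_with_pikepdf is a hypothetical helper name, and the sketch assumes pikepdf is installed:

```python
def merge_with_pikepdf(input_paths, output_path):
    """Concatenate the pages of several PDFs into one file using pikepdf."""
    # Imported here so the sketch can be loaded without pikepdf installed.
    import pikepdf

    merged = pikepdf.Pdf.new()
    sources = []
    for path in input_paths:
        src = pikepdf.Pdf.open(path)
        sources.append(src)             # keep each source open until saved
        merged.pages.extend(src.pages)  # append all of its pages
    merged.save(output_path)
    for src in sources:
        src.close()
```

Calling merge_with_pikepdf(['file 1.pdf', 'file 2.pdf'], 'merged.pdf') would mirror the PdfFileMerger example in Solution 2.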