Search and replace for text within a pdf, in Python
Solution 1
This can be done with the PyPDF2 package. The implementation may depend on the structure of the original PDF template, but if the template is stable and doesn't change very often, the replacement code doesn't need to be generic and can stay fairly simple.
I did a small sketch of how you could replace the text inside a PDF file. It replaces all occurrences of the token PDF with DOC.
import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)
UPDATE 2021-03-21:
Updated the code example to handle DecodedStreamObject and EncodedStreamObject, which actually contain the data stream with the text to update.
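To see what replace_text is doing without needing a PDF at hand, here is a minimal sketch that runs the same BT/ET logic over a hand-written content stream. The sample stream, the font name /PDFSans and the function name replace_in_text_blocks are made up for illustration; real streams vary a lot between PDF producers.

```python
# A made-up, uncompressed content stream fragment (illustrative only).
# /PDFSans is a hypothetical font resource name that happens to contain "PDF".
SAMPLE_STREAM = "\n".join([
    "q",
    "BT",
    "/PDFSans 12 Tf",
    "72 720 Td",
    "(This PDF is a template) Tj",
    "ET",
    "Q",
])


def replace_in_text_blocks(content, replacements):
    """Apply replacements only to lines ending in Tj/TJ between BT and ET."""
    out = []
    in_text = False
    for line in content.splitlines():
        if line == "BT":
            in_text = True
        elif line == "ET":
            in_text = False
        elif in_text and line[-2:].lower() == "tj":
            # Text-showing operator: safe place to substitute the string.
            for old, new in replacements.items():
                line = line.replace(old, new)
        out.append(line)
    return "\n".join(out) + "\n"


result = replace_in_text_blocks(SAMPLE_STREAM, {"PDF": "DOC"})
print(result)
```

The Tj line is rewritten while the font-selection line is left alone, even though it also contains "PDF". Note that this only matches text stored as a literal string on a single line; if the producer writes the text hex-encoded, or splits it into a TJ array with kerning adjustments such as [(P) -20 (DF)] TJ, a plain string replace will not find it.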
Solution 2
If @Dmytro's solution does not alter the final PDF
Dmytro's updated code example, which handles the DecodedStreamObject and EncodedStreamObject that actually contain the data stream with the text to update, ran fine for me, but with a file different from the example it was not able to alter the PDF text content.
According to EDIT 3 of "How to replace text in a PDF using Python?":
By inserting page[NameObject("/Contents")] = contents.decodedSelf before writer.addPage(page), we force PyPDF2 to update the content of the page object.
This way I was able to overcome the problem and replace text in the PDF file.
Final code should look like this:
import os
import argparse
from PyPDF2 import PdfFileReader, PdfFileWriter
from PyPDF2.generic import DecodedStreamObject, EncodedStreamObject, NameObject


def replace_text(content, replacements = dict()):
    lines = content.splitlines()

    result = ""
    in_text = False

    for line in lines:
        if line == "BT":
            in_text = True

        elif line == "ET":
            in_text = False

        elif in_text:
            cmd = line[-2:]
            if cmd.lower() == 'tj':
                replaced_line = line
                for k, v in replacements.items():
                    replaced_line = replaced_line.replace(k, v)
                result += replaced_line + "\n"
            else:
                result += line + "\n"
            continue

        result += line + "\n"

    return result


def process_data(object, replacements):
    data = object.getData()
    decoded_data = data.decode('utf-8')

    replaced_data = replace_text(decoded_data, replacements)

    encoded_data = replaced_data.encode('utf-8')
    if object.decodedSelf is not None:
        object.decodedSelf.setData(encoded_data)
    else:
        object.setData(encoded_data)


if __name__ == "__main__":
    ap = argparse.ArgumentParser()
    ap.add_argument("-i", "--input", required=True, help="path to PDF document")
    args = vars(ap.parse_args())

    in_file = args["input"]
    filename_base = in_file.replace(os.path.splitext(in_file)[1], "")

    # Provide replacements list that you need here
    replacements = { 'PDF': 'DOC'}

    pdf = PdfFileReader(in_file)
    writer = PdfFileWriter()

    for page_number in range(0, pdf.getNumPages()):

        page = pdf.getPage(page_number)
        contents = page.getContents()

        if isinstance(contents, DecodedStreamObject) or isinstance(contents, EncodedStreamObject):
            process_data(contents, replacements)
        elif len(contents) > 0:
            for obj in contents:
                if isinstance(obj, DecodedStreamObject) or isinstance(obj, EncodedStreamObject):
                    streamObj = obj.getObject()
                    process_data(streamObj, replacements)

        # Force content replacement
        page[NameObject("/Contents")] = contents.decodedSelf
        writer.addPage(page)

    with open(filename_base + ".result.pdf", 'wb') as out_file:
        writer.write(out_file)
Important: from PyPDF2.generic import NameObject
Solution 3
- Decompress the pdf to make parsing easier (solves many of the issues in the previous answer). I use pdftk. (If this step fails, one hack to pre-process the pdf is to open the pdf in OSX Preview, print it, and then choose save as pdf from the print menu. Then retry the command below.)
pdftk original.pdf output uncompressed.pdf uncompress
- Parse and replace using PyPDF2.
from PyPDF2 import PdfFileReader, PdfFileWriter

replacements = [
    ("old string", "new string")
]

pdf = PdfFileReader(open("uncompressed.pdf", "rb"))
writer = PdfFileWriter()

for page in pdf.pages:
    contents = page.getContents().getData()
    for (a,b) in replacements:
        contents = contents.replace(a.encode('utf-8'), b.encode('utf-8'))
    page.getContents().setData(contents)
    writer.addPage(page)

with open("modified.pdf", "wb") as f:
    writer.write(f)
- [Optional] Re-compress the pdf.
pdftk modified.pdf output recompressed.pdf compress
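If you would rather drive the pdftk steps from the same Python script, something along these lines could work. It is only a sketch: it assumes pdftk is installed and on the PATH, and the helper names pdftk_args and run_pdftk are hypothetical, not part of any library.

```python
import shutil
import subprocess


def pdftk_args(src, dst, mode):
    """Build the pdftk command line used in the steps above."""
    if mode not in ("uncompress", "compress"):
        raise ValueError("mode must be 'uncompress' or 'compress'")
    return ["pdftk", src, "output", dst, mode]


def run_pdftk(src, dst, mode):
    """Run pdftk, failing early if the binary is not available."""
    if shutil.which("pdftk") is None:
        raise RuntimeError("pdftk was not found on PATH")
    # check=True raises CalledProcessError if pdftk exits with an error.
    subprocess.run(pdftk_args(src, dst, mode), check=True)
```

You would call run_pdftk("original.pdf", "uncompressed.pdf", "uncompress") before the PyPDF2 step and run_pdftk("modified.pdf", "recompressed.pdf", "compress") after it.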
Phil Hunt
Updated on September 23, 2022
Comments
-
Phil Hunt over 1 year
I am writing mailmerge software as part of a Python web app.
I have a template called letter.pdf which was generated from an MS Word file and includes the text {name} where the resident's name will go. I also have a list of c. 100 residents' names.
What I want to do is to read in letter.pdf, do a search for "{name}" and replace it with the resident's name (for each resident), then write the result to another PDF. I then want to gather all these PDFs together into a big PDF (one page per letter) which my web app's users will print out to create their letters.
Are there any Python libraries that will do this? I've looked at pdfrw and pdfminer but I couldn't see where they would be able to do it.
(NB: I also have the MS Word file, so if there was another way of using that not going through a pdf, that would also do the job.)
-
Varad More over 3 years
This is working for the sample file, but I'm getting this error while working on a certificate:
data = object.getData() AttributeError: 'NameObject' object has no attribute 'getData'
Any resolution to this?
-
mattf over 3 years
Same issue!
AttributeError: 'NameObject' object has no attribute 'getData'
-
Dmytro over 3 years
This means that the PDF content stream structure is different. Could you please provide a link to the sample PDF that you're dealing with? Then I could update the answer.
-
swisswiss over 3 years
For example, this PDF downloaded from Google Docs: we.tl / t-pYzmky0R5B
-
Hafiz Siddiq over 3 years
@Dmytro Any solution, please? I am also getting the same issue:
AttributeError: 'NameObject' object has no attribute 'getData'
-
Dmytro about 3 years
@swisswiss, sorry for not answering earlier. Could you please share the PDF doc again? The link has expired.
-
mrgou about 3 years
@Dmytro Looks like any basic PDF file generated by GhostScript triggers the error: gofile.io/d/qxJKOK
-
Dmytro about 3 years
@mrgou I updated the code example to handle the data streams. Not sure if it works with all kinds of PDFs, but at least it processes the PDF you provided. The idea is basically to find either DecodedStreamObject or EncodedStreamObject in the PDF pages and apply the replacement code to their contents.
-
alias51 over 2 years
Results in
PyPDF2.utils.PdfReadError: Creating EncodedStreamObject is not currently supported
-
alias51 over 2 years
I have this problem, but it seems that data.decode('utf-8') does not decode to a text format?
-
alias51 over 2 years
This solution doesn't work for PDFs created from Word. How do you create a simple PDF from a Word doc that would be compliant?
-
Vladimir Simoes da Luz Junior over 2 years
It is possible that your PDF does not use utf-8 encoding. You might want to test whether data.decode("ascii") works for you. By the way, if you live in Latin America (as I do), you may want to try data.decode("iso-8859-1"). If this doesn't help, you can try to brute-force the decoding with data.decode("utf-8", "ignore").
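The fallback chain suggested here can be wrapped in a small helper. This is just a sketch; decode_stream is a hypothetical name, and the default order of encodings is an assumption (ascii is left out of the defaults because anything that decodes as ascii also decodes as utf-8).

```python
def decode_stream(data, encodings=("utf-8", "iso-8859-1")):
    """Try each encoding in turn and return (text, encoding_used)."""
    for encoding in encodings:
        try:
            return data.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Last resort: drop bytes that are not valid utf-8.
    return data.decode("utf-8", "ignore"), "utf-8 (errors ignored)"


text, used = decode_stream(b"caf\xe9")
print(text, used)
```

Keep the encoding that worked and use the same one when you encode the replaced text again, otherwise the stream may be written back differently from how it was read. Be aware that iso-8859-1 accepts any byte sequence, so it acts as a catch-all: getting a result from it does not guarantee the text is readable. If the stream is still compressed or the font uses its own glyph codes, no codec will produce the original text.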
-
alias51 over 2 years
I ran a for loop over every known standard and it didn't work. I can only assume that Acrobat encodes PDFs differently when Save As from Word is used?
-
Vladimir Simoes da Luz Junior over 2 years
@alias51, have you tried print(object.getData()) inside process_data()? If that does not give you the text content of the PDF, it is possible that your file has been password encrypted by Acrobat. You can get some reference on password decrypting here: github.com/mstamy2/PyPDF2/issues/378 ; github.com/atlanhq/camelot/issues/325 ; github.com/mstamy2/PyPDF2/issues/378#issuecomment-689585779