Search and replace placeholder text in PDF with Python

Solution 1

There is no direct way to do this that will work reliably. PDFs are not like HTML: they specify the positioning of text character-by-character. They may not even include the whole font used to render the text, just the characters needed to render the specific text in the document. No library I've found will do nice things like re-wrap paragraphs after updating the text. PDFs are for the most part a display-only format, so you'll be much better off using a tool that turns markup into a PDF than updating the PDF in-place.
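To see why a plain search-and-replace tends to fail, here is a minimal sketch built on a hand-written content-stream fragment (hypothetical, not taken from a real file). The placeholder is drawn by a single `TJ` operator whose array mixes short strings with kerning offsets, so the literal bytes `<first_name>` never appear in the stream. The helper names `naive_contains` and `tj_text` are made up for illustration, and the toy parser ignores escaped parentheses, hex strings, font encodings, and stream compression, all of which real PDFs use.

```python
import re

# Hypothetical content-stream fragment: one TJ operator draws "<first_name>"
# as three string chunks separated by kerning adjustments.
CONTENT_STREAM = b"BT /F1 12 Tf 72 720 Td [(<fir) -15 (st_na) 20 (me>)] TJ ET"


def naive_contains(stream: bytes, needle: str) -> bool:
    """Byte-level search, as a plain find/replace would do."""
    return needle.encode("latin-1") in stream


def tj_text(stream: bytes) -> str:
    """Join the string chunks of every TJ array, dropping the kerning numbers.

    Toy parser: no escaped parentheses, hex strings, or font encodings.
    """
    chunks = []
    for array in re.findall(rb"\[(.*?)\]\s*TJ", stream, flags=re.DOTALL):
        chunks.extend(re.findall(rb"\((.*?)\)", array))
    return b"".join(chunks).decode("latin-1")


found_naively = naive_contains(CONTENT_STREAM, "<first_name>")
visible_text = tj_text(CONTENT_STREAM)
print(found_naively)
print(visible_text)
```

Even when the text can be reassembled like this, writing a longer or shorter replacement back would still leave the surrounding positioning untouched, which is the re-wrapping problem described above.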

If that's not an option, you can create a PDF form in something like Acrobat, then use a PDF manipulation library like iText (AGPL) or PDFBox, which has a nice Clojure wrapper called pdfboxing that can handle some of that.

In my experience, Python's support for writing to PDFs is pretty limited. Java has, by far, the best language support. Also, you get what you pay for, so it would probably be worth paying for an iText license if you're using this for commercial purposes. I've had pretty good results writing Python wrappers around PDF-manipulation CLI tools like pdfboxing and Ghostscript. That will probably be much easier for your use case than trying to shoehorn this into Python's PDF ecosystem.
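As a rough sketch of the "generate from markup, wrap a CLI tool" approach, the example below fills `<first_name>`-style placeholders in a small PostScript template and builds a Ghostscript command that renders it to PDF. The template text, the function names, and the file names are all assumptions made up for illustration. It also assumes a `gs` executable is on the PATH and that the values are plain ASCII; anything beyond that (fonts, layout, non-ASCII names) would need more work.

```python
import re
import subprocess
from pathlib import Path

# Hypothetical PostScript template; <first_name> and <last_name> are the
# placeholders to fill in.
TEMPLATE_PS = """%!PS
/Helvetica findfont 14 scalefont setfont
72 720 moveto
(Dear <first_name> <last_name>,) show
showpage
"""


def ps_escape(value: str) -> str:
    """Escape backslashes and parentheses for a PostScript string literal."""
    return re.sub(r"([\\()])", r"\\\1", value)


def fill_template(template: str, values: dict) -> str:
    """Replace each <name> placeholder that has an entry in values."""
    def substitute(match):
        key = match.group(1)
        return ps_escape(values[key]) if key in values else match.group(0)
    return re.sub(r"<(\w+)>", substitute, template)


def ghostscript_command(ps_path: str, pdf_path: str) -> list:
    """Build the Ghostscript call that renders a PostScript file to PDF."""
    return ["gs", "-q", "-sDEVICE=pdfwrite", "-o", pdf_path, ps_path]


def render_pdf(values: dict, pdf_path: str) -> None:
    """Write the filled template next to the output and run Ghostscript."""
    ps_path = str(Path(pdf_path).with_suffix(".ps"))
    # The sketch assumes ASCII-only values.
    Path(ps_path).write_text(fill_template(TEMPLATE_PS, values), encoding="ascii")
    subprocess.run(ghostscript_command(ps_path, pdf_path), check=True)


filled = fill_template(TEMPLATE_PS, {"first_name": "Ada", "last_name": "O(1)"})
command = ghostscript_command("letter.ps", "letter.pdf")
print(filled)
print(command)
```

Calling `render_pdf({"first_name": "Ada", "last_name": "Lovelace"}, "letter.pdf")` would then produce one customized copy per call. The point of the design is that the placeholders are replaced before any PDF exists, so the renderer handles positioning and fonts.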

Solution 2

There is no definitive solution, but I found two approaches that work most of the time.

In Python, https://github.com/JoshData/pdf-redactor gives good results. Here is the example code:

import re
import pdf_redactor

options = pdf_redactor.RedactorOptions()

# Each content filter is a (compiled regex, replacement function) pair;
# the filters are applied in order.
options.content_filters = [
        # First replace a specific name with X's.
        (
                re.compile(u"Tom Xavier"),
                lambda m : "XXXXXXX"
        ),

        # Then redact things that look like social security numbers (SSNs).
        # See https://github.com/opendata/SSN-Redaction for why this regex is complicated.
        (
                re.compile(r"(?<!\d)(?!666|000|9\d{2})([OoIli0-9]{3})([\s-]?)(?!00)([OoIli0-9]{2})\2(?!0{4})([OoIli0-9]{4})(?!\d)"),
                lambda m : "XXX-XX-XXXX"
        ),
]

# Perform the redaction using PDF on standard input and writing to standard output.
pdf_redactor.redactor(options)

Full Example can be found here
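The same (pattern, function) shape could in principle be used for the placeholders from the question. The sketch below only mimics that mechanism on a plain string with the standard `re` module; it is not pdf-redactor's implementation. The `apply_filters` helper and the sample values are made up for illustration. Inside a real PDF, whether the replacement text renders correctly still depends on the fonts embedded in the document, as Solution 1 points out.

```python
import re

# (pattern, replacement function) pairs shaped like the content_filters above.
# The <first_name>/<last_name> values are made up for illustration.
values = {"first_name": "Ada", "last_name": "Lovelace"}
filters = [
    (re.compile(r"<(first_name|last_name)>"), lambda m: values[m.group(1)]),
    (re.compile(r"Tom Xavier"), lambda m: "XXXXXXX"),
]


def apply_filters(text, filters):
    """Apply each (pattern, function) pair in order, like chained re.sub calls."""
    for pattern, replacement in filters:
        text = pattern.sub(replacement, text)
    return text


result = apply_filters("Dear <first_name> <last_name>, from Tom Xavier", filters)
print(result)
```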

In Ruby, https://github.com/gettalong/hexapdf works for blacking out text. Example code:

require 'hexapdf'

class ShowTextProcessor < HexaPDF::Content::Processor

  def initialize(page, to_hide_arr)
    super()
    @canvas = page.canvas(type: :overlay)
    @to_hide_arr = to_hide_arr
  end

  def show_text(str)
    boxes = decode_text_with_positioning(str)
    return if boxes.string.empty?
    if @to_hide_arr.include? boxes.string
        # The boxes are filled, not stroked, so set the fill color to black.
        @canvas.fill_color(0, 0, 0)

        boxes.each do |box|
          x, y = *box.lower_left
          tx, ty = *box.upper_right
          @canvas.rectangle(x, y, tx - x, ty - y).fill
        end
    end

  end
  alias :show_text_with_positioning :show_text

end

file_name = ARGV[0]
strings_to_black = ARGV[1].split("|")

doc = HexaPDF::Document.open(file_name)
puts "Blacken strings [#{strings_to_black}], inside [#{file_name}]."
doc.pages.each.with_index do |page, index|
  processor = ShowTextProcessor.new(page, strings_to_black)
  page.process_contents(processor)
end

new_file_name = "#{file_name.split('.').first}_updated.pdf"
doc.write(new_file_name, optimize: true)

puts "Writing updated file [#{new_file_name}]."

Note that this only draws black boxes over the text: the underlying text is still in the file and becomes visible when it is selected.

Solution 3

As another solution, you may try Aspose.PDF Cloud SDK for Python, which can replace text in a PDF document.

First, install the Aspose.PDF Cloud SDK for Python:

pip install asposepdfcloud

The sample code below uploads a PDF file to your cloud storage and replaces multiple strings in the document:

import os 
import asposepdfcloud 
from asposepdfcloud.apis.pdf_api import PdfApi 
 
# Get App key and App SID from https://aspose.cloud 
pdf_api_client = asposepdfcloud.api_client.ApiClient( 
    app_key='xxxxxxxxxxxxxxxxxxxxxxxxxxxxxx', 
    app_sid='xxxxx-xxxx-xxxx-xxxx-xxxxxxxx') 
 
pdf_api = PdfApi(pdf_api_client) 
filename = '02_pages.pdf' 
remote_name = '02_pages.pdf' 
 
#upload PDF file to storage 
pdf_api.upload_file(remote_name,filename) 
 
#Replace Text 
text_replace1 = asposepdfcloud.models.TextReplace(old_value='origami',new_value='aspose',regex='true') 
text_replace2 = asposepdfcloud.models.TextReplace(old_value='candy',new_value='biscuit',regex='true') 
text_replace_list = asposepdfcloud.models.TextReplaceListRequest(text_replaces=[text_replace1,text_replace2]) 
 
response = pdf_api.post_document_text_replace(remote_name, text_replace_list) 
print(response)

source and output PDF screenshots

I'm a developer evangelist at Aspose.

Author by

uncrase

Updated on June 08, 2022

Comments

  • uncrase
    uncrase almost 2 years

    I need to generate a customized PDF copy of a template document. The easiest way, I thought, was to create a source PDF that has some placeholder text where customization needs to happen, i.e. <first_name> and <last_name>, and then replace these with the correct values.

    I've searched high and low, but is there really no way of taking the source template PDF, replacing the placeholders with actual values, and writing the result to a new PDF?

    I looked at PyPDF2 and ReportLab, but neither seems to be able to do so. Any suggestions? Most of my searches lead to using a Perl app, CAM::PDF, but I'd prefer to keep it all in Python.

  • Amey P Naik
    Amey P Naik about 5 years
    Did you try pdf-redactor in Python? I get issues.
  • Torsten
    Torsten over 4 years
    The examples on the hexapdf site are good, and there is a similar one. As far as I understand (not having done this yet), one could also write text into the overlay on a solid white background, thus covering the old text, and solve the problem that way.
  • Farhan Hai Khan
    Farhan Hai Khan over 3 years
    It shows the following message when I logged in : Oops! Something went wrong. There was an error logging into the external provider. The error message is: access_denied Request Id: 904f9113-d60b-4a50-9645-7284f257a0fe
  • Tilal Ahmad
    Tilal Ahmad over 3 years
    @FarhanKhan Please share some more details about the issue you are facing i.e. link or code throwing the error.
  • alias51
    alias51 over 2 years
    @TilalAhmad Do you have a Python on-prem solution for sensitive documents?
  • alias51
    alias51 over 2 years
    How would you return the modified PDF as an object under this example?
  • Mohamed Elhariry
    Mohamed Elhariry about 2 years
    @alias51
    from shutil import copyfile
    response_download = pdf_api.download_file(remote_file)
    copyfile(response_download, remote_file)