How can I download a PDF file from a URL where the PDF is embedded into the HTML?

You can download the PDF using the requests and BeautifulSoup libraries. In the code below, replace /Users/../aaa.pdf with the full path where the document should be saved:

import requests
from bs4 import BeautifulSoup

url = 'http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg='

# Load the page once to read the hidden ASP.NET form fields.
response = requests.post(url)
page = BeautifulSoup(response.text, "html.parser")

VIEWSTATE = page.select_one("#__VIEWSTATE").attrs["value"]
VIEWSTATEGENERATOR = page.select_one("#__VIEWSTATEGENERATOR").attrs["value"]
EVENTVALIDATION = page.select_one("#__EVENTVALIDATION").attrs["value"]
btnDocument = page.select_one("[name=btnDocument]").attrs["value"]

# Submit the form again as if the 'View Document' button had been clicked.
data = {
    '__VIEWSTATE': VIEWSTATE,
    '__VIEWSTATEGENERATOR': VIEWSTATEGENERATOR,
    '__EVENTVALIDATION': EVENTVALIDATION,
    'btnDocument': btnDocument
}
response = requests.post(url, data=data)
response.raise_for_status()

# The response body is the PDF itself; write it out as raw bytes.
with open('/Users/../aaa.pdf', 'wb') as f:
    f.write(response.content)
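
The code above picks out each hidden field by name. As a rough sketch of the same idea using only the standard library, the helper below collects every hidden input from a page so the whole set can be posted back, and a second helper checks whether the returned bytes look like a PDF before saving them. The names hidden_fields and looks_like_pdf and the sample HTML are made up for illustration; they are not part of requests or BeautifulSoup, and the real page may contain fields not shown here.

```python
from html.parser import HTMLParser


class HiddenFieldCollector(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        is_hidden = (attr.get("type") or "").lower() == "hidden"
        if is_hidden and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def hidden_fields(html):
    """Return a dict of all hidden form fields found in an HTML string."""
    collector = HiddenFieldCollector()
    collector.feed(html)
    return collector.fields


def looks_like_pdf(content):
    """Return True if the bytes start with the usual PDF signature."""
    return content[:5] == b"%PDF-"


# Hypothetical sample page, standing in for response.text.
sample = '''
<form>
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="abc" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="xyz" />
  <input type="submit" name="btnDocument" value="View Document" />
</form>
'''
fields = hidden_fields(sample)
print(fields)
```

With the real page, the button's name and value would still need to be added to the dict before posting, and looks_like_pdf(response.content) could be checked before writing the file, since an error page would come back as HTML rather than PDF bytes.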
Author: Rick Colgan

I have been playing with computers since 1978. I taught IT classes at Metropolitan Community College and Midland University until 2018. I earned my MS in Software Development in 2019. Now I work as a software developer with PHP. I like Python/Django better, but the legacy PHP/MySQL system I'm working on is a lot of fun.

Updated on June 04, 2022

Comments

  • Rick Colgan, almost 2 years ago

    What I'm trying to do: I want to scrape a web page to get the amount of a financial transaction from a PDF file that is loaded with JavaScript from a website. Example website: http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=

    When I click the 'View Document' button, the PDF file loads into my browser's window (I'm using Google Chrome). I can right-click on the PDF and save it to my computer, but I want to automate that process by having Selenium (or a similar package) download the file so that I can then run OCR on it.

    If I can get it saved, I will be able to do the OCR part (I hope). I just can't get the file saved.

    From here, I found and modified this code:

    def download_pdf(lnk):
    
        from selenium import webdriver
        from time import sleep
    
        options = webdriver.ChromeOptions()
    
        download_folder = "C:\\Users\\rickc\\Documents\\Scraper2\\screenshots\\"
    
        profile = {"plugins.plugins_list": [{"enabled": False,
                                             "name": "Chrome PDF Viewer"}],
                   "download.default_directory": download_folder,
                   "download.extensions_to_open": ""}
    
        options.add_experimental_option("prefs", profile)
    
        print("Downloading file from link: {}".format(lnk))
    
        driver = webdriver.Chrome(chrome_options = options)
        driver.get(lnk)
    
        filename = lnk.split("/")[3].split(".aspx")[0]+".pdf"
        print("File: {}".format(filename))
    
        print("Status: Download Complete.")
        print("Folder: {}".format(download_folder))
    
        driver.close()
    
    download_pdf('http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=')
    

    But it isn't working. My old college professor once said, "If you've spent more than two hours on the problem and haven't made headway, it's time to look for help elsewhere." So I'm looking for help.

    Other info: The link above will take you to a web page, but you can't access the PDF document until you click on the 'View Document' button. I've tried using Selenium's webdriver.find_element_by_id('btnDocument').click() to make things happen, but it just loads the page and doesn't do anything with it.

    • Yinda Yin, about 5 years ago
      It's worse than you think. That PDF is essentially a container for an image. To get any meaningful information from it, you're going to have to OCR it.
    • Sers, about 5 years ago
      You can find out how to download a PDF file here
    • Rick Colgan, about 5 years ago
      @Sers -- based on the link you provided, I modified my code to look like this: profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], "plugins.always_open_pdf_externally": True, "download.default_directory": download_folder, "download.extensions_to_open": ""} but that's not doing the trick either. Am I just missing something?
  • Rick Colgan, about 5 years ago
    That was the magic! Thank you, @Sers! I'm not sure I fully understand the code you have there, but it works for me, so I can now open the PDF and use another module to scan for the dollar sign and get the amount I need.
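
For the last step mentioned above (scanning for the dollar sign to get the amount), here is a minimal sketch, assuming the OCR step has already produced plain text. The function name find_dollar_amounts, the pattern, and the sample sentence are all illustrative assumptions; real OCR output is often noisier and may need a looser pattern.

```python
import re
from decimal import Decimal

# A dollar sign, an optional space, digits with optional thousands
# separators, and an optional decimal part of one or two digits.
AMOUNT_RE = re.compile(r"\$\s?(\d[\d,]*(?:\.\d{1,2})?)")


def find_dollar_amounts(text):
    """Return every dollar amount found in the text as a Decimal."""
    return [Decimal(m.replace(",", "")) for m in AMOUNT_RE.findall(text)]


# Hypothetical OCR output, not taken from the actual document.
sample_text = "Consideration: $125,000.00 and other fees of $ 45.50."
amounts = find_dollar_amounts(sample_text)
print(amounts)
```

Decimal is used instead of float so that the amounts keep their exact cents when they are compared or summed later.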