How can I download a PDF file from a URL where the PDF is embedded into the HTML?

python-3.x selenium pdf web-scraping

18,637

You can download the PDF using the requests and BeautifulSoup libraries. In the code below, replace /Users/../aaa.pdf with the full path where the document should be saved:

```
import requests
from bs4 import BeautifulSoup

url = 'http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg='

# First request: fetch the page so the hidden ASP.NET form fields can be read.
response = requests.post(url)
page = BeautifulSoup(response.text, "html.parser")

# ASP.NET WebForms expects these hidden values to be sent back with the form.
VIEWSTATE = page.select_one("#__VIEWSTATE").attrs["value"]
VIEWSTATEGENERATOR = page.select_one("#__VIEWSTATEGENERATOR").attrs["value"]
EVENTVALIDATION = page.select_one("#__EVENTVALIDATION").attrs["value"]
# Sending the button's own name/value mimics clicking 'View Document'.
btnDocument = page.select_one("[name=btnDocument]").attrs["value"]

data = {
  '__VIEWSTATE': VIEWSTATE,
  '__VIEWSTATEGENERATOR': VIEWSTATEGENERATOR,
  '__EVENTVALIDATION': EVENTVALIDATION,
  'btnDocument': btnDocument
}

# Second request: submit the form and write the response body to disk as the PDF.
response = requests.post(url, data=data)
with open('/Users/../aaa.pdf', 'wb') as f:
    f.write(response.content)
```
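
If it helps to see what the form post is doing, here is a rough standard-library sketch of the same idea: gather every hidden input plus the clicked button into the dict that gets posted, and check that what came back is really a PDF before saving it. The names `collect_form_fields` and `looks_like_pdf` are made up for this illustration and are not part of requests or BeautifulSoup. The `%PDF-` check assumes the server returns a well-formed PDF that begins with the usual header.

```python
from html.parser import HTMLParser


class _FormFieldParser(HTMLParser):
    """Records the type and value of every named <input> tag it sees."""

    def __init__(self):
        super().__init__()
        self.fields = {}  # name -> (type, value)

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        name = attr.get("name")
        if name:
            input_type = (attr.get("type") or "text").lower()
            self.fields[name] = (input_type, attr.get("value") or "")


def collect_form_fields(html, button_name):
    """Return the hidden inputs plus the named button as a dict to post back.

    Hypothetical helper: it mirrors what a browser submits when that
    button is clicked on a simple form.
    """
    parser = _FormFieldParser()
    parser.feed(html)
    data = {
        name: value
        for name, (input_type, value) in parser.fields.items()
        if input_type == "hidden"
    }
    if button_name in parser.fields:
        data[button_name] = parser.fields[button_name][1]
    return data


def looks_like_pdf(content):
    """Return True if the bytes start with the standard PDF header."""
    return content.startswith(b"%PDF-")
```

With these, `data = collect_form_fields(response.text, "btnDocument")` could stand in for the four `select_one` lines, and `looks_like_pdf(response.content)` could guard the `open(...)` call so an HTML error page is not saved under a .pdf name.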


Authored by

Rick Colgan

I have been playing with computers since 1978. I taught IT classes at Metropolitan Community College and Midland University until 2018. I earned my MS in Software Development in 2019. Now I work as a software developer with PHP. I like Python/Django better, but the legacy PHP/MySQL system I'm working on is a lot of fun.

Updated on June 04, 2022

Comments

Rick Colgan almost 2 years
What I'm trying to do: I want to scrape a web page to get the amount of a financial transaction from a PDF file that is loaded with JavaScript from a website. Example website: http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=

When I click the 'View Document' button, the PDF file loads in my browser window (I'm using Google Chrome). I can right-click on the PDF and save it to my computer, but I want to automate that process by having Selenium (or a similar package) download the file so that I can then run OCR on it.

If I can get it saved, I will be able to do the OCR part (I hope). I just can't get the file saved.

From here, I found and modified this code:
```
def download_pdf(lnk):

    from selenium import webdriver
    from time import sleep

    options = webdriver.ChromeOptions()

    download_folder = "C:\\Users\\rickc\\Documents\\Scraper2\\screenshots\\"

    profile = {"plugins.plugins_list": [{"enabled": False,
                                         "name": "Chrome PDF Viewer"}],
               "download.default_directory": download_folder,
               "download.extensions_to_open": ""}

    options.add_experimental_option("prefs", profile)

    print("Downloading file from link: {}".format(lnk))

    driver = webdriver.Chrome(chrome_options = options)
    driver.get(lnk)

    filename = lnk.split("/")[3].split(".aspx")[0]+".pdf"
    print("File: {}".format(filename))

    print("Status: Download Complete.")
    print("Folder: {}".format(download_folder))

    driver.close()

download_pdf('http://www.nebraskadeedsonline.us/document.aspx?g5savSPtTDnumMn1bRBWoKqN6Gu65tBhDE9%2fVs5YdPg=')
```
But it isn't working. My old college professor once said, "If you've spent more than two hours on the problem and haven't made headway, it's time to look for help elsewhere." So I'm looking for help.

Other info: The link above will take you to a web page, but you can't access the PDF document until you click the 'View Document' button. I've tried using Selenium's driver.find_element_by_id('btnDocument').click() to trigger it, but that just loads the page and doesn't download anything.
- Yinda Yin about 5 years
  
  It's worse than you think. That PDF is essentially a container for an image. To get any meaningful information from it, you're going to have to OCR it.
- Sers about 5 years
  
  You can find out how to download a PDF file here
- Rick Colgan about 5 years
  
  @Sers -- based on the link you provided, I modified my code to look like this: `profile = {"plugins.plugins_list": [{"enabled": False, "name": "Chrome PDF Viewer"}], "plugins.always_open_pdf_externally": True, "download.default_directory": download_folder, "download.extensions_to_open": ""}` but that's not doing the trick either. Am I just missing something?
Rick Colgan about 5 years

That was the magic! Thank you @Sers ! I'm not sure I fully understand the code you have there, but it works for me so that I can open the PDF now and use another module to scan for the dollar sign and get the amount I need.
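
For the "scan for the dollar sign" step, a minimal sketch might look like the following. It assumes the OCR step has already produced plain text, and `find_dollar_amounts` is a hypothetical helper name, not something from the thread. OCR output can be noisy (a `$` read as `S`, for example), so treat this as a starting point only.

```python
import re
from decimal import Decimal

# Matches amounts such as "$1,234.56", "$ 1234", or "$99.5".
_AMOUNT_RE = re.compile(r"\$\s?(\d{1,3}(?:,\d{3})+|\d+)(\.\d{1,2})?")


def find_dollar_amounts(text):
    """Return every dollar amount found in text, in order, as Decimal values."""
    amounts = []
    for whole, cents in _AMOUNT_RE.findall(text):
        # Strip thousands separators before converting to Decimal.
        amounts.append(Decimal(whole.replace(",", "") + cents))
    return amounts
```

Calling `find_dollar_amounts(ocr_text)` on the extracted text returns a list, so if a deed mentions several figures you would still need to decide which one is the transaction amount.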