Scrape password-protected website in R


Solution 1

I don't have an account to test with, but maybe this will work:

library(httr)
library(XML)

handle <- handle("http://subscribers.footballguys.com") 
path   <- "amember/login.php"

# fields found in the login form.
login <- list(
  amember_login = "username"
 ,amember_pass  = "password"
 ,amember_redirect_url = 
   "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)

response <- POST(handle = handle, path = path, body = login)

The response object may now hold what you need. If it does not, you can query the page of interest directly after the login request; I am not sure the redirect will work, although it is a field in the web form. The handle can be re-used for subsequent requests, since it keeps the cookies set during login. I can't test this, but the approach works for me in many situations.
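Re-using the handle for a follow-up request could look roughly like the sketch below. The helper name fetch_projections and the variable fbg are made up for illustration (fbg is used instead of handle so that the httr function of the same name is not shadowed), and the code is untested against the real site.

```r
library(httr)

# One handle per host; cookies set by the login POST stay attached to it.
fbg <- handle("http://subscribers.footballguys.com")

# Hypothetical helper (not part of the original answer): request the
# projections page through a handle that has already been used to log in.
fetch_projections <- function(h, projector = 2) {
  resp <- GET(handle = h,
              path   = "myfbg/myviewprojections.php",
              query  = list(projector = projector))
  stop_for_status(resp)  # turn HTTP 4xx/5xx responses into R errors
  resp
}

# Usage, after the login POST has been sent with the same handle:
# page <- fetch_projections(fbg)
```

A 200 status alone does not prove the login succeeded (the question reports exactly that case), so it is worth checking that the returned page actually contains the full table.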

You can then extract the table with readHTMLTable from the XML package:

> readHTMLTable(content(response))[[1]][1:5,]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60
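The names in the login list come from the input fields of the site's login form. One way to discover such names is to parse the login page and list the name attribute of each input. The sketch below runs on a made-up form whose markup is only an assumption modelled on the field names and ids used in the answers; the real page may differ.

```r
library(XML)

# Hypothetical login form markup, for illustration only.
login_html <- '
<html><body>
<form class="am-login-form" method="post" action="/amember/login.php">
  <input type="text" id="login" name="amember_login">
  <input type="password" id="pass" name="amember_pass">
  <input type="hidden" name="amember_redirect_url" value="">
  <input type="submit" value="Login">
</form>
</body></html>'

doc <- htmlParse(login_html, asText = TRUE)

# Names of all form inputs that carry a name attribute; these become the
# names of the list passed as the POST body.
field_names <- xpathSApply(doc, "//form//input[@name]", xmlGetAttr, "name")

# The form's action attribute gives the path to POST to.
form_action <- xpathSApply(doc, "//form", xmlGetAttr, "action")

print(field_names)
print(form_action)
```

On a live site the same XPath queries can be run on the downloaded login page, or the form can be inspected with the browser's developer tools.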

Solution 2

You can use RSelenium. I used the development version because it can run phantomjs directly, without a Selenium Server.

# Install RSelenium if required. You will need phantomjs in your path or follow instructions
# in package vignettes
# devtools::install_github("ropensci/RSelenium")
# login first
appURL <- 'http://subscribers.footballguys.com/amember/login.php'
library(RSelenium)
library(XML) # provides readHTMLTable(), used below
pJS <- phantom() # start phantomjs
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "login")$sendKeysToElement(list("myusername"))
remDr$findElement("id", "pass")$sendKeysToElement(list("mypass"))
remDr$findElement("css", ".am-login-form input[type='submit']")$clickElement()

appURL <- 'http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2'
remDr$navigate(appURL)
tableElem <- remDr$findElement("css", "table.datamedium")
res <- readHTMLTable(header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]])
> res[[1]][1:5, ]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60
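The parsing step itself does not need a browser: readHTMLTable works on the table's HTML once it is available as a string. The sketch below tries this offline on a small hand-written table (three columns and two rows taken from the output above; the markup is an assumption, not the site's actual HTML) and converts a numeric column, since the parsed values are returned as text.

```r
library(XML)

# Hand-written stand-in for the outerHTML of the projections table.
table_html <- '<table class="datamedium">
<tr><th>Rank</th><th>Name</th><th>FantPt</th></tr>
<tr><td>1</td><td>Peyton Manning</td><td>407.15</td></tr>
<tr><td>2</td><td>Drew Brees</td><td>385.35</td></tr>
</table>'

# Parse the string explicitly as HTML text, then read every table in it.
doc    <- htmlParse(table_html, asText = TRUE)
tables <- readHTMLTable(doc, header = TRUE, stringsAsFactors = FALSE)

# readHTMLTable returns a list of data frames; take the first one.
projections <- tables[[1]]

# Columns come back as character; convert the ones needed as numbers.
projections$FantPt <- as.numeric(projections$FantPt)

print(projections)
```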

Finally, when you are finished, close phantomjs:

pJS$stop()

If you want to use a traditional browser such as Firefox (for example, to stick with the RSelenium version on CRAN), you would use:

RSelenium::startServer()
remDr <- remoteDriver()
........
........
remDr$closeServer()

in place of the related phantomjs calls.

Author: itpetersen

Updated on July 09, 2022

Comments

  • itpetersen
    itpetersen almost 2 years

    I'm trying to scrape data from a password-protected website in R. Reading around, it seems that the httr and RCurl packages are the best options for scraping with password authentication (I've also looked into the XML package).

    The website I'm trying to scrape is below (you need a free account in order to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2

    Here are my two attempts (replacing "username" with my username and "password" with my password):

    #This returns "Status: 200" without the data from the page:
    library(httr)
    GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password"))
    
    #This returns the non-password protected preview (i.e., not the full page):
    library(XML)
    library(RCurl)
    readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))
    

    I have looked at other relevant posts (links below), but can't figure out how to apply their answers to my case.

    How to use R to download a zipped file from a SSL page that requires cookies

    How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

    Reading information from a password protected site

    R - RCurl scrape data from a password-protected site

    http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold

  • jdharrison
    jdharrison almost 10 years
    This works for me. I have edited with content output
  • itpetersen
    itpetersen almost 10 years
    I tested both answers and they both work great. I selected this one for its simplicity.
  • Stefan
    Stefan almost 10 years
    Perhaps RSelenium will come in handy for other sites; websites are not always as straightforward as this one. I am going to keep phantomjs in mind.
  • Steve G. Jones
    Steve G. Jones over 7 years
    Thanks, this is a very versatile approach to solve this.
  • runr
    runr about 7 years
    While this is a very useful answer overall, note that the package has since advanced and now allows more convenient browsing through Chrome, Firefox or IE without phantomjs, for example using rD <- RSelenium::rsDriver(port = 5555L, browser = 'firefox'); remDr <- rD[["client"]] and then following the original answer.
  • jdharrison
    jdharrison about 7 years
    @Nutle Good points. The phantom function is also deprecated in favour of wdman::phantomjs, so maybe this answer needs updating.
  • Cyrus Mohammadian
    Cyrus Mohammadian about 5 years
    How was the information regarding the login form found?