Scrape password-protected website in R

xml r web-scraping rcurl httr

26,507

Solution 1

I don't have an account to test with, but maybe this will work:

library(httr)
library(XML)

handle <- handle("http://subscribers.footballguys.com") 
path   <- "amember/login.php"

# fields found in the login form.
login <- list(
  amember_login = "username"
 ,amember_pass  = "password"
 ,amember_redirect_url = 
   "http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2"
)

response <- POST(handle = handle, path = path, body = login)

Now, the response object might hold what you need (or maybe you can directly query the page of interest after the login request; I am not sure the redirect will work, but it is a field in the web form), and handle might be re-used for subsequent requests. Can't test it; but this works for me in many situations.

You can output the table using XML

> readHTMLTable(content(response))[[1]][1:5,]
  Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60

Solution 2

You can use RSelenium. I have used the dev version as you can run phantomjs without a Selenium Server.

# Install RSelenium if required. You will need phantomjs in your path or follow instructions
# in package vignettes
# devtools::install_github("ropensci/RSelenium")
# login first
appURL <- 'http://subscribers.footballguys.com/amember/login.php'
library(RSelenium)
pJS <- phantom() # start phantomjs
remDr <- remoteDriver(browserName = "phantomjs")
remDr$open()
remDr$navigate(appURL)
remDr$findElement("id", "login")$sendKeysToElement(list("myusername"))
remDr$findElement("id", "pass")$sendKeysToElement(list("mypass"))
remDr$findElement("css", ".am-login-form input[type='submit']")$clickElement()

appURL <- 'http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2'
remDr$navigate(appURL)
tableElem<- remDr$findElement("css", "table.datamedium")
res <- readHTMLTable(header = TRUE, tableElem$getElementAttribute("outerHTML")[[1]])
> res[[1]][1:5, ]
Rank             Name Tm/Bye Age Exp Cmp Att  Cm%  PYd Y/Att PTD Int Rsh  Yd TD FantPt
1    1   Peyton Manning  DEN/4  38  17 415 620 66.9 4929  7.95  43  12  24   7  0 407.15
2    2       Drew Brees   NO/6  35  14 404 615 65.7 4859  7.90  37  16  22  44  1 385.35
3    3    Aaron Rodgers   GB/9  31  10 364 560 65.0 4446  7.94  33  13  52 224  3 381.70
4    4      Andrew Luck IND/10  25   3 366 610 60.0 4423  7.25  27  13  62 338  2 361.95
5    5 Matthew Stafford  DET/9  26   6 377 643 58.6 4668  7.26  32  19  34 102  1 358.60

Finally when you are finished close phantomjs

pJS$stop()

If you want to use a traditional browser like firefox for example (if you wanted to stick to the version on CRAN) you would use:

RSelenium::startServer()
remDr <- remoteDriver()
........
........
remDr$closeServer()

in place of the related phantomjs calls.

26,507

itpetersen

Updated on July 09, 2022

Comments

itpetersen almost 2 years
I'm trying to scrape data from a password-protected website in R. Reading around, it seems that the httr and RCurl packages are the best options for scraping with password authentication (I've also looked into the XML package).

The website I'm trying to scrape is below (you need a free account in order to access the full page): http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2

Here are my two attempts (replacing "username" with my username and "password" with my password):
```
#This returns "Status: 200" without the data from the page:
library(httr)
GET("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", authenticate("username", "password"))

#This returns the non-password protected preview (i.e., not the full page):
library(XML)
library(RCurl)
readHTMLTable(getURL("http://subscribers.footballguys.com/myfbg/myviewprojections.php?projector=2", userpwd = "username:password"))
```
I have looked at other relevant posts (links below), but can't figure out how to apply their answers to my case.

How to use R to download a zipped file from a SSL page that requires cookies

How to webscrape secured pages in R (https links) (using readHTMLTable from XML package)?

Reading information from a password protected site

R - RCurl scrape data from a password-protected site

http://www.inside-r.org/questions/how-scrape-data-password-protected-https-website-using-r-hold
jdharrison almost 10 years

This works for me. I have edited with content output
itpetersen almost 10 years

I tested both answers and they both work great. I selected this one for its simplicity.
Stefan almost 10 years

Perhaps for other sites RSelenium might come in handy; the websites are not always as straight-forward as this one.. I am going to keep the phantomjs in mind.
Steve G. Jones over 7 years

Thanks, this is a very versatile approach to solve this.
runr about 7 years

While overall this is a very useful answer, it can be noted that lately the package advanced a bit, allowing for more convenient browsing through chrome, firefox or IE without the need of phantomjs, for example, using rD <- RSelenium::rsDriver(port = 5555L, 'firefox'); remDr <- rD[["client"]] and following the original answer afterwards.
jdharrison about 7 years

@Nutle good points and the phantom function is deprecated in favour of wdman::phantomjs so maybe this answer needs updating
Cyrus Mohammadian about 5 years

How was the information regarding the login form found?