R web scraping across multiple pages

html r web-scraping rvest

15,075

Solution 1

You can do something similar with purrr::map_df() as well if you want all the info as a data.frame:

```
library(rvest)
library(purrr)

# "+" encodes the space in the search term (a raw space is not a valid URL
# character, and "%20" would be misread by sprintf() as a format spec)
url_base <- "http://www.winemag.com/?s=washington+merlot&drink_type=wine&page=%d"

map_df(1:39, function(i) {

  # simple but effective progress indicator
  cat(".")

  # fetch and parse page i of the search results
  pg <- read_html(sprintf(url_base, i))

  # one row per wine on this page
  data.frame(wine=html_text(html_nodes(pg, ".review-listing .title")),
             excerpt=html_text(html_nodes(pg, "div.excerpt")),
             rating=gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation=html_text(html_nodes(pg, "span.appellation")),
             price=gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors=FALSE)

}) -> wines

dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $ wine        (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $ excerpt     (chr) "Green olive, green stem and fresh herb aromas are at the ...
## $ rating      (chr) "96", "95", "94", "93", "93", "93", "93", "93", "93", "93"...
## $ appellation (chr) "Columbia Valley", "Columbia Valley", "Columbia Valley", "...
## $ price       (chr) "140", "70", "70", "20", "70", "40", "135", "50", "60", "3...
```
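
To make the two moving parts of this solution easier to see without hitting the site, here is a minimal offline sketch. It shows how sprintf() fills the page number into the URL template, and how one data.frame per page ends up as a single data.frame. The fake_page() helper and its made-up rows are assumptions for illustration only, and do.call(rbind, ...) is used as a base-R stand-in for the row-binding that map_df() does; the "+" in the URL is likewise just one way to encode the space.

```r
# Build the page URLs: sprintf() substitutes each page number for %d
url_base <- "http://www.winemag.com/?s=washington+merlot&drink_type=wine&page=%d"
urls <- sprintf(url_base, 1:3)
print(urls)

# Hypothetical stand-in for the scraping step: a tiny data.frame per page
fake_page <- function(i) {
  data.frame(page = i,
             wine = paste("Wine", c("A", "B"), i),
             stringsAsFactors = FALSE)
}

# One data.frame per page, then row-bind them into a single data.frame
pages <- lapply(1:3, fake_page)
wines_demo <- do.call(rbind, pages)
nrow(wines_demo)  # 3 pages x 2 rows each
```

In the real loop, data.frame() will stop with an error if the selectors return different numbers of nodes on a page (for example a wine with no price), so it may be worth checking the lengths before building each row set.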

Solution 2

You can lapply across a vector of the URLs, which you can make by pasting the base URL to a sequence:

```
library(rvest)

wines <- lapply(paste0('http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=', 1:39),
                function(url){
                    url %>% read_html() %>%
                        html_nodes(".review-listing .title") %>%
                        html_text()
                })
```

The result will be returned in a list with an element for each page.

Author by

Jamie Leigh

Updated on July 27, 2022

Comments

Jamie Leigh almost 2 years
I am working on a web scraping program to search for specific wines and return a list of local wines of that variety. The problem I am having is with results that span multiple pages. The code below is a basic example of what I am working with:
```
url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
htmlpage2 <- read_html(url2)
names2 <- html_nodes(htmlpage2, ".review-listing .title")
Wines2 <- html_text(names2)
```
For this specific search there are 39 pages of results. I know the URL changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2 for the second page, but is there an easy way to make the code loop through all the returned pages and compile the results from all 39 pages into a single list? I know I can enter all the URLs manually, but that seems like overkill.
ASH almost 8 years

Very nice alistaire!! Can you explain how this works? Thanks.
alistaire almost 8 years

It pastes together a vector of URLs, one for each page, and then lapply runs the function on each one. The function is an rvest chain that reads the HTML at that URL, selects the nodes with the specified classes (i.e. the titles), and grabs the text from inside those nodes. It returns a list item for each time it runs the function, but if you want to collapse them all into one vector, just run unlist(wines). If you want to grab other elements for each wine as well, you can assemble them all into a data.frame.
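
As a rough illustration of that explanation, the sketch below runs the same paste0 / lapply / unlist pattern offline. The fake_titles() function is a hypothetical stand-in for the rvest chain and returns made-up titles, so nothing here reflects what the site actually serves.

```r
# paste0() is vectorised: it produces one URL per page number
urls <- paste0("http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=", 1:3)

# Hypothetical stand-in for the rvest chain: two made-up titles per URL
fake_titles <- function(url) {
  page_no <- sub(".*page=", "", url)
  paste0("Wine ", c("A", "B"), " (page ", page_no, ")")
}

titles_by_page <- lapply(urls, fake_titles)  # list with one element per page
all_titles <- unlist(titles_by_page)         # collapsed into one character vector

length(titles_by_page)
length(all_titles)
```

With the real scraping function in place of fake_titles(), unlist(wines) collapses the 39 per-page vectors in the same way.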
Rhodo about 7 years

I love this code. I like to replace cat(".") with cat("boom! "). Personal preference I guess.
Mostafa90 over 4 years

If I want to click on "SEE FULL REVIEW" (for each row), which opens a new web page, do I have to use RSelenium?
alistaire over 4 years

No; each row is wrapped in an <a> tag with an href attribute which is the URL to the review, so you could get a vector of the URLs for further work with something like page %>% html_nodes('a.review-listing') %>% html_attr('href')
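
A small self-contained sketch of that idea, run against made-up HTML rather than the live site. The markup and the example.com URLs below are assumptions that only mimic the structure described in the comment; the real page may be laid out differently.

```r
library(rvest)

# Made-up HTML: each listing is an <a class="review-listing"> with an href
page <- read_html('
  <html><body>
    <a class="review-listing" href="https://example.com/review/1">
      <span class="title">Wine A</span></a>
    <a class="review-listing" href="https://example.com/review/2">
      <span class="title">Wine B</span></a>
  </body></html>')

# Select the link nodes and pull out the href attribute of each
review_urls <- page %>%
  html_nodes("a.review-listing") %>%
  html_attr("href")

review_urls
```

The resulting character vector can then be passed to lapply() with read_html(), in the same way as the page URLs, to fetch each full review without a browser driver.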