R web scraping across multiple pages


Solution 1

If you want all the info as a single data.frame, you can loop over the pages with purrr::map_df(), which builds one data.frame per page and row-binds them:

library(rvest)
library(purrr)

# "+" encodes the space in the query; a literal "%20" would clash with sprintf()'s "%" syntax
url_base <- "http://www.winemag.com/?s=washington+merlot&drink_type=wine&page=%d"

map_df(1:39, function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(wine=html_text(html_nodes(pg, ".review-listing .title")),
             excerpt=html_text(html_nodes(pg, "div.excerpt")),
             rating=gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation=html_text(html_nodes(pg, "span.appellation")),
             price=gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors=FALSE)

}) -> wines

dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $ wine        (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $ excerpt     (chr) "Green olive, green stem and fresh herb aromas are at the ...
## $ rating      (chr) "96", "95", "94", "93", "93", "93", "93", "93", "93", "93"...
## $ appellation (chr) "Columbia Valley", "Columbia Valley", "Columbia Valley", "...
## $ price       (chr) "140", "70", "70", "20", "70", "40", "135", "50", "60", "3...
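Note that rating and price come back as character columns. A minimal sketch of converting them, using a small made-up data.frame (a hypothetical stand-in for the scraped result) so that it runs offline:

```r
# Hypothetical stand-in for a few rows of the scraped data.frame
wines <- data.frame(rating = c("96", "95", "94"),
                    price = c("140", "70", "70"),
                    stringsAsFactors = FALSE)

# Convert the text columns to numbers; entries that are not numeric
# (an assumption about what the real data might contain) would become NA
wines$rating <- as.integer(wines$rating)
wines$price <- as.numeric(wines$price)

str(wines)
```

On the real data it is probably worth checking for NA values afterwards, since any listing without a plain numeric price would not convert cleanly.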

Solution 2

You can lapply across a vector of the URLs, which you can make by pasting the base URL to a sequence of page numbers:

library(rvest)

wines <- lapply(paste0('http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=', 1:39),
                function(url){
                    url %>% read_html() %>% 
                        html_nodes(".review-listing .title") %>% 
                        html_text()
                })

The result will be returned in a list with an element for each page.
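If a single vector is more convenient than a list, unlist() collapses it. A small sketch, with made-up page contents standing in for the scraped titles so that it runs offline:

```r
# Hypothetical stand-in for the lapply() result: one character vector per page
wines <- list(
  c("Wine A 2012 Merlot", "Wine B 2013 Merlot"),
  c("Wine C 2011 Merlot")
)

# Collapse the per-page vectors into one character vector
all_wines <- unlist(wines)

length(all_wines)
```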


Jamie Leigh


Updated on July 27, 2022

Comments

  • Jamie Leigh
    Jamie Leigh almost 2 years

    I am working on a web scraping program to search for specific wines and return a list of local wines of that variety. The problem I am having is with results that span multiple pages. The code below is a basic example of what I am working with:

    url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
    htmlpage2 <- read_html(url2)
    names2 <- html_nodes(htmlpage2, ".review-listing .title")
    Wines2 <- html_text(names2)
    

    For this specific search there are 39 pages of results. I know the url changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2, but is there an easy way to make the code loop through all the returned pages and compile the results from all 39 pages into a single list? I know I can manually do all the urls, but that seems like overkill.

  • ASH
    ASH almost 8 years
    Very nice alistaire!! Can you explain how this works? Thanks.
  • alistaire
    alistaire almost 8 years
    It pastes together a vector of URLs, one for each page, and then lapply runs the function on each one. The function is an rvest chain that reads the HTML at that URL, selects the nodes with the specified classes (i.e. the titles), and grabs the text from inside those nodes. It returns a list item for each time it runs the function, but if you want to collapse them all into one vector, just run unlist(wines). If you want to grab other elements for each wine as well, you can assemble them all into a data.frame.
  • Rhodo
    Rhodo about 7 years
    I love this code. I like to replace cat(".") with cat("boom! ") . Personal preference I guess.
  • Mostafa90
    Mostafa90 over 4 years
    If I want to click on "SEE FULL REVIEW" (for each row), which opens a new web page, do I have to use RSelenium?
  • alistaire
    alistaire over 4 years
    No; each row is wrapped in an <a> tag with an href attribute which is the URL to the review, so you could get a vector of the URLs for further work with something like page %>% html_nodes('a.review-listing') %>% html_attr('href')
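
The last suggestion can be tried offline on a small hand-written HTML fragment. This is only a sketch: the markup and the example.com URLs below are made up to mimic the structure described in the comment, and the real page may differ.

```r
library(rvest)

# Made-up fragment: each listing is an <a class="review-listing"> wrapping the title
page <- read_html('<html><body>
  <a class="review-listing" href="http://example.com/review-1"><span class="title">Wine A</span></a>
  <a class="review-listing" href="http://example.com/review-2"><span class="title">Wine B</span></a>
</body></html>')

# Select the link elements and pull out the href attribute of each
review_urls <- page %>%
  html_nodes("a.review-listing") %>%
  html_attr("href")

review_urls
```

Each element of review_urls could then be passed to read_html() in the same way as the page URLs in the solutions above.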