R web scraping across multiple pages
Solution 1
You can do something similar with purrr::map_df() as well if you want all the info as a data.frame:
library(rvest)
library(purrr)
url_base <- "http://www.winemag.com/?s=washington merlot&drink_type=wine&page=%d"
map_df(1:39, function(i) {

  # simple but effective progress indicator
  cat(".")

  pg <- read_html(sprintf(url_base, i))

  data.frame(wine=html_text(html_nodes(pg, ".review-listing .title")),
             excerpt=html_text(html_nodes(pg, "div.excerpt")),
             rating=gsub(" Points", "", html_text(html_nodes(pg, "span.rating"))),
             appellation=html_text(html_nodes(pg, "span.appellation")),
             price=gsub("\\$", "", html_text(html_nodes(pg, "span.price"))),
             stringsAsFactors=FALSE)

}) -> wines
dplyr::glimpse(wines)
## Observations: 1,170
## Variables: 5
## $ wine (chr) "Charles Smith 2012 Royal City Syrah (Columbia Valley (WA)...
## $ excerpt (chr) "Green olive, green stem and fresh herb aromas are at the ...
## $ rating (chr) "96", "95", "94", "93", "93", "93", "93", "93", "93", "93"...
## $ appellation (chr) "Columbia Valley", "Columbia Valley", "Columbia Valley", "...
## $ price (chr) "140", "70", "70", "20", "70", "40", "135", "50", "60", "3...
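To see the pattern without hitting the site, here is a minimal base-R sketch of the same idea: sprintf() fills the page number into the %d placeholder, a function returns one data.frame per page, and the pieces are row-bound into a single data.frame. The function fake_page() and its columns are hypothetical stand-ins for the real scraping step, and do.call(rbind, lapply(...)) is used here as a base-R analogue of map_df(), not as what map_df() does internally.

```r
url_base <- "http://www.winemag.com/?s=washington merlot&drink_type=wine&page=%d"

# Hypothetical stand-in for the scraping function: instead of reading the page,
# it just records the page number and the URL that would be requested.
fake_page <- function(i) {
  data.frame(page = i,
             url = sprintf(url_base, i),  # %d is replaced by the page number
             stringsAsFactors = FALSE)
}

# Apply the function to each page number, then row-bind the per-page data.frames.
pages <- do.call(rbind, lapply(1:3, fake_page))
```

Swapping fake_page() for a function that actually calls read_html() gives the structure used in the answer above.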
Solution 2
You can lapply across a vector of the URLs, which you can make by pasting the base URL to a sequence:
library(rvest)
wines <- lapply(paste0('http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=', 1:39),
                function(url){
                    url %>% read_html() %>%
                        html_nodes(".review-listing .title") %>%
                        html_text()
                })
The result will be returned in a list with an element for each page.
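As a rough illustration of the two steps involved, the sketch below builds the URL vector on its own and then flattens a per-page list into one character vector with unlist(). The list wines_demo and its wine names are made-up stand-ins for the scraped result, so nothing here touches the network.

```r
# Build one URL per results page by pasting the page number onto the base URL.
base_url <- "http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page="
urls <- paste0(base_url, 1:39)

# Made-up stand-in for the lapply() result: one character vector per page.
wines_demo <- list(c("Wine A", "Wine B"), c("Wine C"))

# Collapse the per-page list into a single character vector.
all_wines <- unlist(wines_demo)
```

With the real result, unlist(wines) would do the same job.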
Jamie Leigh
Updated on July 27, 2022
Comments
-
Jamie Leigh almost 2 years
I am working on a web scraping program to search for specific wines and return a list of local wines of that variety. The problem I am having is multiple page results. The code below is a basic example of what I am working with
url2 <- "http://www.winemag.com/?s=washington+merlot&search_type=reviews"
htmlpage2 <- read_html(url2)
names2 <- html_nodes(htmlpage2, ".review-listing .title")
Wines2 <- html_text(names2)
For this specific search there are 39 pages of results. I know the url changes to http://www.winemag.com/?s=washington%20merlot&drink_type=wine&page=2, but is there an easy way to make the code loop through all the returned pages and compile the results from all 39 pages into a single list? I know I can manually do all the urls, but that seems like overkill.
-
ASH almost 8 years
Very nice alistaire!! Can you explain how this works? Thanks.
-
alistaire almost 8 years
It pastes together a vector of URLs, one for each page, and then lapply runs the function on each one. The function is an rvest chain that reads the HTML at that URL, selects the nodes with the specified classes (i.e. the titles), and grabs the text from inside those nodes. It returns a list item for each time it runs the function, but if you want to collapse them all into one vector, just run unlist(wines). If you want to grab other elements for each wine as well, you can assemble them all into a data.frame.
-
Rhodo about 7 years
I love this code. I like to replace cat(".") with cat("boom! "). Personal preference I guess.
-
Mostafa90 over 4 years
Please, if I want to click on "SEE FULL REVIEW" (for each row), which opens a new web page, do I have to use RSelenium?
-
alistaire over 4 years
No; each row is wrapped in an <a> tag with an href attribute, which is the URL to the review, so you could get a vector of the URLs for further work with something like page %>% html_nodes('a.review-listing') %>% html_attr('href')
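
The href extraction from the last comment can be tried offline on a small inline HTML fragment. This is only a sketch: the markup and the example.com URLs below are invented for illustration, and the real page's structure may differ or may have changed since the comment was written.

```r
library(rvest)

# Invented stand-in for one results page; not the site's actual markup.
page <- read_html('
  <html><body>
    <a class="review-listing" href="http://example.com/review-1"><span class="title">Wine A</span></a>
    <a class="review-listing" href="http://example.com/review-2"><span class="title">Wine B</span></a>
  </body></html>
')

# Select each review link and pull out its href attribute.
review_urls <- page %>%
  html_nodes("a.review-listing") %>%
  html_attr("href")
```

Each element of review_urls could then be passed to read_html() in the same way as the listing pages, so no browser automation is needed for this step.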