How to get Google search results
Solution 1
If you look at the html variable, you can see that the search result links are all nested in <h3 class="r"> tags.
Try changing your getGoogleLinks function to:
library(XML)
library(RCurl)

getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R (2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
  # Select every <a> nested anywhere inside an <h3 class="r">.
  nodes <- getNodeSet(html, "//h3[@class='r']//a")
  return(sapply(nodes, function(x) xmlAttrs(x)[["href"]]))
}
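To see what that XPath selects without hitting Google, here is a minimal sketch run against a made-up HTML snippet. The markup and URLs below are assumptions for illustration only, not Google's actual page, which has changed over time.

```r
library(XML)

# Hypothetical markup loosely imitating the old result layout (an assumption).
sample_html <- '
<html><body>
  <h3 class="r"><a href="https://cran.r-project.org/">CRAN</a></h3>
  <h3 class="other"><a href="https://example.com/ignored">Ignored</a></h3>
  <div><h3 class="r"><span><a href="https://www.r-project.org/">R</a></span></h3></div>
</body></html>'

# Parse the string directly instead of downloading a page.
doc <- htmlParse(sample_html, asText = TRUE)

# "//h3[@class='r']" matches h3 elements whose class attribute is exactly "r";
# the trailing "//a" then matches links at any depth below them.
nodes <- getNodeSet(doc, "//h3[@class='r']//a")
links <- sapply(nodes, xmlGetAttr, "href")
print(links)
```

Only the two links under an h3 with class "r" should be picked up; the one under class "other" is skipped. If getNodeSet returns an empty list on a real page, the class names in the page most likely no longer match the XPath.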
Solution 2
I created this function to read in a list of company names and then get the top website result for each. It will get you started, and you can adjust it as needed.
# Libraries. URLencode() comes from the utils package, which is attached by default.
library(rvest)

# Load data.
d <- read.csv("P:\\needWebsites.csv")
company_names <- as.character(d$Company.Name)

# Function for getting the website.
getWebsite <- function(name) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- read_html(url)
  results <- page %>%
    html_nodes("cite") %>% # Get all nodes of type cite. You can change this to grab other node types.
    html_text()
  result <- results[1]
  return(as.character(result)) # Return `results` instead if you want to see them all.
}

# Apply the function to a list of company names.
websites <- data.frame(Website = sapply(company_names, getWebsite))
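One thing to watch: URLencode() with its default reserved = FALSE leaves characters such as & untouched, so a company name like "AT&T inc" would be cut off at the & in the query string. A possible tweak, sketched here with a hypothetical helper build_search_url() that is not part of the original answer, is to encode only the name with reserved = TRUE and then append it to the base URL:

```r
# Hypothetical helper (an assumption, not from the answer above):
# encode just the query value so that "&", "=", "?" etc. inside a
# company name cannot be mistaken for URL syntax.
build_search_url <- function(name) {
  paste0("https://www.google.com/search?q=", URLencode(name, reserved = TRUE))
}

urls <- vapply(
  c("apple acres llc", "AT&T inc"),
  build_search_url,
  character(1),
  USE.NAMES = FALSE
)
print(urls)
```

The resulting string could then be passed to read_html() in place of the url variable built inside the function.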
Solution 3
The other solutions here don't work for me. Here's my take on @Bryce-Chamberlain's issue, which worked for me in August 2019. It also answers another, closed question: company name to URL in R.
# install.packages("rvest")
get_first_google_link <- function(name, root = TRUE) {
  url <- URLencode(paste0("https://www.google.com/search?q=", name))
  page <- xml2::read_html(url)
  # extract all links
  nodes <- rvest::html_nodes(page, "a")
  links <- rvest::html_attr(nodes, "href")
  # extract first link of the search results
  link <- links[startsWith(links, "/url?q=")][1]
  # clean it
  link <- sub("^/url\\?q\\=(.*?)\\&sa.*$", "\\1", link)
  # get root if relevant
  if (root) link <- sub("^(https?://.*?/).*$", "\\1", link)
  link
}

companies <- data.frame(company = c("apple acres llc", "abbvie inc", "apple inc"))
companies <- transform(companies, url = sapply(company, get_first_google_link))
companies
#>           company                            url
#> 1 apple acres llc https://www.appleacresllc.com/
#> 2      abbvie inc        https://www.abbvie.com/
#> 3       apple inc         https://www.apple.com/
Created on 2019-08-10 by the reprex package (v0.2.1)
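The two sub() calls are the core of this answer, and they can be tried offline. The sketch below uses made-up href values in place of what rvest::html_attr() would return (an assumption for illustration). It also drops NA entries first, which is an addition and not part of the answer: startsWith() returns NA for a missing href, and an NA in a logical index would carry NA into the subset.

```r
# Made-up href values standing in for rvest::html_attr(nodes, "href").
links <- c(
  NA,                                                   # an <a> without href
  "/search?q=apple+inc&tbm=isch",                       # internal Google link
  "/url?q=https://www.apple.com/iphone/&sa=U&ved=abc",  # first real result
  "/url?q=https://support.apple.com/&sa=U&ved=def"
)

# Addition to the answer: remove missing hrefs before filtering.
links <- links[!is.na(links)]

# Keep only result links and take the first one.
first <- links[startsWith(links, "/url?q=")][1]

# Strip the "/url?q=" prefix and everything from "&sa" onwards.
clean <- sub("^/url\\?q\\=(.*?)\\&sa.*$", "\\1", first)

# Reduce the URL to scheme plus host, as root = TRUE does.
root <- sub("^(https?://.*?/).*$", "\\1", clean)

print(c(clean = clean, root = root))
```

If Google changes the "/url?q=" prefix or the "&sa" parameter, both patterns would need adjusting.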
Avi
Updated on July 13, 2022
Comments
-
Avi almost 2 years
I used the following code:
library(XML)
library(RCurl)

getGoogleURL <- function(search.term, domain = '.co.uk', quotes = TRUE) {
  search.term <- gsub(' ', '%20', search.term)
  if (quotes) search.term <- paste('%22', search.term, '%22', sep = '')
  getGoogleURL <- paste('http://www.google', domain, '/search?q=', search.term, sep = '')
}

getGoogleLinks <- function(google.url) {
  doc <- getURL(google.url, httpheader = c("User-Agent" = "R(2.10.0)"))
  html <- htmlTreeParse(doc, useInternalNodes = TRUE, error = function(...) {})
  nodes <- getNodeSet(html, "//a[@href][@class='l']")
  return(sapply(nodes, function(x) x <- xmlAttrs(x)[[1]]))
}

search.term <- "cran"
quotes <- "FALSE"
search.url <- getGoogleURL(search.term = search.term, quotes = quotes)
links <- getGoogleLinks(search.url)
I would like to find all the links returned by my search, but I get the following result:
> links
list()
How can I get the links? In addition, I would like to get the headlines and summaries of the Google results. How can I get them? And finally, is there a way to get the links that reside in the ChillingEffects.org results?
-
hrbrmstr over 8 years
-
Therii almost 5 years
Hello Bryce, I have written a program taking inspiration from yours, but I'm getting character(0). Please help.
-
Therii almost 5 years
r_h = read_html("google.com/…) ; d = r_h %>% html_nodes(".iUh30") %>% html_text() %>% as.character()
-
hoang tran almost 5 years
Hi, I did exactly the same thing but my nodes is NULL. What could be wrong? Thank you!
-
hoang tran almost 5 years
Could you also explain how to choose "//h3[@class='r']//a"? What is it based on?
-
user3794498 almost 5 years
Google has changed their website, so the results are no longer nested in h3 tags. When looking for nodes, "//h3[@class='r']//a" means: look for 'a' nodes (i.e. links) nested anywhere inside 'h3' nodes (i.e. level 3 headers) with class 'r', anywhere in the document.
-
moodymudskipper over 4 years
@Therii maybe my answer will help