GET request to Google Search

Solution 1

You can load it in the browser and then scrape the results via JavaScript.

Alternatively, you can use the Google API, but it seems to require payment once you make more than 100 requests per day.

Solution 2

You now have to use the Google Search API to make your GET requests.

All other methods have been blocked.
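As a rough illustration of what such a GET request could look like, here is a minimal sketch that only builds the request URL. It assumes the API meant here is Google's Custom Search JSON API (endpoint `https://www.googleapis.com/customsearch/v1` with the `key`, `cx` and `q` parameters); `YOUR_API_KEY` and `YOUR_ENGINE_ID` are placeholders, and `build_api_url` is a hypothetical helper name.

```python
from urllib.parse import urlencode

# Endpoint of Google's Custom Search JSON API (assumed to be the API meant above).
API_ENDPOINT = "https://www.googleapis.com/customsearch/v1"


def build_api_url(query: str, api_key: str, engine_id: str) -> str:
    """Return the GET URL for one search request.

    api_key and engine_id are placeholders; real values come from your own
    Google account and search engine configuration.
    """
    params = {"key": api_key, "cx": engine_id, "q": query}
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = build_api_url("1111", "YOUR_API_KEY", "YOUR_ENGINE_ID")
print(url)
```

Fetching that URL with any HTTP client should return JSON rather than HTML, so no scraping of the results page is involved.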

Solution 3

The page from your question is the Google Search page with the input field.

Screenshot of https://www.google.ru/?q=1111

The search results page is this one:

https://www.google.ru/search?q=1111
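The difference between the two URLs is only the `/search` path. A small sketch of building the results URL for an arbitrary query, with the query string properly encoded (`build_results_url` is a hypothetical helper, and the host is simply the one from the question):

```python
from urllib.parse import urlencode, urlsplit


def build_results_url(query: str, host: str = "www.google.ru") -> str:
    """Build the results-page URL (/search), not the start page (/)."""
    return f"https://{host}/search?{urlencode({'q': query})}"


url = build_results_url("1111")
print(url)                  # the results page from above
print(urlsplit(url).path)   # path component of that URL
```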

Rotate proxies and user agents, and delay similar requests, to get the HTML from Google Search results pages with fewer bans.
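A minimal sketch of that rotation idea, using only the Python standard library. The user agent strings and proxy addresses are made-up placeholders, the delay range is an arbitrary assumption, and `request_plan`, `fetch` and `scrape` are hypothetical helper names. Nothing here guarantees that Google will not still answer with a captcha.

```python
import itertools
import random
import time
import urllib.request
from urllib.parse import urlencode

# Placeholder values; replace with real browser user agents and working proxies.
USER_AGENTS = [
    "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) ExampleBrowser/2.0",
]
PROXIES = ["http://proxy1.example:8080", "http://proxy2.example:8080"]


def request_plan(queries, user_agents, proxies, min_delay=2.0, max_delay=5.0):
    """Yield one step per query, cycling through user agents and proxies."""
    ua_cycle = itertools.cycle(user_agents)
    proxy_cycle = itertools.cycle(proxies)
    for query in queries:
        yield {
            "url": "https://www.google.ru/search?" + urlencode({"q": query}),
            "user_agent": next(ua_cycle),
            "proxy": next(proxy_cycle),
            # Random pause so requests are not evenly spaced.
            "delay": random.uniform(min_delay, max_delay),
        }


def fetch(step):
    """Send one GET request through the step's proxy with its user agent."""
    proxy_handler = urllib.request.ProxyHandler(
        {"http": step["proxy"], "https": step["proxy"]}
    )
    opener = urllib.request.build_opener(proxy_handler)
    request = urllib.request.Request(
        step["url"], headers={"User-Agent": step["user_agent"]}
    )
    with opener.open(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")


def scrape(queries):
    """Fetch every query in turn, sleeping between requests."""
    pages = []
    for step in request_plan(queries, USER_AGENTS, PROXIES):
        pages.append(fetch(step))
        time.sleep(step["delay"])
    return pages
```

Calling `scrape(["coffee", "tea"])` would send the two requests through different proxies with different user agents; with the placeholder proxies above it will simply fail to connect.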

Or use SerpApi to access the HTML and the data extracted from it. It has a free trial.

curl -s 'https://serpapi.com/search?q=coffee'

Output

{
  // Omitted

  "organic_results": [
    {
      "position": 1,
      "title": "Coffee - Wikipedia",
      "link": "https://en.wikipedia.org/wiki/Coffee",
      "displayed_link": "en.wikipedia.org › wiki › Coffee",
      "snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the seeds of berries from certain Coffea species. When coffee berries turn from green to bright red ...",
      "sitelinks": {
        "expanded": [
          {
            "title": "History",
            "link": "https://en.wikipedia.org/wiki/History_of_coffee",
            "snippet": "The history of coffee dates back to the 15th century, and possibly ..."
          },
          {
            "title": "International Coffee Day",
            "link": "https://en.wikipedia.org/wiki/International_Coffee_Day",
            "snippet": "International Coffee Day (1 October) is an occasion that is ..."
          },
          {
            "title": "List of coffee drinks",
            "link": "https://en.wikipedia.org/wiki/List_of_coffee_drinks",
            "snippet": "Milk coffee - Nitro cold brew coffee - List of coffee dishes - ..."
          },
          {
            "title": "Portal:Coffee",
            "link": "https://en.wikipedia.org/wiki/Portal:Coffee",
            "snippet": "Coffee is a brewed drink prepared from roasted coffee beans, the ..."
          },
          {
            "title": "Coffee bean",
            "link": "https://en.wikipedia.org/wiki/Coffee_bean",
            "snippet": "A coffee bean is a seed of the Coffea plant and the source for ..."
          },
          {
            "title": "Geisha",
            "link": "https://en.wikipedia.org/wiki/Geisha_(coffee)",
            "snippet": "Geisha coffee, sometimes referred to as Gesha coffee, is a type of ..."
          }
        ],
        "list": [
          {
            "date": "Color‎: ‎Black, dark brown, light brown, beige"
          }
        ]
      },
      "rich_snippet": {
        "bottom": {
          "detected_extensions": {
            "introduced_th_century": 15
          },
          "extensions": [
            "Introduced‎: ‎15th century",
            "Color‎: ‎Black, dark brown, light brown, beige"
          ]
        }
      },
      "cached_page_link": "https://webcache.googleusercontent.com/search?q=cache:U6oJMnF-eeUJ:https://en.wikipedia.org/wiki/Coffee+&cd=2&hl=sv&ct=clnk&gl=se",
      "related_pages_link": "https://www.google.se/search?gl=se&hl=sv&q=related:https://en.wikipedia.org/wiki/Coffee+coffee&sa=X&ved=2ahUKEwjJ9p2p_KXuAhVlRN8KHf22D8wQHzABegQIAhAJ"
    }
  ],

  // ...
}
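Once the response is saved, the interesting fields can be read with any JSON parser. A small sketch in Python, run against a trimmed copy of the output above (the `// Omitted` placeholders are not valid JSON, so they are left out here); `extract_links` is a hypothetical helper name.

```python
import json

# Trimmed copy of the response shown above, reduced to valid JSON.
sample = """
{
  "organic_results": [
    {
      "position": 1,
      "title": "Coffee - Wikipedia",
      "link": "https://en.wikipedia.org/wiki/Coffee"
    }
  ]
}
"""


def extract_links(payload: str):
    """Return (position, title, link) for every organic result."""
    data = json.loads(payload)
    return [
        (result["position"], result["title"], result["link"])
        for result in data.get("organic_results", [])
    ]


results = extract_links(sample)
for position, title, link in results:
    print(position, title, link)
```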

Disclaimer: I work at SerpApi.

Solution 4

I want to add a bit more to the existing answers, as they are not correct and do not even address your problem.

First of all, it's perfectly legal to scrape Google as long as you do not harm their service through it (DoS-like).
Also, the other methods have not been blocked; it's just not that simple.

The speed depends on your methods; it does not have to be very slow.
You can scrape tens of thousands of keyword pages in a minute if needed.

You will find a better answer to the topic here: Is it ok to scrape data from Google results?

Your problem with curl does indeed come from protection: Google does not allow automated access and has a very sophisticated set of detection algorithms.
They range from simple user agent checks (that's what stopped you directly) to artificial intelligence that tries to detect unusual or related queries.
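To illustrate the simplest of those checks: HTTP clients such as curl or Python's `urllib` announce themselves with their own default `User-Agent` header unless told otherwise. A minimal sketch of setting a browser-like header instead, using the Python standard library; the user agent string is a made-up placeholder and `build_request` is a hypothetical helper. This only addresses the user agent check and may still run into the other detection layers.

```python
import urllib.request

# Placeholder; a real browser's user agent string would go here.
BROWSER_USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64) ExampleBrowser/1.0"


def build_request(url: str, user_agent: str = BROWSER_USER_AGENT):
    """Prepare a GET request that carries a browser-like User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


request = build_request("https://www.google.ru/search?q=1111")
print(request.get_method(), request.full_url)
print(request.get_header("User-agent"))  # urllib stores the key as "User-agent"

# To actually send it (may still be blocked or answered with a captcha):
# with urllib.request.urlopen(request, timeout=10) as response:
#     html = response.read().decode("utf-8", errors="replace")
```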

Author: Maximus

Updated on July 21, 2022

Comments

  • Maximus almost 2 years

    I'm trying to get the HTML with search results from Google by sending a GET request, for example to:

    https://www.google.ru/?q=1111

    In the browser everything is fine, but when I use curl or "View source", I get only some JavaScript code and no search results. Is that some type of protection? What can I do?

  • Brian Smith over 7 years
    Your method will get blocked pretty quickly. Google will present a "we want to make sure you're not a robot ..." screen with a captcha you must solve in order to continue searching.
  • UndeadDragon about 7 years
    @BrianSmith, yes, of course it will. But only once for all pages.
  • UndeadDragon about 7 years
    @John Only once per query (before the results appear; after that there is no captcha when you click through the pages). It happens for every query, as I said.
  • user2284570 over 5 years
    The problem is that it must be made against a specific website.
  • jasonleonhard about 5 years
    Note: This has a cost above x requests.