Scraping Google images with Python

12,388

Solution 1

It seems that Google has recently removed the metadata from the image search results, i.e. you won't find rg_meta in the HTML. Therefore, soup.find_all("div", {"class": "rg_meta"}) will not return anything.

I haven't found a solution for this. I believe Google made this change for the very purpose of preventing scraping.
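You can see why the scraper fails silently rather than with an error: find_all simply returns an empty list when nothing matches, so the download loop never runs. A minimal illustration (the HTML snippet here is made up to mimic a results page without rg_meta divs):

```python
from bs4 import BeautifulSoup

# Stand-in for what the results page now looks like: thumbnails exist,
# but the image metadata is no longer embedded in div.rg_meta elements.
html = """
<html><body>
  <div class="isv-r"><img src="thumb.jpg"></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("div", {"class": "rg_meta"})
print(len(matches))  # 0 -- so "there are total 0 images" is printed downstream
```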

Solution 2

I haven't seen anyone mention this. It's not an ideal solution, but if you want something simple that works and doesn't take much hassle to set up, you can use Selenium. Since Google seems to be intentionally preventing image scraping, as Densus mentioned, perhaps this would be an inappropriate use of Selenium; I'm not sure.

There are plenty of public, working Selenium Google image scrapers on GitHub that you can view and use. In fact, if you search GitHub for any recent Python Google image scraper, I think most if not all of them will be Selenium implementations.

For example: https://github.com/Msalmannasir/Google_image_scraper

For this one, just download the Chromium driver and update the file path to it in the code.
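As a rough sketch of what those repos do (the helper names and the img.rg_i CSS selector are my own assumptions, not taken from the linked repo, and Google changes its markup regularly, so expect to update the selector):

```python
from urllib.parse import quote_plus


def build_search_url(query):
    # Build a Google Images search URL for a plain-text query
    # (same format as the question's URL, with proper quoting).
    return "https://www.google.com/search?q=" + quote_plus(query) + "&tbm=isch"


def fetch_thumbnail_urls(query, limit=5):
    # Imported lazily so build_search_url works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # requires chromedriver on your PATH
    try:
        driver.get(build_search_url(query))
        # Thumbnail selector is a guess that has worked in past scrapers.
        imgs = driver.find_elements(By.CSS_SELECTOR, "img.rg_i")
        srcs = [img.get_attribute("src") for img in imgs[:limit]]
        return [s for s in srcs if s]
    finally:
        driver.quit()


if __name__ == "__main__":
    print(fetch_thumbnail_urls("red panda"))
```

Because a real browser executes the page's JavaScript, this sidesteps the missing-rg_meta problem entirely, at the cost of needing a driver binary.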

Author: shawnin damnen

Updated on June 12, 2022

Comments

  • shawnin damnen, almost 2 years

    I'm trying to learn Python scraping and came across a program to scrape a set number of images from Google image search results.

    I changed it to go for 5 images. It was working for a while, but it recently stopped working, showing outputs such as "there are total 0 images".

    import requests
    import re
    import urllib2
    import os
    import cookielib
    import json
    from bs4 import BeautifulSoup  # this import was missing from the original

    def get_soup(url, header):
        return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'html.parser')


    query = raw_input("query image")  # you can change the query for the image here
    image_type = "ActiOn"
    query = query.split()
    query = '+'.join(query)
    url = "https://www.google.com/search?q=" + query + "&source=lnms&tbm=isch"
    print url
    # add the directory for your images here
    DIR = "C:\\Users\\mynam\\Desktop\\WB"
    header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = get_soup(url, header)


    ActualImages = []  # contains the link for the large original image and its type
    for a in soup.find_all("div", {"class": "rg_meta"}):
        meta = json.loads(a.text)
        ActualImages.append((meta["ou"], meta["ity"]))

    print "there are total", len(ActualImages), "images"

    if not os.path.exists(DIR):
        os.mkdir(DIR)
    # query was joined with '+', so split on '+' (not whitespace) to get the first word
    DIR = os.path.join(DIR, query.split('+')[0])

    if not os.path.exists(DIR):
        os.mkdir(DIR)
    ### download the images
    for i, (img, Type) in enumerate(ActualImages[0:5]):
        try:
            # pass the header dict itself, not a dict nested inside another dict
            req = urllib2.Request(img, headers=header)
            raw_img = urllib2.urlopen(req).read()

            cntr = len([name for name in os.listdir(DIR) if image_type in name]) + 1
            print cntr
            if len(Type) == 0:
                f = open(os.path.join(DIR, image_type + "_" + str(cntr) + ".jpg"), 'wb')
            else:
                f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + Type), 'wb')

            f.write(raw_img)
            f.close()
        except Exception as e:
            print "could not load : " + img
            print e
    

    There are no error logs; the folder gets created but it is empty. The ActualImages list remains empty for some reason.
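    For what it's worth, the question's code is Python 2 (urllib2, cookielib, raw_input, print statements), which is another hurdle to running it today. A sketch of the fetch-and-parse helpers ported to Python 3 (urllib2 became urllib.request, cookielib became http.cookiejar); note the extraction still returns an empty list on current result pages, for the reason given in Solution 1:

    ```python
    import json
    from urllib.request import Request, urlopen

    from bs4 import BeautifulSoup

    HEADER = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


    def get_soup(url, header=HEADER):
        # urllib2.Request/urlopen became urllib.request.Request/urlopen in Python 3.
        return BeautifulSoup(urlopen(Request(url, headers=header)), "html.parser")


    def extract_images(soup):
        # Same extraction as the question's loop; on current Google result
        # pages this returns [] because div.rg_meta no longer exists.
        images = []
        for div in soup.find_all("div", {"class": "rg_meta"}):
            meta = json.loads(div.text)
            images.append((meta["ou"], meta["ity"]))
        return images
    ```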