Scraping Google images with Python

12,388

Solution 1

It seems that Google has recently removed the metadata from the image search results, i.e. you won't find rg_meta in the HTML. Therefore, soup.find_all("div", {"class": "rg_meta"}) will not return anything.

I haven't found a solution for this. I believe Google made this change for the very purpose of preventing scraping.
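You can see why the scraper fails silently rather than with an error: find_all simply returns an empty list when nothing matches, so the download loop never runs. A minimal illustration (the HTML snippet here is made up to mimic a results page without rg_meta divs):

```python
from bs4 import BeautifulSoup

# Stand-in for what the results page now looks like: thumbnails exist,
# but the image metadata is no longer embedded in div.rg_meta elements.
html = """
<html><body>
  <div class="isv-r"><img src="thumb.jpg"></div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")
matches = soup.find_all("div", {"class": "rg_meta"})
print(len(matches))  # 0 -- so "there are total 0 images" is printed downstream
```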

Solution 2

I haven't seen anyone mention this. It's not an ideal solution, but if you want something simple that works and doesn't take much hassle to set up, you can use Selenium. Since Google seems to be intentionally preventing image scraping, as Densus mentioned, perhaps this would be an inappropriate use of Selenium; I'm not sure.

There are plenty of public, working Selenium Google image scrapers on GitHub that you can view and use. In fact, if you search GitHub for any recent Python Google image scraper, I think most if not all of them will be Selenium implementations.

For example: https://github.com/Msalmannasir/Google_image_scraper

For this one, just download the Chromium driver and update the file path to it in the code.
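As a rough sketch of what those repos do (the helper names and the img.rg_i CSS selector are my own assumptions, not taken from the linked repo, and Google changes its markup regularly, so expect to update the selector):

```python
from urllib.parse import quote_plus


def build_search_url(query):
    # Build a Google Images search URL for a plain-text query
    # (same format as the question's URL, with proper quoting).
    return "https://www.google.com/search?q=" + quote_plus(query) + "&tbm=isch"


def fetch_thumbnail_urls(query, limit=5):
    # Imported lazily so build_search_url works even without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()  # requires chromedriver on your PATH
    try:
        driver.get(build_search_url(query))
        # Thumbnail selector is a guess that has worked in past scrapers.
        imgs = driver.find_elements(By.CSS_SELECTOR, "img.rg_i")
        srcs = [img.get_attribute("src") for img in imgs[:limit]]
        return [s for s in srcs if s]
    finally:
        driver.quit()


if __name__ == "__main__":
    print(fetch_thumbnail_urls("red panda"))
```

Because a real browser executes the page's JavaScript, this sidesteps the missing-rg_meta problem entirely, at the cost of needing a driver binary.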

Author: shawnin damnen

Updated on June 12, 2022

Comments

  • shawnin damnen, almost 2 years

    I'm trying to learn Python scraping and came across a program to scrape a set number of images from Google image search results.

    I changed it to go for 5 images. It was working for a while, but it recently stopped working, showing outputs such as "there are total 0 images".

    import requests
    import re
    import urllib2
    import os
    import cookielib
    import json
    from bs4 import BeautifulSoup  # this import was missing from the original

    def get_soup(url, header):
        return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'html.parser')


    query = raw_input("query image")  # you can change the query for the image here
    image_type = "ActiOn"
    query = query.split()
    query = '+'.join(query)
    url = "https://www.google.com/search?q=" + query + "&source=lnms&tbm=isch"
    print url
    # add the directory for your images here
    DIR = "C:\\Users\\mynam\\Desktop\\WB"
    header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
    soup = get_soup(url, header)


    ActualImages = []  # contains the link for the large original image and its type
    for a in soup.find_all("div", {"class": "rg_meta"}):
        meta = json.loads(a.text)
        ActualImages.append((meta["ou"], meta["ity"]))

    print "there are total", len(ActualImages), "images"

    if not os.path.exists(DIR):
        os.mkdir(DIR)
    # query was joined with '+', so split on '+' (not whitespace) to get the first word
    DIR = os.path.join(DIR, query.split('+')[0])

    if not os.path.exists(DIR):
        os.mkdir(DIR)
    ### download the images
    for i, (img, Type) in enumerate(ActualImages[0:5]):
        try:
            # pass the header dict itself, not a dict nested inside another dict
            req = urllib2.Request(img, headers=header)
            raw_img = urllib2.urlopen(req).read()

            cntr = len([name for name in os.listdir(DIR) if image_type in name]) + 1
            print cntr
            if len(Type) == 0:
                f = open(os.path.join(DIR, image_type + "_" + str(cntr) + ".jpg"), 'wb')
            else:
                f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + Type), 'wb')

            f.write(raw_img)
            f.close()
        except Exception as e:
            print "could not load : " + img
            print e
    

    There are no error logs; the folder gets created but it is empty. The ActualImages list remains empty for some reason.
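    For what it's worth, the question's code is Python 2 (urllib2, cookielib, raw_input, print statements), which is another hurdle to running it today. A sketch of the fetch-and-parse helpers ported to Python 3 (urllib2 became urllib.request, cookielib became http.cookiejar); note the extraction still returns an empty list on current result pages, for the reason given in Solution 1:

    ```python
    import json
    from urllib.request import Request, urlopen

    from bs4 import BeautifulSoup

    HEADER = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}


    def get_soup(url, header=HEADER):
        # urllib2.Request/urlopen became urllib.request.Request/urlopen in Python 3.
        return BeautifulSoup(urlopen(Request(url, headers=header)), "html.parser")


    def extract_images(soup):
        # Same extraction as the question's loop; on current Google result
        # pages this returns [] because div.rg_meta no longer exists.
        images = []
        for div in soup.find_all("div", {"class": "rg_meta"}):
            meta = json.loads(div.text)
            images.append((meta["ou"], meta["ity"]))
        return images
    ```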