Scraping Google images with Python
Solution 1
It seems that Google has recently removed the metadata from the image search results, i.e. you won't find rg_meta in the HTML. Therefore,

soup.find_all("div", {"class": "rg_meta"})

will not return anything.
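As a quick illustration of the point above, parsing a saved results page with BeautifulSoup now yields an empty list. The HTML snippet below is a hypothetical stand-in for the current Google Images markup (which no longer carries rg_meta divs with JSON metadata):

```python
from bs4 import BeautifulSoup

# Hypothetical minimal stand-in for today's Google Images HTML:
# thumbnails are inlined, and there are no <div class="rg_meta"> blocks.
modern_html = """
<html><body>
  <div class="islrc"><img src="data:image/jpeg;base64,/9j..." alt=""></div>
</body></html>
"""

soup = BeautifulSoup(modern_html, "html.parser")
hits = soup.find_all("div", {"class": "rg_meta"})
print(len(hits))  # 0 -- nothing for the old scraper to parse
```

So the old parsing loop silently produces an empty list rather than raising an error, which matches the "there are 0 images" symptom in the question.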
I haven't found a solution for this. I believe Google made this change for the very purpose of preventing scraping.
Solution 2
I haven't seen anyone mention this. It's not an ideal solution, but if you want something simple that works and takes no hassle to set up, you can use selenium. Since Google seems to be intentionally preventing image scraping, as Densus mentioned, this may be an inappropriate use of selenium; I'm not sure.
There are plenty of public, working selenium Google image scrapers on GitHub that you can view and use. In fact, if you search GitHub for any recent Python Google image scraper, I think most, if not all, of them will be selenium implementations.
For example: https://github.com/Msalmannasir/Google_image_scraper
For this one, just download the Chromium driver and update the file path to it in the code.
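A minimal sketch of the selenium approach (the function names are mine, and the img.rg_i CSS selector is an assumption based on current markup, which Google changes often, so it may need updating):

```python
def build_search_url(query):
    """Build a Google Images search URL for the given query string."""
    from urllib.parse import quote_plus
    return "https://www.google.com/search?q=" + quote_plus(query) + "&tbm=isch"

def scrape_image_urls(query, limit=5):
    """Open the results page in a real browser and collect thumbnail URLs.

    Requires selenium and a matching chromedriver on PATH.
    The "img.rg_i" selector is an assumption and may break when
    Google updates its markup.
    """
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    try:
        driver.get(build_search_url(query))
        thumbs = driver.find_elements(By.CSS_SELECTOR, "img.rg_i")
        # Keep only thumbnails that actually expose a src attribute.
        return [t.get_attribute("src") for t in thumbs[:limit]
                if t.get_attribute("src")]
    finally:
        driver.quit()
```

Usage would be something like scrape_image_urls("puppies", limit=5), after which each returned URL can be downloaded with requests. Because selenium drives a real browser, the page's JavaScript runs and the thumbnails are present in the DOM, which is exactly what the urllib-based scraper in the question no longer gets.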
shawnin damnen
Updated on June 12, 2022
I'm trying to learn Python scraping and came across a program that scrapes a set number of images from Google image search results.
I changed it to fetch 5 images. It was working for a while, but it recently stopped working, showing output such as:
there are 0 images
import requests
import re
import urllib2
import os
import cookielib
import json
from bs4 import BeautifulSoup  # this import was missing from the original

def get_soup(url, header):
    return BeautifulSoup(urllib2.urlopen(urllib2.Request(url, headers=header)), 'html.parser')

query = raw_input("query image")  # you can change the query for the image here
image_type = "ActiOn"
query = query.split()
query = '+'.join(query)
url = "https://www.google.com/search?q=" + query + "&source=lnms&tbm=isch"
print url

# add the directory for your image here (raw string so the backslashes survive)
DIR = r"C:\Users\mynam\Desktop\WB"
header = {'User-Agent': "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/43.0.2357.134 Safari/537.36"}
soup = get_soup(url, header)

ActualImages = []  # contains the link for the large original image and its type
for a in soup.find_all("div", {"class": "rg_meta"}):
    link, Type = json.loads(a.text)["ou"], json.loads(a.text)["ity"]
    ActualImages.append((link, Type))
print "there are total", len(ActualImages), "images"

if not os.path.exists(DIR):
    os.mkdir(DIR)
DIR = os.path.join(DIR, query.split()[0])
if not os.path.exists(DIR):
    os.mkdir(DIR)

# download the images
for i, (img, Type) in enumerate(ActualImages[0:5]):
    try:
        # header is already a dict, so pass it directly
        req = urllib2.Request(img, headers=header)
        raw_img = urllib2.urlopen(req).read()
        cntr = len([i for i in os.listdir(DIR) if image_type in i]) + 1
        print cntr
        if len(Type) == 0:
            f = open(os.path.join(DIR, image_type + "_" + str(cntr) + ".jpg"), 'wb')
        else:
            f = open(os.path.join(DIR, image_type + "_" + str(cntr) + "." + Type), 'wb')
        f.write(raw_img)
        f.close()
    except Exception as e:
        print "could not load : " + img
        print e
There are no error logs; the folder gets created but stays empty, and the ActualImages list remains empty for some reason.