Clicking links with Python BeautifulSoup

Solution 1

So with help from the comments, I decided to just use urlopen like this:

from bs4 import BeautifulSoup
import urllib.request
import re

def getLinks(url):
    html_page = urllib.request.urlopen(url)
    soup = BeautifulSoup(html_page, "html.parser")
    links = []

    for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
        links.append(link.get('href'))
    return links

anchors = getLinks("http://madisonmemorial.org/")
for anchor in anchors:
    happens = urllib.request.urlopen(anchor)
    if happens.getcode() == 404:
        pass  # Do stuff

# Click on links and return responses
countMe = len(anchors)
for anchor in anchors:
    childLinks = getLinks(anchor)
    countMe += len(childLinks)
    for childLink in childLinks:
        happens = urllib.request.urlopen(childLink)
        if happens.getcode() == 404:
            pass  # Do some stuff

print(countMe)

I've got my own logic in the if statements.
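
One caveat: urllib.request.urlopen raises urllib.error.HTTPError for 4xx/5xx responses instead of returning them, so getcode() never actually reports a 404. A rough sketch of catching it (the getStatus helper name is just for illustration, not part of the code above):

import urllib.request
import urllib.error

def getStatus(url):
    # urlopen raises HTTPError for 4xx/5xx responses,
    # so catch the exception and read the code from it.
    try:
        return urllib.request.urlopen(url).getcode()
    except urllib.error.HTTPError as err:
        return err.code

if getStatus("http://madisonmemorial.org/") == 404:
    pass  # Do stuff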

Solution 2

Urlopen is a better solution for your purpose, but if you need to click and interact with elements on the web, I suggest using Selenium WebDriver. There are implementations for Java, Python, and other languages. I've used it with Java and Python, and it works well. You can run it headless so the browser doesn't actually open.

pip install selenium
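
For example, a minimal headless sketch (assuming Chrome and a recent Selenium 4 install; the asker's URL is used, everything else here is just illustrative):

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument("--headless")           # no visible browser window
driver = webdriver.Chrome(options=options)   # assumes Chrome/ChromeDriver is available

driver.get("http://madisonmemorial.org/")
anchors = driver.find_elements(By.TAG_NAME, "a")
print(len(anchors))    # count the links
anchors[0].click()     # actually "click" one (may fail if it isn't clickable)
driver.quit()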

Solution 3

BeautifulSoup is merely a DOM/HTML parser; it doesn't constitute a real (or, in your case, emulated) browser. For that purpose you could use Chrome or Selenium to emulate a real browser and crawl freely, which gives you the advantage of handling JavaScript. When that's not needed, you can use the widely available requests package to recursively crawl all the links:

import requests

for link in links:
    body = requests.get(link).text
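
A slightly fuller sketch of that idea (the crawl helper, depth limit, and seen dict are just illustrative, assuming requests and BeautifulSoup are installed):

import requests
from bs4 import BeautifulSoup

def crawl(url, seen=None, depth=2):
    # Recursively follow every http(s) link, recording each response code.
    seen = seen if seen is not None else {}
    if depth == 0 or url in seen:
        return seen
    response = requests.get(url)
    seen[url] = response.status_code           # e.g. 200, 404, ...
    soup = BeautifulSoup(response.text, "html.parser")
    for a in soup.find_all("a", href=True):
        if a["href"].startswith("http"):
            crawl(a["href"], seen, depth - 1)
    return seen

codes = crawl("http://madisonmemorial.org/")
print(len(codes), "links visited")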

Comments

  • Adam McGurk almost 2 years

    So I'm new to Python (I come from a PHP/JavaScript background), but I just wanted to write a quick script that crawls a website and all of its child pages, finds all a tags with href attributes, counts how many there are, and then clicks each link. I can count all of the links, but I can't figure out how to "click" the links and then return the response codes.

    from bs4 import BeautifulSoup
    import urllib2
    import re
    
    def getLinks(url):
        html_page = urllib2.urlopen(url)
        soup = BeautifulSoup(html_page, "html.parser")
        links = []
    
        for link in soup.findAll('a', attrs={'href': re.compile("^http://")}):
            links.append(link.get('href'))
        return links
    
    anchors = getLinks("http://madisonmemorial.org/")
    # Click on links and return responses
    countMe = len(anchors)
    for anchor in anchors:
        i = getLinks(anchor)
        countMe += len(i)
        # Click on links and return responses
    
    print countMe
    

    Is this even possible with BeautifulSoup?
    Also, I'm not looking for exact code; all I'm really looking for is a point in the right direction, like which function calls to use. Thanks!

    • PRMoureu over 6 years
      I don't think you can perform click actions with bs4; maybe take a look at Selenium? Otherwise you can use urllib2.urlopen with the new links.
    • Vinícius Figueiredo over 6 years
      If you want to click on them just to get the response code, you can simply use urllib2.urlopen with the URL in hand.
  • innicoder over 6 years
    I agree with everything said above. I'd also add that what the asker might be looking for is the requests library; you can make all kinds of requests (GET/POST/DELETE and such) with it. However, it can't execute JavaScript (at least not in any way known to me). But say you want to register, log in, or submit data: all of that can be done with just requests or urllib, and much faster than with Selenium (see the sketch after these comments).
  • OneCricketeer over 6 years
    I believe you're trying to do this github.com/jmcarp/robobrowser/blob/master/README.rst
  • OneCricketeer over 6 years
    Scrapy is a more common web crawler, though
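
Building on innicoder's comment, a rough sketch of submitting data with requests (the login URL and form field names are placeholders for illustration, not real endpoints):

import requests

session = requests.Session()

# Hypothetical login form; the URL and field names are placeholders.
login = session.post(
    "http://example.com/login",
    data={"username": "me", "password": "secret"},
)
print(login.status_code)  # e.g. 200 on success, 404 if the page is missing

# Subsequent requests reuse the session's cookies.
page = session.get("http://example.com/members")
print(page.status_code)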