Selenium versus BeautifulSoup for web scraping

Solution 1

Before answering your question directly, it's worth saying as a starting point: if all you need to do is pull content from static HTML pages, you should probably use an HTTP library (like Requests or the built-in urllib.request) with lxml or BeautifulSoup, not Selenium (although Selenium will probably be adequate too). The advantages of not using Selenium needlessly (there's a minimal sketch after the list below):

  • Bandwidth, and time to run your script. Using Selenium means fetching all the resources that would normally be fetched when you visit a page in a browser - stylesheets, scripts, images, and so on. This is probably unnecessary.
  • Stability and ease of error recovery. Selenium can be a little fragile, in my experience - even with PhantomJS - and creating the architecture to kill a hung Selenium instance and create a new one is a little more irritating than setting up simple retry-on-exception logic when using requests.
  • Potentially, CPU and memory usage - depending upon the site you're crawling, and how many spider threads you're trying to run in parallel, it's conceivable that either DOM layout logic or JavaScript execution could get pretty expensive.
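
A minimal sketch of the Requests-plus-BeautifulSoup approach; the URL and CSS selector are placeholders for illustration:

    import requests
    from bs4 import BeautifulSoup

    response = requests.get("https://example.com/articles")
    response.raise_for_status()  # fail fast instead of parsing an error page

    # Parse the static HTML and pull out the bits we care about.
    soup = BeautifulSoup(response.text, "html.parser")
    for heading in soup.select("h2.article-title"):  # hypothetical selector
        print(heading.get_text(strip=True))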

Note that a site requiring cookies to function isn't a reason to break out Selenium - you can easily create a URL-opening function that magically sets and sends cookies with HTTP requests using http.cookiejar (known as cookielib in Python 2).
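
For example, the standard library's http.cookiejar plugs straight into urllib.request (with Requests, a requests.Session object gives you the same behaviour automatically); the URL below is a placeholder:

    import urllib.request
    from http.cookiejar import CookieJar

    # Any request made through this opener stores and resends cookies automatically.
    jar = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(jar))

    html = opener.open("https://example.com/requires-cookies").read()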

Okay, so why might you consider using Selenium? Pretty much entirely to handle the case where the content you want to crawl is being added to the page via JavaScript, rather than baked into the HTML. Even then, you might be able to get the data you want without breaking out the heavy machinery. Usually one of these scenarios applies:

  • JavaScript served with the page has the content already baked into it. The JavaScript is just there to do the templating or other DOM manipulation that puts the content into the page. In this case, you might want to see if there's an easy way to pull the content you're interested in straight out of the JavaScript using regex.
  • The JavaScript is hitting a web API to load content. In this case, consider whether you can identify the relevant API URLs and just hit them yourself; this may be much simpler and more direct than actually running the JavaScript and scraping content off the web page (see the sketch after this list).
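
As a sketch of that second scenario: if your browser's developer tools show the page's JavaScript fetching JSON from some internal endpoint, you can often just request that endpoint directly. The URL and response shape here are entirely hypothetical:

    import requests

    # Endpoint discovered by watching the browser's network tab (hypothetical).
    api_url = "https://example.com/api/v1/items?page=1"
    data = requests.get(api_url, headers={"Accept": "application/json"}).json()

    for item in data["items"]:  # made-up response shape
        print(item["title"])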

If you do decide your situation merits using Selenium, use it in headless mode, which is supported by (at least) the Firefox and Chrome drivers. Web spidering doesn't ordinarily require actually graphically rendering the page, or using any browser-specific quirks or features, so a headless browser - with its lower CPU and memory cost and fewer moving parts to crash or hang - is ideal.
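
A minimal headless-Chrome sketch; the Firefox driver supports an equivalent option, and exact flags vary by Selenium and browser version:

    from selenium import webdriver

    options = webdriver.ChromeOptions()
    options.add_argument("--headless")  # no visible browser window

    driver = webdriver.Chrome(options=options)
    try:
        driver.get("https://example.com")
        print(driver.title)
    finally:
        driver.quit()  # always release the browser process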

Solution 2

I would recommend using Selenium for things such as interacting with web pages, whether in a full-blown browser or a browser in headless mode such as headless Chrome. Beautiful Soup, on the other hand, is better for inspecting the document and writing logic that depends on whether an element is found, or on what is found; you can then use Selenium to execute interactive tasks with the page if desired, as sketched below.
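
One common division of labour, sketched here with a placeholder URL and selectors: let Selenium drive the interaction, then hand the rendered HTML to BeautifulSoup for parsing:

    from selenium import webdriver
    from selenium.webdriver.common.by import By
    from bs4 import BeautifulSoup

    driver = webdriver.Chrome()
    try:
        driver.get("https://example.com")
        driver.find_element(By.ID, "load-more").click()  # hypothetical button id
        # A real script would wait for the new content (e.g. with WebDriverWait)
        # before grabbing the page source.
        soup = BeautifulSoup(driver.page_source, "html.parser")
        titles = [el.get_text(strip=True) for el in soup.select("h2.title")]
    finally:
        driver.quit()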

Comments

  • elie (almost 2 years ago)

    I'm scraping content from a website using Python. First I used BeautifulSoup and Mechanize, but I saw that the website had a button that created content via JavaScript, so I decided to use Selenium.

    Given that I can find elements and get their content using Selenium with methods like driver.find_element_by_xpath, what reason is there to use BeautifulSoup when I could just use Selenium for everything?

    And in this particular case, I need to use Selenium to click on the JavaScript button, so is it better to use Selenium to parse as well, or should I use both Selenium and Beautiful Soup?

    • crowder (almost 11 years ago)
      I've never used Selenium for parsing HTML before, but BeautifulSoup is quite robust. That said, since you're already using Selenium for something, why not use it for both until you know why you need something else?
    • elie (almost 11 years ago)
      @crowder Yes, for the moment I am doing everything with Selenium, but I used BS before, so I would like to know which is really better. Concerning the parsing, I think it is much the same - for example, with Selenium we have driver.find_elements_by_class_name("...") and with BS we have soup.findAll('...', {'class':'....'}). I saw that BS is a little slow; what do you think?
    • Yi Zeng (almost 11 years ago)
      @elie: It's really about what you are trying to achieve. Selenium is designed as a web UI automation tool; it's overkill if you just want to parse HTML. If you need some kind of interaction with the page, use Selenium.
    • elie (almost 11 years ago)
      @user1177636 Thanks. For the moment I'm just parsing HTML and simulating a click on the JavaScript button to get the full source code. Maybe later I will have some interaction between pages, but I'm not sure yet. So if I'm just parsing, do you think it is better to do it with BS and to use Selenium only for the JavaScript button?
    • Corey Goldberg (almost 6 years ago)
      Selenium launches a browser - that's a lot of overhead for tasks that can be done without one.