Using BeautifulSoup where authentication is required


Solution 1

Have you considered using mechanize?

import mechanize
import cookielib
from bs4 import BeautifulSoup

# A cookie jar keeps the session cookies set by the login form
cook = cookielib.CookieJar()
req = mechanize.Browser()
req.set_cookiejar(cook)

# Open the page that presents the login form
req.open("http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1")

# Fill in and submit the first form on the page (the login form)
req.select_form(nr=0)
req.form['username'] = 'username'
req.form['password'] = 'password'
req.submit()

# The browser is now authenticated for subsequent requests
html = req.response().read()
print html
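
Since the goal is to use BeautifulSoup, you can hand the authenticated response straight to it. A short continuation of the snippet above (which already imports BeautifulSoup):

soup = BeautifulSoup(html, "html.parser")
print soup.prettify()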

EDIT

If you run into robots.txt issues and you have permission to circumvent them, take a look at this answer for techniques: https://stackoverflow.com/questions/13303449/urllib2-httperror-http-error-403-forbidden
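
For mechanize specifically, the usual technique (again, only where you are allowed to ignore robots.txt) is to tell the Browser not to honour it and to send a browser-like User-Agent, both before calling open(). A minimal sketch:

req.set_handle_robots(False)                       # stop mechanize from checking robots.txt
req.addheaders = [('User-Agent', 'Mozilla/5.0')]   # present a browser-like User-Agent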

Solution 2

If you are using BeautifulSoup and requests on Python 3.x, just use this:

from bs4 import BeautifulSoup
import requests

r = requests.get('URL', auth=('USER_NAME', 'PASSWORD'))
soup = BeautifulSoup(r.content, 'html.parser')
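
Note that auth=('USER_NAME', 'PASSWORD') sends HTTP Basic credentials. If the server instead expects NTLM (the Windows-style login pop-up described in the question suggests this), the same auth= hook accepts requests_ntlm's HttpNtlmAuth. A sketch, with the URL and domain as placeholders:

from bs4 import BeautifulSoup
from requests_ntlm import HttpNtlmAuth
import requests

# NTLM user names usually take the form 'DOMAIN\\username'
r = requests.get('URL', auth=HttpNtlmAuth('DOMAIN\\USER_NAME', 'PASSWORD'))
soup = BeautifulSoup(r.content, 'html.parser')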


Author: Pooja Prabhu
A newbie at coding! A freelancer by day, a writer by night. Here to learn from every source possible!

Updated on June 04, 2022

Comments

  • Pooja Prabhu (almost 2 years ago)

    I am scraping LAN data using BeautifulSoup4 and Python requests for a company project. Since the site has a login interface, I am not authorized to access the data. The login interface is a pop-up that doesn't allow me to access the page source or inspect the page elements without logging in. The error I get is this:

    Access Error: Unauthorized
    Access to this document requires a User ID

    This is a screenshot of the pop-up box (the blackened part is sensitive information). It has no information about the HTML tags at all, hence I cannot auto-login via Python.

    I have tried requests_ntlm, selenium, Python requests and even ParseHub, but none of them worked. I have been stuck in this phase for a month now! Any help would be appreciated.

    Below is my initial code:

    import requests
    from requests_ntlm import HttpNtlmAuth
    from bs4 import BeautifulSoup
    from urllib.request import Request, urlopen

    # Plain GET without credentials (the URL is only a placeholder)
    r = requests.get("http://www.amazon.in")

    # Fetch the page with a browser-like User-Agent via urllib
    req = Request('http://www.cmegroup.com/trading/products/#sortField=oi&sortAsc=false&venues=3&page=1&cleared=1&group=1', headers={'User-Agent': 'Mozilla/5.0'})
    webpage = urlopen(req).read()
    print(r.content)

    # Retry with NTLM credentials
    r = requests.get("http://www.amazon.in", auth=HttpNtlmAuth('user_name', 'passwd'))
    print(r.content)

    s_data = BeautifulSoup(r.content, "lxml")
    print(s_data.contents)
    

    Error: Document Error: Unauthorized

    Access Error: Unauthorized

    Access to this document requires a User ID

    This is the error I get when BeautifulSoup tries to access the data after I have manually logged into the site.

    • Siva (over 6 years ago)
      Possibly related: stackoverflow.com/questions/36488023/… (unable to log in to Amazon using Python)
    • Pooja Prabhu (over 6 years ago)
      Clarification: the website URL I provided is just a sample. I am not allowed to disclose the real one since it's a private site.
    • Admin (over 6 years ago)
      Using urllib and requests alone won't help much where logins are required. Use requests.Session. If you're unfamiliar with it, see docs.python-requests.org/en/master/user/advanced (a sketch of the Session approach is at the end of this thread). Also, open the developer tools in your browser and study the Network tab to see what requests the browser sends, with what query strings and headers. A User-Agent header alone won't suffice in some cases. You'll have to provide a bit more information for us to help further. Perhaps consider posting the contents of the Network tab, masking the sensitive information by replacing it with fake values.
    • Admin (over 6 years ago)
      Also, BeautifulSoup has nothing to do with the problem you're facing, so consider editing your post to make it clearer.
  • Pooja Prabhu (over 6 years ago)
    I tried it. The error I got is this: mechanize._response.httperror_seek_wrapper: HTTP Error 403: request disallowed by robots.txt
  • Paula Livingstone (over 6 years ago)
    Hmm, OK. Well, if you plan to monetise your work then you will have problems scraping this site. What does their robots.txt say?
  • Pooja Prabhu (over 6 years ago)
    I am not going to monetize this site. It is for inter-organizational use only. I do not have access to the robots.txt.
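
For completeness, here is a minimal sketch of the requests.Session approach suggested in the comments above, assuming the site uses an ordinary HTML login form. login_url, protected_url, and the field names are placeholders you would read from the login form's HTML or from the browser's Network tab:

from bs4 import BeautifulSoup
import requests

login_url = 'https://example.com/login'        # hypothetical form action URL
protected_url = 'https://example.com/report'   # hypothetical page behind the login

with requests.Session() as s:                  # the Session carries cookies across requests
    s.post(login_url, data={'username': 'USER_NAME', 'password': 'PASSWORD'})
    r = s.get(protected_url)

soup = BeautifulSoup(r.content, 'html.parser')
print(soup.title)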