AttributeError: 'NoneType' object has no attribute 'strip' with Python WebCrawler

Solution 1

When you do

request = urllib2.Request(new_url)

in crawl(), new_url is None. As you're getting new_url from get_more_tweets(new_soup), that means get_more_tweets() is returning None.

That means the return d statement is never reached, which means that either str(b) == 'more' was never true, or soup.findAll() didn't return any links, so the for link in links loop body never runs.
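
If it's not obvious which of those two cases you're hitting, a couple of temporary print statements inside get_more_tweets() will show what findAll() actually returns. A debugging sketch of the question's function (the prints are the only additions):

    def get_more_tweets(soup):
        links = soup.findAll('a', {'href': True}, {'id': 'more_link'})
        print 'candidate links found: %d' % len(links)  # 0 means the loop never runs
        for link in links:
            b = link.renderContents()
            print 'link text: %r' % b                   # is any of these exactly 'more'?
            if str(b) == 'more':
                return 'http://mobile.twitter.com' + link['href']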

Solution 2

AttributeError: 'NoneType' object has no attribute 'strip'

It means exactly what it says: url.strip() requires first figuring out what url.strip is, i.e. looking up the strip attribute of url. This failed because url is a 'NoneType' object, i.e. an object whose type is NoneType, i.e. the special object None.

Presumably url was expected to be a str, i.e. a text string, since those do have a strip attribute.
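
You can reproduce both cases in isolation in an interactive session:

    >>> '  http://example.com '.strip()  # str objects have a strip attribute
    'http://example.com'
    >>> url = None
    >>> url.strip()                      # None does not
    Traceback (most recent call last):
      ...
    AttributeError: 'NoneType' object has no attribute 'strip'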

This happened within File "C:\Python28\lib\urllib.py", i.e., the urllib module. That's not your code, so we look backwards through the exception trace until we find something we wrote: request = urllib2.Request(new_url). We can only presume that the new_url that we pass to the urllib2 module eventually becomes a url variable somewhere within urllib.

So where did new_url come from? We look up the line of code in question (notice that there is a line number in the exception traceback), and we see that the immediately preceding line is new_url = get_more_tweets(new_soup), so we're using the result of get_more_tweets.

An analysis of this function shows that it searches through some links, tries to find one labelled 'more', and gives us the URL of the first such link it finds. The case we haven't considered is when there are no such links. In that case, the function simply reaches its end and implicitly returns None (Python functions have no declared return type, and every call must produce a value, so a function that falls off the end without an explicit return returns None), which is where that value is coming from.
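
A tiny illustration of that rule, with a made-up function (find_first_more is not part of the question's code):

    def find_first_more(labels):
        for label in labels:
            if label == 'more':
                return label
        # no explicit return here: falling off the end yields None

    print find_first_more(['older', 'newer'])  # prints: None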

Presumably, if there is no 'more' link, then we should not be attempting to follow the link at all. Therefore, we fix the error by explicitly checking for this None return value, and skipping the urllib2.Request in that case, since there is no link to follow.
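
Concretely, that means adding a None check to the loop in crawl(). A minimal sketch, reusing the question's own helpers (get_more_tweets, getSoup, recordlinks, checkforstamp) unchanged:

    currenttime = False
    while currenttime == False:
        new_url = get_more_tweets(new_soup)
        if new_url is None:
            break  # no 'more' link on this page, so there is no next page to fetch
        request = urllib2.Request(new_url)
        response = urllib2.urlopen(request)
        new_soup = getSoup(response)
        recordlinks(new_soup, output)
        currenttime = checkforstamp(new_soup)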

By the way, this None value would be a more idiomatic "placeholder" value for the not-yet-determined currenttime than the False value that you are currently using. You might also consider being a little more consistent about separating words with underscores in your variable and method names to make things easier to read. :)
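
For example (check_for_stamp and pages here are illustrative stand-ins, not the question's code):

    def check_for_stamp(page):
        # True once an old timestamp is seen; otherwise the function falls
        # off the end and implicitly returns None, meaning 'not determined yet'
        if '3 months ago' in page:
            return True

    current_time = None                # None reads as 'not yet determined'
    pages = ['2 hours ago', '5 weeks ago', '3 months ago']
    for page in pages:
        current_time = check_for_stamp(page)
        if current_time is not None:   # identity test is the usual idiom for None
            break

    print current_time                 # prints: True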

Comments

  • snehoozle over 1 year

    I'm writing a Python program to crawl Twitter using a combination of urllib2, the Python Twitter API wrapper, and BeautifulSoup. However, when I run my program, I get an error of the following type:

    ray_krueger RafaelNadal

    Traceback (most recent call last):
      File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 78, in <module>
        crawl(start_follower, output, depth)
      File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
        crawl(y, output, in_depth - 1)
      File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 74, in crawl
        crawl(y, output, in_depth - 1)
      File "C:\Users\Public\Documents\Columbia Job\Python Crawler\Twitter Crawler\crawlerversion9.py", line 64, in crawl
        request = urllib2.Request(new_url)
      File "C:\Python28\lib\urllib2.py", line 192, in __init__
        self.__original = unwrap(url)
      File "C:\Python28\lib\urllib.py", line 1038, in unwrap
        url = url.strip()
    AttributeError: 'NoneType' object has no attribute 'strip'
    

    I'm completely unfamiliar with this type of error (I'm new to Python), and searching for it online has yielded very little information. I've attached my code as well; do you have any suggestions?

    Thanx Snehizzy

    import twitter
    import urllib
    import urllib2
    import htmllib
    from BeautifulSoup import BeautifulSoup
    import re
    
    start_follower = "NYTimeskrugman" 
    depth = 3
    output = open(r'C:\Python27\outputtest.txt', 'a') # better to use an SQL database than this
    
    api = twitter.Api()
    
    #want to also begin entire crawl with some sort of authentication service 
    
    def site(follower):
        followersite = "http://mobile.twitter.com/" + follower
        return followersite
    
    def getPage(follower): 
        thisfollowersite = site(follower)
        request = urllib2.Request(thisfollowersite)
        response = urllib2.urlopen(request)
        return response
    
    def getSoup(response): 
        html = response.read()
        soup = BeautifulSoup(html)
        return soup
    
    def get_more_tweets(soup): 
        links = soup.findAll('a', {'href': True}, {'id': 'more_link'})
        for link in links:
            b = link.renderContents()
            if str(b) == 'more':
                c = link['href']
            d = 'http://mobile.twitter.com' + c
                return d
    
    def recordlinks(soup,output):
        tags = soup.findAll('div', {'class': "list-tweet"}) # to obtain tweets of a follower
        for tag in tags: 
            a = tag.renderContents()
            b = str(a)
            output.write(b)
            output.write('\n\n')
    
    def checkforstamp(soup):
        times = soup.findAll('a', {'href': True}, {'class': 'status_link'})
        for time in times:
            stamp = time.renderContents()
            if str(stamp) == '3 months ago':
                return True
    
    def crawl(follower, output, in_depth):
        if in_depth > 0:
            output.write(follower)
            a = getPage(follower)
            new_soup = getSoup(a)
            recordlinks(new_soup, output)
            currenttime = False 
            while currenttime == False:
                new_url = get_more_tweets(new_soup)
                request = urllib2.Request(new_url)
                response = urllib2.urlopen(request)
                new_soup = getSoup(response)
                recordlinks(new_soup, output)
                currenttime = checkforstamp(new_soup)
            users = api.GetFriends(follower)
            for u in users[0:5]:
                x = u.screen_name 
                y = str(x)
                print y
                crawl(y, output, in_depth - 1)
                output.write('\n\n')
            output.write('\n\n\n')
    
    crawl(start_follower, output, depth)
    print("Program done. Look at output file.")
    
    • snehoozle over 12 years
      The crawler essentially works by first identifying a follower and using BeautifulSoup to parse his/her page until I run into tweets that are 3 months old. Then it goes to the first five followers of each follower, and so on, repeating the same process until it hits the depth that I specified.
  • snehoozle over 12 years
    Thanks! I just realized that, the way I wrote my code, I assumed each Twitter user would have more than one page of tweets. However, this does not appear to be the case for the 4th person I hit after crawling the tweets of the first three. Hence, when I get to that 4th user and my crawler attempts to find the "more" link that provides more tweets, it doesn't find one. It then returns None, which causes the ultimate error. I'll try taking this into account in my code and keep you updated.
  • snehoozle over 12 years
    Scratch that. I just realized it was the second user, Rafael Nadal, who was new to Twitter and hence only had one page of tweets... Ha!