how to take all tweets in a hashtag with tweepy?


Solution 1

Sorry, I can't answer in a comment, it's too long. :)

Sure :) Check this example: an advanced search for the #data keyword from May 2015 to July 2016 gives this URL: https://twitter.com/search?l=&q=%23data%20since%3A2015-05-01%20until%3A2016-07-31&src=typd

import requests

session = requests.Session()
keyword = 'data'
date1 = '2015-05-01'
date2 = '2016-07-31'
# build the search URL from the keyword and date range
url = ('https://twitter.com/search?l=&q=%23{0}%20since%3A{1}%20until%3A{2}&src=typd'
       .format(keyword, date1, date2))
response = session.get(url, stream=True)

Now we have the requested tweets. You will probably run into problems with pagination, though. The pagination URL looks like this:

https://twitter.com/i/search/timeline?vertical=news&q=%23data%20since%3A2015-05-01%20until%3A2016-07-31&src=typd&include_available_features=1&include_entities=1&max_position=TWEET-759522481271078912-759538448860581892-BD1UO2FFu9QAAAAAAAAETAAAAAcAAAASAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA&reset_error_state=false

You could probably put in a random tweet id, parse the first page, or request some more data from Twitter to get a valid position token. It can be done.

Use the Network tab in Chrome's developer tools to find all of the request details :)
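
As a rough sketch of how that pagination could be driven from requests (the query string is the one shown above; the response fields 'items_html' and 'min_position', and starting with an empty max_position, are assumptions about how this endpoint behaves, not something documented here):

import requests

session = requests.Session()
base = ('https://twitter.com/i/search/timeline?vertical=news'
        '&q=%23data%20since%3A2015-05-01%20until%3A2016-07-31&src=typd'
        '&include_available_features=1&include_entities=1&reset_error_state=false')

max_position = ''  # empty on the first request, then taken from each response
while True:
    url = base + ('&max_position=' + max_position if max_position else '')
    data = session.get(url).json()
    # 'items_html' and 'min_position' are assumed response fields
    html = data.get('items_html', '')
    if not html.strip():
        break  # no more tweets in this range
    # ... parse the tweets out of `html` here (e.g. with an HTML parser) ...
    max_position = data.get('min_position', '')
    if not max_position:
        break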

Solution 2

Have a look at this: https://tweepy.readthedocs.io/en/v3.5.0/cursor_tutorial.html

And try this:

import tweepy

auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_SECRET)
api = tweepy.API(auth)

# count=100 requests 100 results per page (older examples call this parameter rpp)
for tweet in tweepy.Cursor(api.search, q='#python', count=100).items():
    # Do something with each tweet
    pass

In your case you have a max number of tweets to get, so as per the linked tutorial you could do:

import tweepy

MAX_TWEETS = 5000000000000000000000

auth = tweepy.OAuthHandler(CONSUMER_TOKEN, CONSUMER_SECRET)
api = tweepy.API(auth)

for tweet in tweepy.Cursor(api.search, q='#python', count=100).items(MAX_TWEETS):
    # Do something with each tweet
    pass

If you only want tweets newer than a given ID, you can also pass the since_id argument.
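
For example, a minimal sketch (LAST_SEEN_ID is a hypothetical placeholder for a tweet id you already hold; since_id is the standard Twitter search parameter for this):

LAST_SEEN_ID = 1234567890  # hypothetical: newest tweet id you have already processed

for tweet in tweepy.Cursor(api.search, q='#python', count=100,
                           since_id=LAST_SEEN_ID).items(MAX_TWEETS):
    # only tweets with an id greater than LAST_SEEN_ID are returned
    pass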

Solution 3

This code worked for me.

import tweepy
import pandas as pd

# Twitter access (replace the placeholders with your own credentials)
auth = tweepy.OAuthHandler('xxx', 'xxx')
auth.set_access_token('xxx-xxx', 'xxx')
api = tweepy.API(auth, wait_on_rate_limit=True)

msgs = []

# fetch the first 10 tweets matching #bmw and keep the fields we care about
for tweet in tweepy.Cursor(api.search, q='#bmw', count=100).items(10):
    msgs.append((tweet.text, tweet.source, tweet.source_url))

df = pd.DataFrame(msgs, columns=['text', 'source', 'url'])
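
To inspect or keep the result, a small follow-up could be (the CSV file name is just an example):

print(df.head())                          # quick look at the collected tweets
df.to_csv('bmw_tweets.csv', index=False)  # persist them; file name is arbitrary
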
Comments

  • shuetisha.dev, almost 2 years ago

    I'm trying to collect every public tweet in a hashtag, but my code does not get further than 299 tweets.

    I'm also trying to get tweets from a specific time range, for example only tweets between May 2015 and July 2016. Is there any way to do it in the main process, or should I write a little extra code for it?

    Here is my code:

    # if this is the first run, create a new array which
    # will store the max tweet id seen for each keyword
    if not os.path.isfile("max_ids.npy"):
        max_ids = np.empty(len(keywords))
        # every value is initialized to -1 so that mining starts from the beginning on the first run
        max_ids.fill(-1)
    else:
        max_ids = np.load("max_ids.npy")  # loads the previous max ids
    
    # if any new keywords have been added, extend the max_ids array so there is an entry for every keyword
    if len(keywords) > len(max_ids):
        new_indexes = np.empty(len(keywords) - len(max_ids))
        new_indexes.fill(-1)
        max_ids = np.append(arr=max_ids, values=new_indexes)
    
    count = 0
    for i in range(len(keywords)):
        since_date="2015-01-01"
        sinceId = None
        tweetCount = 0
        maxTweets = 5000000000000000000000  # maximum tweets to find per keyword
        tweetsPerQry = 100
        searchQuery = "#{0}".format(keywords[i])
        while tweetCount < maxTweets:
            if max_ids[i] < 0:
                if not sinceId:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry)
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            since_id=sinceId)
            else:
                if not sinceId:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(int(max_ids[i]) - 1))
                else:
                    new_tweets = api.search(q=searchQuery, count=tweetsPerQry,
                                            max_id=str(int(max_ids[i]) - 1),
                                            since_id=sinceId)
            if not new_tweets:
                print("Keyword: {0}      No more tweets found".format(searchQuery))
                break
            for tweet in new_tweets:
                count += 1
                print(count)
    
                file_write.write(
                           .
                           .
                           .
                             )
    
                item = {
                    .
                    .
                    .
                    .
                    .
                }
    
                # instead of using mongo's id for _id, using tweet's id
                raw_data = tweet._json
                raw_data["_id"] = tweet.id
                raw_data.pop("id", None)
    
                try:
                    db["Tweets"].insert_one(item)
                except pymongo.errors.DuplicateKeyError as e:
                    print("Already exists in 'Tweets' collection.")
                try:
                    db["RawTweets"].insert_one(raw_data)
                except pymongo.errors.DuplicateKeyError as e:
                    print("Already exists in 'RawTweets' collection.")
    
            tweetCount += len(new_tweets)
            print("Downloaded {0} tweets".format(tweetCount))
            max_ids[i] = new_tweets[-1].id
    
    np.save(arr=max_ids, file="max_ids.npy")  # save so that mining can continue from where it left off the next time the program runs