Collecting Twitter Data: Using a Python Stream Listener
Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8
I use the term stream listener [2 words] to refer to the program built with this code and StreamListener [1 word] to refer to the specific class from the tweepy package. The two are related but not the same. The StreamListener class makes the stream listener program what it is, but the program entails more than the class.
While using R and its streamR package to scrape Twitter data works well, Python allows more customization than R does. It also has a steeper learning curve, because the coding is more involved. Before using Python to scrape Twitter data, a software package like tweepy must be installed. If you have the pip installer on your system, the installation procedure is rather easy and executed in the Terminal.
Call Tweepy Library
Terminal:
$ pip install tweepy
After the software package is installed, you can start writing a stream listener script. First, the libraries have to be imported.
import time
import io
from tweepy import Stream
from tweepy import OAuthHandler
from tweepy.streaming import StreamListener
import os
The three tweepy class imports will be used to construct the stream listener, the time library will be used to create a time-out feature for the script, the io library will be used to write the output file, and the os library will be used to set your working directory.
Set Variable Values
Before diving into constructing the stream listener, let’s set some variables. These variables will be used in the stream listener by being fed into the tweepy objects. I code them as variables instead of writing them directly into the functions so that they can be easily changed.
ckey = '**CONSUMER KEY**'
consumer_secret = '**CONSUMER SECRET KEY**'
access_token_key = '**ACCESS TOKEN**'
access_token_secret = '**ACCESS TOKEN SECRET**'

start_time = time.time()  #grabs the system time
keyword_list = ['twitter']  #track list
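A side note on the credential variables: hard-coding keys into a script that might end up in a public repository is an easy way to leak them. As a hedged alternative (the environment variable names below are my own, not anything the Twitter API requires), you could pull the credentials from environment variables with the os library the script already imports:

import os

# Hypothetical environment variable names -- export them in the Terminal first,
# e.g. $ export TWITTER_CONSUMER_KEY='your key here'
ckey = os.environ.get('TWITTER_CONSUMER_KEY', '**CONSUMER KEY**')
consumer_secret = os.environ.get('TWITTER_CONSUMER_SECRET', '**CONSUMER SECRET KEY**')
access_token_key = os.environ.get('TWITTER_ACCESS_TOKEN', '**ACCESS TOKEN**')
access_token_secret = os.environ.get('TWITTER_ACCESS_TOKEN_SECRET', '**ACCESS TOKEN SECRET**')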
Using and Modifying the Tweepy Classes
I believe that tweet scraping with Python has a steeper learning curve than with R, because Python depends on combining instances of different classes. If you don’t understand the basics of object-oriented programming, it might be difficult to comprehend what the code is accomplishing or how to manipulate it. The code I show in this post does the following:
- Creates an OAuthHandler instance to handle OAuth credentials
- Creates a listener instance with start time and time limit parameters passed to it
- Creates a Stream instance with the OAuthHandler instance and the listener instance
Before these instances are created, we have to “modify” the StreamListener class by creating a child class that outputs the data into a .json file.
#Listener Class Override
class listener(StreamListener):

    def __init__(self, start_time, time_limit=60):
        self.time = start_time
        self.limit = time_limit
        self.tweet_data = []

    def on_data(self, data):
        saveFile = io.open('raw_tweets.json', 'a', encoding='utf-8')

        #keep collecting tweets until the time limit is reached
        while (time.time() - self.time) < self.limit:
            try:
                self.tweet_data.append(data)
                return True
            except BaseException, e:
                print 'failed ondata,', str(e)
                time.sleep(5)
                pass

        #time limit reached: write the collected tweets to disk as a JSON array
        saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
        saveFile.write(u'[\n')
        saveFile.write(','.join(self.tweet_data))
        saveFile.write(u'\n]')
        saveFile.close()
        exit()

    def on_error(self, status):
        print status
This is the most complicated section of the code. It overrides the actions taken when the listener instance receives data [the tweet JSON].
saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
saveFile.write(u'[\n')
saveFile.write(','.join(self.tweet_data))
saveFile.write(u'\n]')
saveFile.close()
This block of code opens an output file, writes the opening square bracket, writes the JSON data as text separated by commas, inserts a closing square bracket, and closes the document. This is standard JSON format, with each Twitter object acting as an element in a JavaScript array. If you bring this file into R or Python, the built-in JSON parsers [such as Python's json library] can properly handle it.
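As a quick check that the output really is parseable, here is a minimal sketch of reading the file back in with Python's json library [assuming the stream listener above has already run and captured at least one tweet]:

import io
import json

#load the bracketed, comma-separated tweets back in as a list of dicts
with io.open('raw_tweets.json', 'r', encoding='utf-8') as f:
    tweets = json.load(f)

print(len(tweets))         #number of tweets collected
print(tweets[0]['text'])   #text of the first tweet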
This section can be modified to change or add to the JSON output. For example, you can place other properties/fields, like a UNIX time stamp or a random variable, into the JSON. You can also modify the output file, or eliminate the output file entirely and insert the tweets directly into a MongoDB database. As it is written, this will produce a file that can be parsed by Python's json library.
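To make that concrete, here is a minimal sketch of one such variation: a listener that parses each incoming tweet and stamps it with the UNIX time it was received before storing it. The captured_at field name is my own choice, not part of the Twitter payload, and the rest of the logic mirrors the class above.

import io
import json
import time

from tweepy.streaming import StreamListener


class timestamped_listener(StreamListener):

    def __init__(self, start_time, time_limit=60):
        self.time = start_time
        self.limit = time_limit
        self.tweet_data = []

    def on_data(self, data):
        if (time.time() - self.time) < self.limit:
            tweet = json.loads(data)            #parse the raw tweet JSON
            tweet['captured_at'] = time.time()  #hypothetical extra field: UNIX capture time
            self.tweet_data.append(json.dumps(tweet))
            return True
        #time limit reached: write the collected tweets as a JSON array and stop
        saveFile = io.open('raw_tweets.json', 'w', encoding='utf-8')
        saveFile.write(u'[\n')
        saveFile.write(u','.join(self.tweet_data))
        saveFile.write(u'\n]')
        saveFile.close()
        return False  #returning False tells tweepy to disconnect the stream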
After the child class is created we can create the instances and start the stream listener.
auth = OAuthHandler(ckey, consumer_secret)  #OAuth object
auth.set_access_token(access_token_key, access_token_secret)

twitterStream = Stream(auth, listener(start_time, time_limit=20))  #initialize Stream object with a time out limit
twitterStream.filter(track=keyword_list, languages=['en'])  #call the filter method to run the Stream Object
Here the OAuthHandler uses your API keys [consumer key & consumer secret key] to create the auth object. The access token, which is unique to an individual user [not an application], is set in the following line. Unlike the filterStream() function in R, this takes all four of your credentials from the Twitter Dev site. The modified StreamListener class, simply called listener, is used to create a listener instance; it contains the instructions for what to do with the data once it comes back from the Twitter API call. Both the listener and auth instances are used to create the Stream instance, which combines the authentication credentials with the instructions on what to do with the retrieved data. The Stream class also contains a method for filtering the Twitter stream, and this method works just like the R filterStream() function, taking similar parameters, because the parameters are passed to the Stream API call.
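As a usage note, changing what gets collected only requires changing the arguments. Continuing from the variables and classes defined above, tracking several terms for five minutes instead of twenty seconds might look like this:

keyword_list = ['twitter', 'tweepy', 'python']  #any list of track terms

twitterStream = Stream(auth, listener(time.time(), time_limit=300))  #5 minute window
twitterStream.filter(track=keyword_list, languages=['en'])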
Python vs R
At this stage in the tutorial, I would recommend parsing this data using the parser in R from the last section of the Twitter tutorial, or creating your own. Since it's easier to customize the StreamListener methods in Python, I prefer it over R. Generally, I think Python works better for collecting and processing data, but it isn't as easy to use for most statistical analysis. Since tweet scraping falls into the data-collection category, I like Python for it. It also becomes easier to access databases and manipulate the data when you are already working in Python.
11-10-2015 -- I've updated the StreamListener to output properly formatted JSON. The old script, which works well with R's tweetParse, is still available on my GitHub.