Collecting Twitter Data: Storing Tweets in MongoDB

📅January 30, 2015

In the first three sections of the Twitter data collection tutorial, I demonstrated how to collect tweets using both R and Python and how to store these tweets first as JSON files then having R parse them into a .csv file. The .csv file works well, but tweets don’t always make good flat .csv files, since not every tweet contains the same fields or the same structure. Some of the data is well nested into the JSON object. It is possible to write a parser that has a field for each possible subfield, but this might take a while to write and will create a rather large .csv file or SQL database.

MongoDB

Fortunately, NoSQL databases like MongoDB exist and it greatly simplifies tweet storage, search, and recall eliminating the need of a tweet parser. Installation and setup of MongoDB and the pymongo library is beyond the scope of this tutorial, but I can quickly explain what MongoDB does. It is a document-based database that uses documents instead of tuples in tables to store data. These documents look like just like JSON objects using key-value pairs, but they are called BSON [since it’s stored as binary]. From a programming prospective, they have similar properties as both JS objects and Python dictionaries.

Since JSON and BSON are so similar, storing a tweet in a MongoDB database is as easy as putting the entire content of the tweet’s JSON string into an insert statement. Recalling or searching the tweets is rather simple as well; it does require an OOP mindset over the traditional SQL command structure.

[I’m writing this from the perspective from ad hoc small-scale research. There might be performance issues that make other storage options much more desirable. Knowing the specific metadata from a tweet you want to keep will make any analysis faster or require less store space. MongoDB allows you to store all the information the API returns to you.]

Storing Tweets in MongoDB

I am going to assume that you have MongoDB running on your local computer for all the code examples.

Storing tweets is rather simple if you already have the Python stream listener built from Part III of the tutorial, since there are only a few changes to be made to the code. The first change will be calling the libraries: pymongo and json. The json library is available by default in Python, but you’ll have to install pymongo using pip install pymongo if you have the pip installer. The bulk of the changes will be in the listener child class.

from pymongo import MongoClient
import json


class listener(StreamListener):

def __init__(self, start_time, time_limit=60):

self.time = start_time
self.limit = time_limit

def on_data(self, data):

while (time.time() - self.time) < self.limit:

try:


client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
tweet = json.loads(data)

collection.insert(tweet)

return True


except BaseException, e:
print 'failed ondata,', str(e)
time.sleep(5)
pass

exit()

def on_error(self, status):
print statuses

The major change in the code includes:

client = MongoClient('localhost', 27017)
db = client['twitter_db']
collection = db['twitter_collection']
tweet = json.loads(data)

collection.insert(tweet)

MongoClient creates the MongoClient instance which will be used to interface with the database. The client[‘twitter_db’] call designates the database that is going to be used, and the db[‘twitter_collection’] call selects the collection where the documents will be stored. The json.loads() call converts the string returned from the Twitter API into a json object in Python. Finally, the collection.insert() call inserts the json object into the MongoDB database. From this rather simple change to the Python stream listener all the tweets can be saved into a MongoDB database.

Recalling Tweets from MongoDB

Recalling the tweets from MongoDB database is not too difficult if you understand the basics of Python for loops and dictionaries. The function to retrieve any documents from the database is collection.find(). You are able to specify what you want to find or leave it blank and get all the documents (tweets) returned. For this example, I’ll first just leave it blank to get all the tweets.

After calling the .find() method, Python will return a MongoDB cursor, which can be iterated through in a for loop. The for loop runs the loop for each object in the iterator. If you wanted to print the text from every tweet you would write:

tweets_iterator = collection.find()
for tweet in tweets_iterator:
  print tweet['text']

tweet contains one document [or in this case a tweet JSON object] in the sequence that tweet_iterator produces. The loop will change this document to another one until the for loop runs through every document in the iterator.

Since tweets in JSON format contain many subdocuments, it’s important to know what data you are looking and where to find it. The following code snippet is an example of different fields available to examine.

text = tweet['text']
user_screen_name = tweet['user']['screen_name']
user_name = tweet['user']['name']
retweet_count = tweet['retweeted_status']['retweet_count']
retweeted_name = tweet['retweeted_status']['user']['name']
retweeted_screen_name = tweet['retweeted_status']['user']['screen_name']

The last [‘field’] represents a property and any [‘fields’] before the last represents subdocuments. The text field is on the top level on any tweet document, this is the the text that is written in the tweet. There is a user subdocument with a lot of information in there. The code above pulls the screen_name and the user’s given name and content from retweeted tweets. If I were to retweet Barack Obama, you’d be to pull this data about Obama’s tweet from my retweet. I’ve used this to analyze retweet behavior.

Since MongoDB is a database, you are able to query it; you just can’t use SQL. The collection.find() is the method used for querying. Until now I’ve only used empty parameters in the .find() method to return the entire collection. Querying is done in a style similar to JSON.

To find an exact match to a string:

collection.find({'text' : 'This will return tweets with only this exact string.'})

The previous command will find only that exact string in a top level attribute. This isn’t helpful in a practical sense since exact searches aren’t very useful, but it’s the most basic find command. Having MongoDB pull twitter by a given user’s screen_name has some uses, but it is in a a subdocument so it requires some new syntax "document.subdocument":

tweets = collection.find({'user.screen_name' : 'exactScreenName'})

The above code will search the screen_name property in the user subdocument. Since I’ve shown exact searches, you can search for particular words using a regular expressions operator. This will search the text property to see if it can finds ‘word’ anywhere and return the entire tweet.

tweets = collection.find({'text': { '$regex' : 'word'}})

Since in the MongoDB the tweets might not have the same fields or properties, sometimes just searching to see if a property exists is useful. For example, if you wanted to find all the native retweets in your collection the following snippet is will return any tweet with a retweeted_status property. [The retweeted_status is typically a subdocument containing all the information about the retweeted tweet.]

collection.find({"retweeted_status" : { "$exists" : "true"}})

Conclusion

While using MongoDB has a learning curve, it can be rather useful to store data like tweets. It eliminates the need to write a parser since you effectively parse the data when you retrieve it. Knowing the subdocument structure of the documents in your database and thinking like a programming rather than a SQL database user will help you successful execute analyses in Python using MongoDB for Twitter data.