Collecting Twitter Data: Getting Started

📅January 24, 2015

The R code used in this post can be found on my git-hub.

After getting R, Python or whatever programming language you prefer, the next steps will require API keys from Twitter. This requires you have to have a Twitter account and to create an ‘application’ using the following steps.

Getting API Keys

Sign into Twitter
Go to https://apps.twitter.com/app/new and create a new application
Click on “Keys and Access Tokens” on the your application’s page
Get and copy your Consumer Key, Consumer Secret Key, Access Token, and Secret Token

Those four complex strings of case-sensitive letters and numbers are your API keys. Keep them secret, because they are more powerful than your Twitter password. If you are wondering what the keys are for, they are really two pairs of keys consisting of secret and non-secret, and this is done for security purposes. The consumer key pair authorizes your program to use the Twitter API, and the access token essentially signs you in as your specific Twitter user account. This framework makes more sense in the context of third party Twitter developers like TweetDeck where the application is making API calls but it needs access to each user’s personal data to write tweets, access their timelines, etc.

Getting Started in R

If you don’t have a preference for a certain programming environment, I recommend that people with less programming experience start with R for tweet scraping since it is simpler to collect and parse the data without having to understand much programming. The Streaming API authentication I use in R is slightly more complicated than what I normally do with Python. If you feel comfortable with Python, I recommend using the tweepy package for Python. It’s more robust than R’s streamR but has a steeper learning curve.

First, like most R scripts, the libraries need to be installed and called. Hopefully you already installed if not the install.packages commands are commented out for reference.

#install.packages("streamR")
#install.packages("ROAuth")
library(ROAuth)
library(streamR)

The first part of the actually code for a Twitter scraper will use the API keys obtained from Twitter’s development website. You insert your personal API keys where the **KEY** is in the code. For this method of authentication in R it only uses the CONSUMER KEY and CONSUMER SECRET KEY and it gets your ACCESS TOKEN from a PIN number from using an web address you open in your browser.

#create your OAuth credential
credential <- OAuthFactory$new(consumerKey='**CONSUMER KEY**',
                         consumerSecret='**CONSUMER SECRETY KEY**',
                         requestURL='https://api.twitter.com/oauth/request_token',
                         accessURL='https://api.twitter.com/oauth/access_token',
                         authURL='https://api.twitter.com/oauth/authorize')

#authentication process
options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
download.file(url="http://curl.haxx.se/ca/cacert.pem", destfile="cacert.pem")
credential$handshake(cainfo="cacert.pem")

After this is executed properly, R will give you output in your console that looks like the following:

Copy the https:// URL into a browser
Log into Twitter if you haven't already
Authorize the application
Then you'll get the PIN number to copy into the R console and hit Enter

Now that the authentication handshake was completed, the R program is able to use those credentials to make API calls. A basic call using the Streaming API is the filterStream() function in the streamR package. This will connected you to Twitter's stream for a designated amount of time and/or for a certain number of tweets collected.

#function to actually scrape Twitter
filterStream( file.name="tweets_test.json",
             track="twitter", tweets=1000, oauth=cred, timeout=10, lang='en' )

The track parameter tells Twitter want you want to 'search' for. It's technically not really a search since you are filtering the Twitter stream and not searching...technically. Twitter's dev site has a nice explanation of all the Streaming APIs parameters. For example, the track parameter is not case sensitive, it will treat hashtags and regular words the same, and it will find tweets with any of the words you specify, not just when all the words are present. The track parameter 'apple, twitter' will find tweets with 'apple', tweets with 'twitter', and tweets with both.

The filterStream() function will stay open as long as you tell it to in the timeout parameter [in seconds], so don't set it too long if you want your data quickly. The data Twitter returns to you is a .json file, which is a JavaScript data file.

The above is an excerpt from a tweet that's been formatted to be easier to read. Here's a larger annotated version of a tweet JSON file. Thes are useful in some contexts of programming, but for basic use in R, Tableau, and Excel it's gibberish.

There are a few different ways to parse the data into something useful. The most basic [and easiest] is to use the parseTweets() function that is also in streamR.

#Parses the tweets
tweet_df <- parseTweets(tweets='tweets_test.json')

This is a pretty simple function that takes the JSON file that filterStream() produced, reads it, and creates a wide data frame. The data frame can be pretty daunting, since there is so much metadata available.

You might notice some of the ?-mark characters. These are text encoding errors. This is one of the limitations of using R to parse the tweets, because the streamR package doesn't handle utf-8 characters well in its functions. This means that R can only read basic A-Z characters and can't translate emoji, foreign languages, and some punctuation. I'd recommend using something like MongoDB to store tweets or create your own parser if you want be able to use these features of the text.

Quick Analysis

This tutorial focuses on how to collect Twitter data and not the intricacies of analyzing it, but here are a few simple examples of how you can use the tweet data frame.

#using the Twitter data frame
tweet_df$created_at
tweet_df$text


plot(tweet_df$friends_count, tweet_df$followers_count) #plots scatterplot
cor(tweet_df$friends_count, tweet_df$followers_count) #returns the correlation coefficient

The different columns within the data frame can be called separately. Calling the created_at field gives you the tweet's time stamp, and the text field is the content of the tweet. Generally, there will be some correlation between the number of followers a person has [followers_count] and the number of accounts a person follows [friends_count]. When I ran my script I got a correlation of about 0.25. The scatter plot will be heavily impacted by the Justin Biebers of the world where they have millions of followers but follow only a few themselves.

Conclusion

This is quick-start tutorial to collecting Twitter data. There are plenty of resources to be found on Twitter's developer site and all over the internet. While this tutorial is useful to learn the basics of how the OAuth process works and how Twitter returns data, I recommend using a tool like Python and MongoDB which can give you greater flexibility for analysis. Collecting tweets is the foundation of using Twitter's API, but you can also get user objects, trends, or accomplish anything that you can in a Twitter client with the REST and Search APIs.

The R code used in this post can be found on my git-hub.