Category Archives: twitter

Collecting Twitter Data: Converting Twitter JSON to CSV — UTF-8

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page]


The main drawback to the ASCII CSV parser and the csv library and is that it can’t handle unicode characters or objects. I want to be able to make a csv file that is encoding in UTF-8, so that will have to be done from scratch. The basic structure follows the previous ASCII post so the json Python object description can be found on the previous tutorial.

io.open

First, to handle the UTF-8 encoding, I used the io.open class. For the sake of consistency, I used this class for both reading the JSON file and writing the CSV file. This actually doesn’t require much change to the structure of the program, but it’s an important change. The json.loads() reads the JSON data and parses it into an object you can access like a Python dictionary.

Unicode Object Instead of List

Since this program uses the write() method instead of a csv.writerow() method, and the write() method requires a string or in this case a unicode object instead of a list. Commas have to be manually inserted into the string to properly. For the field names, I just rewrote the line of code to be a unicode string instead of the list used for the ASCII parser. The u'*string*' is the syntax for a unicode string, which behave similarly to normal strings, but they are different. Using the wrong type of string can cause compatibly issues. The line of code that uses the u'\n' creates a new line in the CSV. Once again this is need in this parser needs to insert the new line character to create a new line in the CSV file.

The for loop and Delimiters

This might be the biggest change relative to the ASCII program. Since this is a CSV parser made from scratch, the delimiters have to be programmed in. For this flavor of CSV, it will have the text field entirely enclosed by quotation marks (") and use commas (,) to separate the different fields. To account for the possibility of having quotation marks in the actual text content, any real quotation marks will be designated by double quotes (""). This can give rise to triple quotes, which happens if a quotation mark starts or ends a tweet’s text field.

This parser implements the delimiters requirements of the text fields by

  1. Replacing all quotation marks with double quotes in the text.
  2. Adding quotation marks to the beginning and end of the unicode string

Joining the row list using a comma as a separator is a quick way to write the unicode string for the line of the CSV file.

The full code I used in this tutorial can be found on my GitHub .


Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page]

Collecting Twitter Data: Converting Twitter JSON to CSV — ASCII

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII [current page] | Part VII: Twitter JSON to CSV — UTF-8


I outlined some of the potential hurdles that you have to overcome when converting Twitter JSON data to a CSV file in the previous section. Here I outline a quick Python script that allows you to parse your Twitter JSON file with the csv library. This has the obvious drawback in that it can’t handle the utf-8 encoded characters that can be present in tweets. But this program will produce a CSV file that will work well in Excel or other programs that are limited to ASCII characters.

The JSON File

The first requirement is to have a valid JSON file. This file should contain an array of Twitter JSON objects, or in analogous Python terms a list of Twitter dictionaries. The tutorial for the Python Stream Listener has been updated to make the correctly formatted file to work in Python.

The JSON file is loaded into Python and is automatically parsed into a Python friendly object by the json library using the json.loads() method. This opens and reads the file in as a string in the open() line, then decodes the string into a json Python object which behaves similar to a list of Python dictionaries — one dictionary for each tweet.

The CSV Writer

Before getting too ahead of things, a CSV writer should create a file and write the first row to label the data columns. The open() line creates a file and allows Python to write to it. This is a generic file, so anything could be written to it. The csv.writer() line creates an object which will write CSV formatted text to file we just opened. There are some other parameters you are able to specify, but it defaults to Excel specifications, so it those options can be omitted.

The purpose of this parser is to get some really basic information from the tweets, so it will only get the date and time, text, screen name and the number of followers, friends, retweets and favorites [which are called likes now]. If you wanted to retrieve other information, you’d would create the column names accordingly. the writerow() method writes a list with each element being a value which is separated by the comma in the CSV file.

The json Python object can be used in a for loop to access the individual tweets. From there each line can be accessed to get the different variables we are interested in. I’ve condensed the code so that is all in one statement. Breaking it down the line.get('*attribute*') retrieves the relevant information from the tweet. The line represents an individual tweet.

You might not notice this line, but it’s critical for this program working.

If the encode() method isn’t included, unicode characters (like emojis) are included in their native encoding. This will be sent to the csv.writer object, which can’t handle those characters and fail. This would be necessary for any field that could possibly have a unicode character. I know the other fields I chose cannot have non-ASCII characters, but if you were to add name or description, you’d have to make sure they do not have incompatible characters.

The unicode escape rewrites the unicode as a string of letters and number much like \U0001f35f. These represent the characters and can actually be decoded later.

The full code I used in this tutorial can be found on my GitHub .


Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII [current page] | Part VII: Twitter JSON to CSV — UTF-8

Collecting Twitter Data: Converting Twitter JSON to CSV — Possible Errors

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


ASCII JSON-to-CSV | UTF-8 JSON-to-CSV

Before diving into the problem of how to save tweets in a CSV file, let me say there are a 1,000 ways to do this and about 100 complications that arise depending which way you want to accomplish this. I will devote two posts which covers using both ASCII and UTF-8 encoding because many tweets contain characters beyond the normal Latin alphabet.

Let’s look at some of the issues with writing CSV from tweets.

  • Tweets are JSON and contain a massive amount of metadata. More than you probably want.
  • The JSON isn’t a flat structure; it has levels. [Direct contrast to a CSV file.]
  • The JSON files don’t all have the same elements.
  • There are many foreign languages and emoji used in tweets.
  • Tweets contain many different grammatical marks such as commas and quotation marks.

These issues aren’t incredibly daunting, but those unfamiliar will encounter frustrating errors.

Tweets are JSON and contain a massive amount of metadata. More than you probably want.

I’m always in favor of keeping as much data as possible, but tweets contain a massive amount of different metadata attributes. All of these are designed for the Twitter platform and for the associated client apps. Some items like the profile_background_image_url_https really don’t have much of an impact on any analysis. Choosing which attributes you want to keep will be critical before embarking on a process to parse the data into a CSV. There’s a lot to choose from: timestamp data, user data, retweet data, geocoding data, hashtag data and link data.

The JSON isn’t a flat structure; it has levels.

This issue is an extension of the previous issue, since tweet JSON data isn’t organized into a flat, spreadsheet-like structure. The created_at and text elements are located on the top level and are easy to access, but something as simple as the tweeter’s name and screen_name are located in the user nested object. Like everything else mentioned in this post, this isn’t a huge issue, but the structure of a tweet JSON file has to be considered when coding your program.

The JSON files don’t all have the same elements.

The final problem with JSON files is the fields aren’t necessarily present in every object. Many geo related attributes do not appear unless geotagging is enabled. This means if you write your program to look for geotagging data, it can throw a key error if those keys don’t exist in that specific tweet. To avoid this you have to account for the exception or use a method that already does that. I use the get() method to avoid these key errors in the CSV parser.

There are many foreign languages and emoji used in tweets.

I quickly addressed this issue in a few posts, and it’s one of the reasons why I like to store tweets in MongoDB. Tweets contain a lot of of [read: important] unicode characters. These are typically many foreign language characters and the ubiquitous emojis. This is important because the presence of UTF-8 unicode characters can and will cause encoding errors when parser a file or loading a file into Excel. Excel (at least the version on my computer) can’t handle these characters. Other tools like the built-in CSV writer in Python can’t handle unicode out of box. Being able to deal with these characters is critical to compatibility with other software as long as the integrity of your data.

This issue forces me to write two different parsers for examples. I have a CSV parser that outputs ASCII that imports well into Excel along with a UTF-8 version which allows you to natively save the characters and emojis in a human-readable CSV file.

Tweets contain many different grammatical marks such as commas and quotation marks.

This is a problem that I had when I first started working with Twitter data and tried to write my own parser — characters that are part of your text content sometimes get confused with the delimiters. In this case I’m talking about quotation marks (") and commas (,). Comma sseparate the values for each ‘cell’, hence the acronym CSV. If you tweet you’ve probably tweeted using one of these characters. I’ve stripped them out of the text previously to solve this problem, but that’s not a great solution. The way Excel handles this is to enclose any elements that contain commas with quotation marks then to use double quotation marks to signify an actual quotation mark and not enclosed text. This will be demonstrated in the UTF-8 parser since I made that from scratch.


Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

emoji

Emoji iOS 9.1 Update — The Taco Emoji Analysis

Before I get too far I don’t actually analysis taco emojis. At least not yet. I, however, give you the tools to start parsing them from tweets, text or anything you can get into Python.

This past month Apple released their iOS 9.1 and their latest OS X 10.11.1 El Capitan update. That updated included a bunch of new emojis. I’ve made a quick primer on how to handle emoji analysis in Python. Then when Apple released an update to their emojis to include the diversity, I updated my small Python class for emoji counting to include to the newest emojis. I also looked at what is actually happening with the unicode when diversity modifier patches are used.

Click for Updated socialmediaparse Library

With this latest update, Apple and the Unicode Consortium didn’t really introduce any new concepts, but I did update the Python class to include the newest emojis. In my GitHub the data folder includes a text file with all the emojis delimitated by ‘\n’. The class uses this file to find any emoji’s in a unicode string which has been passed to the add_emoji_count() method.

Building off of the diversity emoji update, I added a skin_tone_dict property of the EmojiDict class. This property returns a dictionary with the number of unique human emojis per tweet and their skin tones. This property will not catch multiple human emojis written if they in the same execution of the add_emoji_count() method

Above is an example of how to use the new attribute. It is a dictionary so you can work that into your analysis however you like. I will eventually create better methods and outputs to make this feature more robust and useful.

The full code / class I used in this post can be found on my GitHub .

Baseball Twitter Roller Coaster

Because Twitter is fun and so are graphs, I have tweet volume graphs from my Twitter scraper that collects tweets with the team-specific nicknames and Twitter handles. After a trade (or non-trade), the data can be collected and a graphical picture of the reaction can be produced. The graph represents the volume of sampled tweets that contained the specific keywords: Mets, Gomez, Flores and Hamels. “Tears” is a collection of any tweets which mentioned either “tears” or “crying”, since Flores was in tears as he took the field.

Here are the reactions to the Gomez non-trade and Hamels trade last night:

Mets

And here’s the timeline of necessary tweets:
[All times are EDT.]


July 29
9:00 PM


9:45 PM


9:54 PM



10:15 PM



10:55 PM



July 30
12:13 AM

Some of the times were rounded if there wasn’t a clear single tweet that caused the peak on Twitter.

Twitter Sentiment — Penguins VS. Rangers Gm 4

Game 4 of the Penguins-Rangers series featured a brief overtime period that overshadowed the rest of the game as far as tweet volume goes. Rangers fans were more negative at the beginning of the game after the Penguins scored their first goal. Twitter volume picked up for both teams during the overtime period and Rangers fans’ tweets spiked when they won the game and continued throughout the night.

Gm4 Sentiment

Penguins Rangers 2015 Game 3 Sentiment

Twitter Sentiment — Penguins vs. Rangers Gm 3

Unfortunately, my Twitter scraper wasn’t looking for the most viral story of the Penguin’s loss to the Rangers. [I’m not linking to it, but it involves a columnist and the Penguins’ GM.] I was able to get general sentiment over the course of the game. There isn’t too much to analyze. There are more Rangers tweets overall, most likely due to increased interest and a larger market, and I’ve annotated where there was scoring during the game. I’ll probably have a few more updates through out the playoffs.

Penguins Rangers 2015 Game 3 Sentiment

Using New, Diverse Emojis for Analysis in Python

I haven’t been updating this site often since I’ve started to perform a similar job over at FanGraphs. All non-baseball stat work that I do will continued to be housed here.

Over the past week, Apple has implemented new emojis with a focus on diversity in their iOS 8.3 and the OS X 10.10.3 update. I’ve written quite a bit about the underpinnings of emojis and how to get Python to run text analytics on them. The new emojis provide another opportunity to gain insights on how people interact, feel, or use them. Like always, I prefer to use Python for any web scraping or data processing, and emoji processing is no exception. I already wrote a basic primer on how to get Python to find emoji in your text. If you combine the tutorials I have for tweet scraping, MongoDB and emoji analysis, you have yourself a really nice suite of data analysis.

Modifier Patch

These new emojis are a product of the Unicode Consortium’s plan for how to incorporate racial diversity into the previously all-white human emoji line up. (And yes, there’s a consortium for emoji planning.) The method used to produce new emojis isn’t quite as simple as just making a new character/emoji. Instead, they decided to include a modifier patch at the end of human emojis to indicate skin color. As a end-user, this won’t affect you if you have all the software updates and your device can render the new emojis. However, if you don’t have the updates, you’ll get something that looks like this:

Emoji Patch Error
That box at the end of the emoji is the modifier patch. Essentially what is happening here is that there is a default emoji (in this case it’s the old man) and the modifier patch (the box). For older systems it doesn’t display, because the old system doesn’t know how to interpret this new data. This method actually allows the emojis to be backwards compatible, since it still conveys at least part of the meaning of the emoji. If you have the new updates, you will see the top row of emoji.

Emoji Plus Modifier Patches

Using a little manipulation (copying and pasting) using my newly updated iPhone we can figure out this is what really is going on for emojis. There are five skin color patches available to be added to each emoji, which is demonstrated on the bottom row of emoji. Now you might notice there are a lot of yellow emoji. Yellow emojis (Simpsons) are now the default. This is so that no single real skin tone is the default. The yellow emojis have no modifier patch attached to them, so if you simply upgrade your phone and computer and then go back and look at old texts, all the emojis with people in them are now yellow.

New Families

The new emoji update also includes new families. These are also a little different, since they are essentially combinations of other emoji. The original family emoji is one single emoji, but the new families with multiple children and various combinations of children and partners contain multiple emojis. The graphic below demonstrates this.

Emoji New Familes

The man, woman, girl and boy emoji are combined to form that specific family emoji. I’ve seen criticisms about the families not being multiracial. I’d have to believe the limitation here is a technical one, since I don’t believe the Unicode consortium has an effective method to apply modifier patches and combine multiple emojis at once. That would result in a unmanageable number of glyphs in the font set to represent the characters. (625 different combinations for just one given family of 4, and there are many different families with different gender iterations.)

New Analysis

So now that we have the background on the how the new emojis work, we can update how we’ve searched and analyzed them. I have updated my emoji .csv file, so that anyone can download that and run a basic search within your text corpus. I have also updated my github to have this file as well for the socialmediaparse library I built.

The modifier patches are searchable, so now you can search for certain swatches (or lack there of). Below I have written out the unicode escape output for the default (yellow) man emoji and its light-skinned variation. The emoji with a human skin color has that extra piece of code at the end.

Here are all the modifier patches as unicode escape.

Emoji Modifier Patches

The easiest way to search for these is to use the following snippet of code:

You can throw that snippet into a for loop for a Pandas data frame or a MongoDB cursor. I’m planning on updating my socialmediaparse library with patch searching, and I’ll update this post when I do that.

Spock

Finally, there’s Spock!

Emoji Spock

The unicode escape for Spock is:

Add your modifier patches as needed.

Collecting Twitter Data: Storing Tweets in MongoDB

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB [current page] | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


In the first three sections of the Twitter data collection tutorial, I demonstrated how to collect tweets using both R and Python and how to store these tweets first as JSON files then having R parse them into a .csv file. The .csv file works well, but tweets don’t always make good flat .csv files, since not every tweet contains the same fields or the same structure. Some of the data is well nested into the JSON object. It is possible to write a parser that has a field for each possible subfield, but this might take a while to write and will create a rather large .csv file or SQL database.

MongoDB

Fortunately, NoSQL databases like MongoDB exist and it greatly simplifies tweet storage, search, and recall eliminating the need of a tweet parser. Installation and setup of MongoDB and the pymongo library is beyond the scope of this tutorial, but I can quickly explain what MongoDB does. It is a document-based database that uses documents instead of tuples in tables to store data. These documents look like just like JSON objects using key-value pairs, but they are called BSON [since it’s stored as binary]. From a programming prospective, they have similar properties as both JS objects and Python dictionaries.

Since JSON and BSON are so similar, storing a tweet in a MongoDB database is as easy as putting the entire content of the tweet’s JSON string into an insert statement. Recalling or searching the tweets is rather simple as well; it does require an OOP mindset over the traditional SQL command structure.

[I’m writing this from the perspective from ad hoc small-scale research. There might be performance issues that make other storage options much more desirable. Knowing the specific metadata from a tweet you want to keep will make any analysis faster or require less store space. MongoDB allows you to store all the information the API returns to you.]

Storing Tweets in MongoDB

I am going to assume that you have MongoDB running on your local computer for all the code examples.

Storing tweets is rather simple if you already have the Python stream listener built from Part III of the tutorial, since there are only a few changes to be made to the code. The first change will be calling the libraries: pymongo and json. The json library is available by default in Python, but you’ll have to install pymongo using pip install pymongo if you have the pip installer. The bulk of the changes will be in the listener child class.

The major change in the code includes:

MongoClient creates the MongoClient instance which will be used to interface with the database. The client[‘twitter_db’] call designates the database that is going to be used, and the db[‘twitter_collection’] call selects the collection where the documents will be stored. The json.loads() call converts the string returned from the Twitter API into a json object in Python. Finally, the collection.insert() call inserts the json object into the MongoDB database. From this rather simple change to the Python stream listener all the tweets can be saved into a MongoDB database.

Recalling Tweets from MongoDB

Recalling the tweets from MongoDB database is not too difficult if you understand the basics of Python for loops and dictionaries. The function to retrieve any documents from the database is collection.find(). You are able to specify what you want to find or leave it blank and get all the documents (tweets) returned. For this example, I’ll first just leave it blank to get all the tweets.

After calling the .find() method, Python will return a MongoDB cursor, which can be iterated through in a for loop. The for loop runs the loop for each object in the iterator. If you wanted to print the text from every tweet you would write:

tweet contains one document [or in this case a tweet JSON object] in the sequence that tweet_iterator produces. The loop will change this document to another one until the for loop runs through every document in the iterator.

Since tweets in JSON format contain many subdocuments, it’s important to know what data you are looking and where to find it. The following code snippet is an example of different fields available to examine.

The last [‘field’] represents a property and any [‘fields’] before the last represents subdocuments. The text field is on the top level on any tweet document, this is the the text that is written in the tweet. There is a user subdocument with a lot of information in there. The code above pulls the screen_name and the user’s given name and content from retweeted tweets. If I were to retweet Barack Obama, you’d be to pull this data about Obama’s tweet from my retweet. I’ve used this to analyze retweet behavior.

Since MongoDB is a database, you are able to query it; you just can’t use SQL. The collection.find() is the method used for querying. Until now I’ve only used empty parameters in the .find() method to return the entire collection. Querying is done in a style similar to JSON.

To find an exact match to a string:

The previous command will find only that exact string in a top level attribute. This isn’t helpful in a practical sense since exact searches aren’t very useful, but it’s the most basic find command. Having MongoDB pull twitter by a given user’s screen_name has some uses, but it is in a a subdocument so it requires some new syntax "document.subdocument":

The above code will search the screen_name property in the user subdocument. Since I’ve shown exact searches, you can search for particular words using a regular expressions operator. This will search the text property to see if it can finds ‘word’ anywhere and return the entire tweet.

Since in the MongoDB the tweets might not have the same fields or properties, sometimes just searching to see if a property exists is useful. For example, if you wanted to find all the native retweets in your collection the following snippet is will return any tweet with a retweeted_status property. [The retweeted_status is typically a subdocument containing all the information about the retweeted tweet.]

Conclusion

While using MongoDB has a learning curve, it can be rather useful to store data like tweets. It eliminates the need to write a parser since you effectively parse the data when you retrieve it. Knowing the subdocument structure of the documents in your database and thinking like a programming rather than a SQL database user will help you successful execute analyses in Python using MongoDB for Twitter data.


Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB [current page] | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8

Collecting Twitter Data: Using a Python Stream Listener

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


I use the term stream listener [2 words] to refer to program build with this code and StreamListener [1 word] to refer to the specific class from the tweepy package. The two are related but not the same. The StreamListener class makes the stream listener program what it is, but the program entails more than the class.

While using R and its streamR package to scrape Twitter data works well, Python allows more customization than R does. It also has a steeper learning curve, because the coding is more invovled. Before using Python to scrape Twitter data, a software package like tweepy must be installed. If you have the pip installer installed on your system, the installation procedure is rather easy and executed in the Terminal.

Call Tweepy Library

Terminal:

After the software package is installed, you can start writing a stream listener script. First, the libraries have to be imported.

The three tweepy class imports will be used to construct the stream listener, the time library will be used create a time-out feature for the script, and the os library will be used to set your working directory.

Set Variables Values

Before diving into constructing the stream listener, let’s set some variables. These variables will be used in the stream listener by being feed into the tweepy objects. I code them as variables instead of directly into the functions so that they can be easily changed.

Using and Modifying the Tweepy Classes

I believe that tweet scraping with Python has a steeper learner curve than with R, because Python is dependent on combining instances of different classes. If you don’t understand the basics of object-oriented programming, it might be difficult to comprehend what the code is accomplishing or how to manipulate the code. The code I show in this post does the following:

  • Creates an OAuthHandler instance to handle OAuth credentials
  • Creates a listener instance with a start time and time limit parameters passed to it
  • Creates an StreamListener instance with the OAuthHandler instance and the listener instance

Before these instances are created, we have to “modify” the StreamListener class by creating a child class to output the data into a .csv file.

This is the most complicated section of this code. The code rewrite the actions taken when the StreamListener instance receives data [the tweet JSON].

This block of code opens an output file, writes the opening square bracket, writes the JSON data as text separated by commas, then inserts a closing square bracket, and closes the document. This is the standard JSON format with each Twitter object acting as an element in a JavaScript array. If you bring this into R or Python built-in parser and the json library can properly handle it.

This section can be modified to or modify the JSON file. For example you can place other properties/fields like a UNIX time stamp or a random variable into the JSON. You can also modified the output file or eliminate the need for a .csv file and insert the tweet directly into a MongoDB database. As it is written, this will produce a file that can be parsed by Python’s json class.
After the child class is created we can create the instances and start the stream listener.

Here the OAuthHandler uses your API keys [consumer key & consumer secret key] to create the auth object. The access token, which is unique to an individual user [not an application], is set in the following line. Unlike the filterStream() function in R, this will take all four of your credentials from the Twitter Dev site. The modified StreamListener class simply called listener is used to create an listener instance. This contains the information about what to do with the data once it comes back from the Twitter API call. Both the listener and auth instances are used to create the Stream instance which combines the authentication credentials with the instructions on what to do with the retrieved data. The Stream class also contains a method for filtering the Twitter Stream. This method works just like the R filterStream() function taking similar parameters, because the parameters are passed to the Stream API call.

Python vs R

At this stage in the tutorial, I would recommend parsing this data using the parser in R from the last section of the Twitter tutorial or creating your own. Since it’s easier to customize the StreamListener methods in Python, I prefer to use it over other R. Generally, I think Python works better for collecting and processing data, but isn’t as easy to use for most statistical analysis. Since tweet scraping would fall into the data collection category, I like Python. It becomes easier to access databases and to manipulate the data when you are already working in Python.

11-10-2015 — I’ve updated the StreamListener to output properly formatted JSON. The old script which works well with R’s tweetParse is still available on my GitHub.

 


 

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener [current page] | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8