Collecting Twitter Data: Converting Twitter JSON to CSV — Possible Errors

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors [current page] | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8


ASCII JSON-to-CSV | UTF-8 JSON-to-CSV

Before diving into how to save tweets in a CSV file, let me say there are a thousand ways to do this and about a hundred complications that can arise depending on which way you choose. I'll devote two posts to it, covering both ASCII and UTF-8 encoding, because many tweets contain characters beyond the basic Latin alphabet.

Let’s look at some of the issues with writing CSV from tweets.

  • Tweets are JSON and contain a massive amount of metadata. More than you probably want.
  • The JSON isn’t a flat structure; it has levels. [Direct contrast to a CSV file.]
  • The JSON files don’t all have the same elements.
  • There are many foreign languages and emoji used in tweets.
  • Tweets contain many different grammatical marks such as commas and quotation marks.

These issues aren’t incredibly daunting, but anyone unfamiliar with them will run into frustrating errors.

Tweets are JSON and contain a massive amount of metadata. More than you probably want.

I’m always in favor of keeping as much data as possible, but tweets contain a massive number of metadata attributes. All of these are designed for the Twitter platform and its associated client apps, and some, like profile_background_image_url_https, really don’t have much impact on any analysis. Choosing which attributes to keep is critical before you start parsing the data into a CSV. There’s a lot to choose from: timestamp data, user data, retweet data, geocoding data, hashtag data, and link data.
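To give a rough idea of what that trimming looks like, here’s a minimal sketch (the attribute names are real tweet fields, but the selection itself is just an illustration, not the parser from the later posts):

```python
import json

# A heavily trimmed example tweet; a real one carries dozens more attributes.
raw = '''{
  "created_at": "Mon May 04 18:00:00 +0000 2015",
  "text": "Example tweet",
  "retweet_count": 2,
  "user": {"screen_name": "example",
           "profile_background_image_url_https": "https://example.com/bg.png"}
}'''

tweet = json.loads(raw)

# Keep only the columns worth analyzing; everything else is dropped.
row = {
    "created_at": tweet["created_at"],
    "text": tweet["text"],
    "retweet_count": tweet["retweet_count"],
}
print(row)
```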

The JSON isn’t a flat structure; it has levels.

This issue is an extension of the previous one, since tweet JSON data isn’t organized into a flat, spreadsheet-like structure. The created_at and text elements sit on the top level and are easy to access, but something as simple as the tweeter’s name and screen_name live in the nested user object. Like everything else mentioned in this post, this isn’t a huge issue, but the structure of the tweet JSON has to be considered when coding your parser.
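For example (using a stripped-down stand-in for a parsed tweet):

```python
# A stripped-down stand-in for a parsed tweet; real objects are far larger.
tweet = {
    "created_at": "Mon May 04 18:00:00 +0000 2015",
    "text": "Example tweet",
    "user": {"name": "Example User", "screen_name": "example"},
}

created_at = tweet["created_at"]            # top level: one lookup
screen_name = tweet["user"]["screen_name"]  # nested: go through the user object
print(created_at, screen_name)
```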

The JSON files don’t all have the same elements.

The final problem with the JSON files is that fields aren’t necessarily present in every object. Many geo-related attributes don’t appear unless geotagging is enabled, so if your program looks for geotagging data, it can raise a KeyError when those keys don’t exist in a particular tweet. To avoid this you have to account for the exception or use a method that already does; I use the get() method in the CSV parser to sidestep these key errors.
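The idiom looks like this (a general sketch of the get() pattern, not the exact code from the parser posts):

```python
# A tweet without geotagging: "coordinates" is null and "place" is absent.
tweet = {"text": "Example tweet", "coordinates": None}

# tweet["coordinates"]["coordinates"] would blow up here; get() plus an
# `or {}` fallback returns a harmless default instead of raising.
coords = (tweet.get("coordinates") or {}).get("coordinates", "")
place = tweet.get("place") or ""  # key missing entirely: get() returns None
print(repr(coords), repr(place))
```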

There are many foreign languages and emoji used in tweets.

I’ve touched on this issue in a few posts, and it’s one of the reasons I like to store tweets in MongoDB. Tweets contain a lot of [read: important] Unicode characters, typically foreign-language characters and the ubiquitous emoji. This matters because UTF-8 Unicode characters can and will cause encoding errors when parsing a file or loading it into Excel. Excel (at least the version on my computer) can’t handle these characters, and Python’s built-in CSV writer can’t handle Unicode out of the box. Being able to deal with these characters is critical both for compatibility with other software and for the integrity of your data.

This issue forced me to write two different parsers for the examples: a CSV parser that outputs ASCII and imports cleanly into Excel, and a UTF-8 version that natively preserves the foreign characters and emoji in a human-readable CSV file.
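To make the difference concrete, here’s a small Python 3 sketch of the two routes (details differ in older Python versions, so treat this as an illustration rather than the parsers themselves):

```python
text = "Caf\u00e9 au lait \U0001F600"  # an accented character plus an emoji

# ASCII route: silently drop anything outside ASCII so Excel stays happy.
ascii_text = text.encode("ascii", "ignore").decode("ascii")
print(ascii_text)  # 'Caf au lait ' -- the accent and the emoji are gone

# UTF-8 route: write the characters out intact.
with open("tweets.csv", "w", encoding="utf-8") as f:
    f.write(text + "\n")
```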

Tweets contain many different grammatical marks such as commas and quotation marks.

This is a problem I ran into when I first started working with Twitter data and tried to write my own parser: characters that are part of the text content get confused with the delimiters. In this case I’m talking about quotation marks (") and commas (,). Commas separate the values for each ‘cell’, hence the name comma-separated values. If you tweet, you’ve probably tweeted using one of these characters. I used to strip them out of the text to solve this problem, but that’s not a great solution. The way Excel handles it is to enclose any field that contains a comma in quotation marks, then use double quotation marks to signify an actual quotation mark rather than enclosed text. This will be demonstrated in the UTF-8 parser, since I made that from scratch.
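Python’s built-in csv module follows this same Excel convention, as a quick sketch shows (my own illustration, not the Part VII parser):

```python
import csv
import io

row = ['She said, "hello" to me', "plain text"]

buf = io.StringIO()
csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n").writerow(row)
print(buf.getvalue())
# "She said, ""hello"" to me",plain text
# The first field gets wrapped in quotes because it contains a comma,
# and its embedded quotation marks are doubled.
```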

