Collecting Twitter Data: Converting Twitter JSON to CSV — UTF-8

Part I: Introduction | Part II: Getting Started | Part III: Using a Python Stream Listener | Part IV: Storing Tweets in MongoDB | Part V: Twitter JSON to CSV — Errors | Part VI: Twitter JSON to CSV — ASCII | Part VII: Twitter JSON to CSV — UTF-8 [current page]


The main drawback of the ASCII CSV parser and the csv library is that they can’t handle unicode characters or objects. I want to be able to make a CSV file that is encoded in UTF-8, so that will have to be done from scratch. The basic structure follows the previous ASCII post, so the description of the JSON Python object can be found in the previous tutorial.

io.open

First, to handle the UTF-8 encoding, I used the io.open function. For the sake of consistency, I used it for both reading the JSON file and writing the CSV file. This doesn’t require much change to the structure of the program, but it’s an important one. json.loads() reads the JSON data and parses it into an object you can access like a Python dictionary.
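As a rough illustration of the reading side, here is a minimal sketch. It assumes the input file holds one JSON tweet per line, and the function names and file path are hypothetical, not taken from the original program.

```python
import io
import json


def parse_tweets(lines):
    """Parse an iterable of text lines, one JSON object per line (assumed layout)."""
    # Skip blank lines so an empty line between tweets does not break json.loads().
    return [json.loads(line) for line in lines if line.strip()]


def load_tweets(path):
    """Open a file as UTF-8 text with io.open and parse every tweet in it."""
    # io.open decodes the bytes on disk, so each line arrives as unicode text.
    with io.open(path, 'r', encoding='utf-8') as in_file:
        return parse_tweets(in_file)
```

Calling load_tweets('tweets.json') (a placeholder file name) would return a list of dictionaries. The output file can be opened the same way, with io.open(path, 'w', encoding='utf-8').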

Unicode Object Instead of List

This program uses the write() method instead of the csv writer’s writerow() method, and write() requires a string (in this case a unicode object) instead of a list, so commas have to be manually inserted into the string to separate the fields properly. For the field names, I just rewrote the line of code to be a unicode string instead of the list used for the ASCII parser. The u'*string*' syntax denotes a unicode string, which behaves similarly to a normal string but is a different type. Using the wrong type of string can cause compatibility issues: in Python 2, a text file opened with io.open accepts only unicode objects, so writing a plain byte string raises a TypeError. The line of code that uses u'\n' creates a new line in the CSV. Once again, this has to be done by hand: the parser needs to insert the newline character itself to start a new line in the CSV file.
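A minimal sketch of writing the header this way follows. The column names are hypothetical stand-ins, so substitute the fields from the ASCII tutorial, and write_header is an illustrative helper rather than a function from the original code.

```python
import io

# Hypothetical column names, already joined with commas into one unicode string.
FIELDS = u'created_at,text,screen_name,followers,friends,rt,fav'


def write_header(out_file):
    """Write the field names followed by a newline to a text-mode file."""
    out_file.write(FIELDS)
    # write() adds nothing on its own, so the line break is written explicitly.
    out_file.write(u'\n')
```

Any file object returned by io.open(path, 'w', encoding='utf-8') can be passed as out_file.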

The for loop and Delimiters

This might be the biggest change relative to the ASCII program. Since this is a CSV parser made from scratch, the delimiters have to be programmed in. In this flavor of CSV, the text field is entirely enclosed in quotation marks (") and commas (,) separate the different fields. To account for quotation marks in the actual text content, each real quotation mark is written as two quotation marks (""). This can give rise to triple quotes, which happen when a quotation mark starts or ends a tweet’s text field.

This parser implements the delimiter requirements of the text field by:

  1. Replacing every quotation mark in the text with two quotation marks.
  2. Adding a quotation mark to the beginning and end of the unicode string.
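The two steps above can be sketched as a small helper. The name quote_field is hypothetical, and the function assumes it receives a unicode string.

```python
def quote_field(text):
    """Escape and wrap a unicode text field for the CSV flavor described above."""
    # Step 1: double every quotation mark that is part of the content.
    escaped = text.replace(u'"', u'""')
    # Step 2: enclose the whole field in quotation marks.
    return u'"' + escaped + u'"'


# A tweet that ends with a quotation mark produces the triple quotes mentioned earlier.
print(quote_field(u'say "hi"'))  # prints "say ""hi"""
```

Because the field is enclosed in quotation marks, commas inside the tweet text no longer split it into extra columns.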

Joining the row list with a comma as the separator is a quick way to build the unicode string for each line of the CSV file.
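A minimal sketch of that join is shown below. The helper name format_row and the sample values are illustrative assumptions, and the text field is assumed to have been quoted already.

```python
def format_row(values):
    """Build one CSV line from a list of field values."""
    # Convert non-text values such as counts to unicode before joining.
    cells = [u'{0}'.format(value) for value in values]
    # A comma separates the fields and an explicit newline ends the row.
    return u','.join(cells) + u'\n'


row = [u'2015-01-01', u'"caf\u00e9, ""ok"""', 42]
line = format_row(row)
```

The resulting line can be passed straight to the write() method of the output file.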

The full code I used in this tutorial can be found on my GitHub.

