Collecting Twitter Data: Converting Twitter JSON to CSV — UTF-8

📅November 11, 2015

The main drawback to the ASCII CSV parser and the csv library and is that it can’t handle unicode characters or objects. I want to be able to make a csv file that is encoding in UTF-8, so that will have to be done from scratch. The basic structure follows the previous ASCII post so the json Python object description can be found on the previous tutorial.

io.open

First, to handle the UTF-8 encoding, I used the io.open class. For the sake of consistency, I used this class for both reading the JSON file and writing the CSV file. This actually doesn’t require much change to the structure of the program, but it’s an important change. The json.loads() reads the JSON data and parses it into an object you can access like a Python dictionary.

import json
import csv
import io

data_json = io.open('raw_tweets.json', mode='r', encoding='utf-8').read() #reads in the JSON file
data_python = json.loads(data_json)

csv_out = io.open('tweets_out_utf8.csv', mode='w', encoding='utf-8') #opens csv file

Unicode Object Instead of List

Since this program uses the write() method instead of a csv.writerow() method, and the write() method requires a string or in this case a unicode object instead of a list. Commas have to be manually inserted into the string to properly. For the field names, I just rewrote the line of code to be a unicode string instead of the list used for the ASCII parser. The u'*string*' is the syntax for a unicode string, which behave similarly to normal strings, but they are different. Using the wrong type of string can cause compatibly issues. The line of code that uses the u'\n' creates a new line in the CSV. Once again this is need in this parser needs to insert the new line character to create a new line in the CSV file.

fields = u'created_at,text,screen_name,followers,friends,rt,fav' #field names
csv_out.write(fields)
csv_out.write(u'\n')

The for loop and Delimiters

This might be the biggest change relative to the ASCII program. Since this is a CSV parser made from scratch, the delimiters have to be programmed in. For this flavor of CSV, it will have the text field entirely enclosed by quotation marks (") and use commas (,) to separate the different fields. To account for the possibility of having quotation marks in the actual text content, any real quotation marks will be designated by double quotes (""). This can give rise to triple quotes, which happens if a quotation mark starts or ends a tweet’s text field.

for line in data_python:

    #writes a row and gets the fields from the json object
    #screen_name and followers/friends are found on the second level hence two get methods
    row = [line.get('created_at'),
           '"' + line.get('text').replace('"','""') + '"', #creates double quotes
           line.get('user').get('screen_name'),
           unicode(line.get('user').get('followers_count')),
           unicode(line.get('user').get('friends_count')),
           unicode(line.get('retweet_count')),
           unicode(line.get('favorite_count'))]

    row_joined = u','.join(row)
    csv_out.write(row_joined)
    csv_out.write(u'\n')


csv_out.close()

This parser implements the delimiters requirements of the text fields by

Replacing all quotation marks with double quotes in the text.
Adding quotation marks to the beginning and end of the unicode string

'"' + line.get('text').replace('"','""') + '"', #creates double quotes

Joining the row list using a comma as a separator is a quick way to write the unicode string for the line of the CSV file.

row_joined = u','.join(row)

The full code I used in this tutorial can be found on my GitHub .