Emoji, UTF-8, and Python

📅December 5, 2014

I have updated [better] code that allows for easy counting of emoji’s in string objects in Python, it can be found on my GitHub. I have a two counting classes in a mini-package loaded there.

Emoji [], those ubiquitous emoticons that popped up when iPhone users found them in 2011 with iOS 5 are a different set of characters aside from the traditional alphanumeric and punctuation characters. These are essentially another alphabet, and this concept will be useful when using the emoji in Python. Emoji are NOT a font like Wingdings from Windows95, they are unique characters with no corresponding letter or symbol representation. If you have a document or webpage that has the Wingding font, you can simply change the font to a typical Latin font to see the normal characters the Wingding font represents.

Technical Background

Without getting into the technical encoding problems, emoji are defined in Unicode and UTF-8, which can represent just about a million characters. A lot of applications or software packages default to ASCII, which only encodes the typical 128 characters. Some Python IDEs, csv writing packages, or parsing software default to or translate to ASCII, so they don’t necessarily handle the emoji characters properly.

I wrote a Python script [or this Python ‘package’] that takes tweets that are stored in a MongoDB database (more on that later) and counts the number of different emoji in the tweet corpus. To make sure Python plays nice with the emojis, first I loaded in the data by making sure I had UTF-8 encoding specified otherwise you’ll get this encoding error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 0: ordinal not in range(128)

I loaded an emoji key I made using all the emoji’s in Apple’s implementation by loading this code into a Panda’s data frame:

emoji_key = pd.read_csv('emoji_table.txt', encoding='utf-8', index_col=0)

If Python loads you data in correctly with UTF-8 encoding, each emoji will be treated as separate unique character, so string function and regular expressions can be used to find the emoji’s in other strings such as Twitter text. In some IDEs emoji’s don’t display [Canopy] or don’t display well [PyCharm]. I remedied the invisible/messy emoji’s by running the script in Mac OS X’s terminal application, which displays emoji . Python can also produce an ASCII compliant string by using a unicode escape encoding:

unicode_object.encode('unicode_escape')

The escape encoded string will display something like this:

\U0001f604

All IDEs will display the ASCII string. You would need to decode it from the unicode escape to get it back into a unicode object. Ultimately I had a Pandas data frame containing unicode objects. To make sure the correct encoding was used on the output text file, I used the following code:

 with open('emoji_out.csv', 'w') as f: 
    emoji_count.to_csv(f, sep=',', index = False, encoding='utf-8')

Emoji Counter Class

I made an emoji counter class in Python to simplify the process of counting and aggregating emoji counts. The code [socialmediaparse] is on my GitHub along with the necessary emoji data file, so it can load the key when the instance is created. Using the package, you can repeatedly call the add_emoji_count() method to change the internal count for each emoji. The results can be retrieved using the .dict, dict_total, and .baskets attributes of the instance. I wrote this because it organizes and simplifies the analysis for any social media or emoji application. Separate emoji dictionary counter objects can be created for different sets of tweets that someone would want to analyze.

import socialmediaparse as smp #loads the package

counter = smp.EmojiDict() #initializes the EmojiDict class

#goes through list of unicode objects calling the add_emoji_count method for each string
#the method keeps track of the emoji count in the attributes of the instance
for unicode_string in collection:
   counter.add_emoji_count(unicode_string)  

#output of the instance
print counter.dict_total #dict of the absolute total count of the emojis in corpus
print counter.dict       #dict of the count of strings with the emoji in corpus
print counter.baskets    #list of lists, emoji in each string.  one list for each string.

counter.create_csv(file='emoji_out.csv')  #method for creating csv

Project

MongoDB was used for this project because the data stores the JSON files very well, not needing a parser or a csv writer. It also has the advantage of natively storing strings in UTF-8. If I used R’s StreamR csv parser, there would be many encoding errors and virtually no emoji’s present in the data. There might be possible work arounds, but MongoDB was the easiest way I’ve found to work with Twitter JSON, UTF-8 encoded data.