On Twitter, about 10% of general-topic tweets contain emoji characters, the tiny icons and emoticons, which are starting to get more attention when analyzing tweets, Facebook messages, or text messages. An emoji  can capture an emotion or completely change the meaning of the written text. Before exploring how different emojis are used and what they mean to people, I wanted to get an idea of how prevalent they are and which ones are the most popular on Twitter.
How I Did This
I collected tweets using a sampled stream from Twitter. In order to get a general representative sample of tweets, I tracked five popular, basic words: ‘the’, ‘and’, ‘to’, ‘you’, and ‘it’. These words are good search words, since there aren’t many sentences or thoughts that don’t use them. A Python script was used to find and count all the the emojis present in a collection of over 100,000 tweets. To avoid skewing due to a popular celebrity or viral tweet, I removed any retweets which were obvious retweets, and not retweets which function more like mentions.
In the general collection of tweets, I found that 10.23% of tweets contained at least one emoji. So there isn’t an overwhelming number of tweets which contain an emoji, but 10% of Twitter content is a significant portion. The ‘Emoji Selection’ graph shows the percentage of tweets containing that particular emoji out of the tweets that HAD an emoji in it. The most popular emoji by and far was the ‘tears of joy’ emoji followed by the ‘loudly crying’ emoji . Heart-related emoji [the ones I thought would prove most popular] was third and fourth.
Since I only collected these over the course of a day and not over several weeks or months, I would be hesitant to think these results would hold up over time. An event or seasonality can trigger a cascade of people using a certain emoji. For example, the Christmas tree emoji was popular being present in 2.16% of tweets that included emojis; this would be expected to get larger as we get closer to Christmas and smaller after Christmas. Another interesting find is that the emoji ranks high. My pure conjecture is that this emoji’s high use rate is due to protests in Ferguson and around the country. To confirm this I would need a sample of tweets from before the grand jury announcement or track the use as time passes.
Further analysis could utilize emoji groups or clusters. Emojis with similar meanings would not necessarily produce a high number if people spread their selection over 5 emoji instead of one. I plan to update this and expand on this as time passes and I’m able to collect more data.
In order to avoid any conflicts with ASCII conversions that some Python or R packages do on Twitter data, I stored tweets from the Twitter Streaming API directly into a MongoDB database, which encodes strings in UTF-8. Since tweets come from the API as a JSON object, they can be naturally stored in the document-orientated database with each metadata field in the tweet being accessible without parsing the entire tweet into a data frame or SQL database. Retweets were removed by finding any tweets with ‘RT’ in the first two characters of the text entry. This is how Twitter represents automatic retweets in JSON format.
Also since I collected 103,416 tweets the margin of error for any of the proportions given are well below 1%. Events within the social network would definitely outweigh any margin of error.