Twitter Sentiment Analysis
This is a summary of a final project I did from my Introduction to Data Mining class at NU. The goal of the project was to find a business need and execute a data mining process. The general process I used is outlined here and the sentiment lexicon is found here. The lexicon is from a paper: Minqing Hu and Bing Liu. “Mining and Summarizing Customer Reviews.” Proceedings of the ACM SIGKDD International Conference on Knowledge.Discovery and Data Mining (KDD-2004), Aug 22-25, 2004, Seattle, Washington, USA.
My experiences using social media, the business-centric focus of my grad classes, and my love for burritos inspired me to look into Twitter sentiment analysis. [I also needed to research something that wasn’t baseball.] Imagine every time you’ve misinterpreted a text message from your friend. Or every time an irate Twitter follower takes a sarcastic tweet seriously. That’s how hard it is for normal people to correctly interpret sentiment of written communication. So now picture trying to get a computer to do the same thing. Not easy. But at the very least we can find a way to categorize more tweets correctly than we misidentify.
On August 25th & 26th, I scraped tweets containing Twitter handles of some companies:
I choose these based on my personal preferences or companies that I thought might have strong sentiment. The tweets were scraped using R and the package streamR. [These are pretty easy to use, if anyone wants to start doing any Twitter research, I’d start here.] The tweets are saved as JSON files, which are a mess, but human-readable. R parses the tweets into a data frame then that can go into a SQL database if so desired.
The way I determined a tweet’s sentiment was by matching words in the tweet to a predetermined sentiment lexicon. This process is outlined in this presentation, if you are curious about how to execute it. The algorithm is simple enough to write with one FOR loop. The toughest problem I had was dealing with the tweet’s data object. The scoring system was simple, each word from a tweet that matched to the lexicon list will count as a +1 [positive] or -1 [negative]. The words are added together to give a sentiment score. The sentiment score can be interpreted like algebra [0 is neutral, greater than 0 is positive, and less than 0 is negative].
I grouped the results by company:
The bar graph sums the sentiment scores of every tweet I captured. So for example, two negative words in one tweet will drop the company’s total score 2 points. No surprises here that Comcast is in last place. I’ve never heard anyone say anything nice about Comcast, and apparently Twitter is not much kinder. News about the Burger King-Tim Horton’s merger broke while I was capturing tweets, so that accounts for the great discontent about Burger King. I’m not quite sure what BK’s baseline should be since this is the only data I have on hand. Verizon has very positive scores. I am a little skeptical of this, because Verizon (and Apple) had a lot tweets that were ad-based. While it’s great to know who is advertising your product, it’s not the goal of this particular project. Ads and news stories gets retweeted and regurgitated a lot, but the tweets from real customers don’t.
One way to account for large volume of tweets is to look at average sentiment per tweet:
This graph takes the total sentiment score and divides it by the total number of tweets mentioning that company. This will give more weight to a company’s Twitter-customer base that’s strongly opinionated one way or another. To no one’s surprise, Comcast is dead last again. But to my delight, Chipotle ranks first! [I love burritos…and apparently a lot of Twitter users do too.] Chipotle is a good example of a company that does not have the volume that Comcast or Verizon has, but their users feel strongly about their product and tweet positively about it.
Luckily, there is a bunch of metadata available with the tweets including my favorite variable, a timestamp. First, here’s a time series baseline for August 25, 2014:
The volume of tweets increased through out the day, and peaked around lunch in EDT. Let’s look at Chipotle’s time series graph broken up by sentiment classification:
Neutral and positive tweets peaked during the 1PM EDT hour, with no corresponding spike in negative tweets. This is very encouraging for Chipotle since you would expect the negative tweets to follow the same pattern, and they don’t. They actually rise later in the day. Further research would have to be done to determine if this trend is real and what the source of it is. It could be a time zone delayed problem, general staffing/production issues in non-lunch hours, selection bias of people who go to a later lunch, or a random fluctuation that happened that day.
Comcast also has some interesting patterns:
There are two spikes in neutral tweet volume early in the day. I think these are results of mentions in a news related tweet that was retweeted a lot. The large spike at 2PM EDT is probably also caused by retweets as well. However, during the early afternoon, there are distinctive negative customer tweets accounting for the surge in negative sentiment after lunch. My conjecture for this surge would be an increase of people dealing with Comcast’s customer service. It would be interesting to see if call center data matched up.
This is just the most basic implementation of sentiment analysis. There are more advanced machine learning techniques that can weight words differently and looks at consecutive word groups (n-grams) in addition to individual words. The advantage of this can be seen easily. The phrase ‘not good’ is negative, but only looking at the constituent words it would be scored neutral. There is a lot of other processes I can try to get more accurate results, but unfortunately, not in this post.