Basic Text Analytics for News Bias
Every news outlet is biased in some form, beyond the well-debated political slants that Fox News and MSNBC are renowned for. I’ve been attempting to quantify those biases using text analytics. By looking at the frequency and topics of articles, word choices, and associated words, I believe you can find analytical evidence of how different news outlets communicate their news.
My first attempt takes a simple approach: measure and compare the frequency of specific key terms. I used two current topics, Ebola and the midterm election, both of which should show some polarization. For context on the news content, the data was collected toward the tail end of the quarantine news cycle, when there was political debate over how to handle health-care workers returning to the United States. Oversimplifying, conservatives favor hardline precautions like quarantine, while liberals generally favor the present policy of self-monitoring. The election articles come from the weekend before a midterm election in which Republicans are favored in the polls to take control of the Senate.
All the articles were gathered by scraping Google search results for ‘ebola+[news outlet]’ or ‘election+[news outlet]’ with a Python script, so the data reflects recent news articles as of November 1, 2014. The text was analyzed by counting specific terms in each article along with the article’s total word count. For Python-oriented readers, I used the TextBlob package for the n-gram and count methods.
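The per-article counting step can be sketched roughly as follows. This is a minimal illustration using only the standard library as a stand-in for TextBlob; the function name count_terms, the simple regex tokenizer, and the sample sentence are all made up for this example and are not the script used for the analysis.

```python
import re
from collections import Counter

def count_terms(text, terms):
    """Return (per-term counts, total word count) for one article.

    Hypothetical helper: a plain regex tokenizer stands in for TextBlob.
    """
    # Lowercase the text and pull out word tokens, keeping apostrophes
    # that sit inside a word (e.g. "worker's").
    tokens = re.findall(r"[a-z0-9]+(?:'[a-z]+)?", text.lower())
    counts = Counter(tokens)
    # Look up each term case-insensitively; missing terms count as 0.
    return {term: counts[term.lower()] for term in terms}, len(tokens)

# Made-up sample text, not taken from any scraped article.
article = "Nurse Hickox challenged the quarantine. The quarantine ended."
term_counts, total_words = count_terms(article, ["quarantine", "Hickox"])
print(term_counts, total_words)
```

Matching is on whole single-word tokens, so ‘quarantined’ would not count toward ‘quarantine’ in this sketch; multi-word terms would need an n-gram step.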
To give an idea of what the collection looks like, there are about 100 articles per news outlet and topic, which is what Google returns on the first page of results. Duplicate articles and articles from non-outlet domains are removed (both filters are based on URLs), so the number might be less than 100. I’m also scraping Google’s news search site meant for normal web use, so some results have related-article links attached, which can push the total over 100.
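For the curious, URL-based filtering along these lines might look like the sketch below. It is an assumption about how such a filter could work, not the actual script: the function name filter_results, the rule that two URLs are duplicates when host and path match (ignoring ‘www.’, trailing slashes, and query strings), and the example.com URLs are all hypothetical.

```python
from urllib.parse import urlsplit

def filter_results(urls, outlet_domain):
    """Keep the first URL per article, restricted to the outlet's domain.

    Hypothetical helper; the duplicate rule (same host and path) is an
    assumption and could merge distinct articles on sites that identify
    articles by query string.
    """
    seen = set()
    kept = []
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Accept the outlet's domain itself or any of its subdomains.
        if host != outlet_domain and not host.endswith("." + outlet_domain):
            continue
        # Normalize: drop a leading "www." and any trailing slash.
        bare_host = host[4:] if host.startswith("www.") else host
        key = (bare_host, parts.path.rstrip("/"))
        if key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept

# Placeholder URLs on reserved example domains.
results = [
    "http://www.example.com/ebola/story-1",
    "http://www.example.com/ebola/story-1/",
    "http://www.example.com/ebola/story-1?utm=x",
    "http://blog.example.org/ebola/story-2",
    "http://health.example.com/ebola/story-3",
]
kept = filter_results(results, "example.com")
print(kept)
```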
Generally, longer articles can provide more detailed information or more complex arguments, so article length is also taken into account when calculating term counts for each outlet. The New York Times has by far the longest articles, while NBC News has the shortest.
I counted certain terms associated with Ebola and averaged the counts across all the articles. Not surprisingly, of the terms I chose, ‘quarantine’ appeared the most, with the most frequent mentions by Fox News. An associated term, ‘Hickox’, the name of the nurse who was quarantined in New Jersey and Maine, was also used often, but mostly by NBC News. Even though Fox News mentioned quarantining more often, it did not mention the nurse’s name nearly as often. Conversely, NBC News mentioned ‘Hickox’ more often than it did ‘quarantine’. Since this is just basic text analysis, I’m hesitant to draw too many conclusions about what this coverage bias means for each news outlet’s slant.
As with Ebola, I gathered term counts for articles about the midterm elections. There was less disparity between outlets in term frequency than there was for Ebola. The most notable pattern was that NPR had strikingly few explicit mentions of political parties or philosophies, possibly indicating a strategy of avoiding politicized articles. Fox News and NBC News differed the most in their use of the word ‘liberal’, which is slightly pejorative in conservative circles. This could be read as confirming the outlets’ well-known slants, but I would insist on further investigation and better evidence.
For those curious about how the term metrics are calculated, each one is the term count divided by the article word count, averaged over all the articles for the outlet and subject, so the measurements on the graphs are essentially average term proportions per article.
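As a rough sketch of that calculation, assuming each article has already been reduced to a dictionary of term counts plus a total word count (the function name average_term_proportion and the numbers below are invented for illustration):

```python
def average_term_proportion(articles, term):
    """Mean of (term count / word count) over one outlet's articles.

    articles: list of (term_counts, word_count) pairs. Hypothetical
    helper; articles with zero words are skipped to avoid dividing by 0.
    """
    proportions = [
        counts.get(term, 0) / words
        for counts, words in articles
        if words > 0
    ]
    return sum(proportions) / len(proportions) if proportions else 0.0

# Invented counts for three articles from one outlet and topic.
articles = [
    ({"quarantine": 2}, 100),  # 2 / 100 = 0.02
    ({"quarantine": 1}, 200),  # 1 / 200 = 0.005
    ({}, 50),                  # term absent, proportion 0
]
avg = average_term_proportion(articles, "quarantine")
print(round(avg, 5))
```

Note that this is a mean of per-article proportions, so a short article weighs as much as a long one. Pooling all the words first (3 mentions out of 350 words in the invented data) would give a slightly different number.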
This is just a basic analytical look at coverage bias in news articles, meaning bias in what a news outlet decides to cover or include in its articles. More articles, TV transcripts, and social media headlines and comments could provide a richer data set for analysis. Hopefully, I can also find emotionally charged words and evaluate opinions. That is all work for the future.