
TDM Studio

ProQuest TDM Studio is a text- and data-mining platform that provides researchers with access to vast amounts of ProQuest's content for computational analysis.

Sentiment Analysis Data Visualization

Sentiment Analysis, or Affect Classification, can be valuable for many different research and learning objectives. Common questions that researchers explore using Sentiment Analysis include:

  • What public emotions drive successful presidential campaigns? Which emotions drive social control and power? Do U.S. presidents win elections with fear, anger, or love?
  • How does public sentiment about a company (e.g. Tesla or GameStop), as reported by newspapers, relate to company stock price over time? 
  • What is the long-term emotional impact of collective trauma? How does public sentiment change (and recover?) in response to tragic events?  

For this documentation walkthrough, we look specifically at one month of newspaper coverage: September 2001. How does collective emotion, as expressed in a newspaper such as The New York Times, change and respond following a tragic event such as the September 11, 2001 terrorist attack? We create a dataset of the 8851 articles that The New York Times published in September 2001.

Affect Classification

Sentiment Analysis research often attempts to assign a positive or negative score (e.g. Likert scale) to text at the sentence level. Different sentiment analysis systems use different scales and approaches to the problem. One common approach is a scale of 1 to 5, running from very negative to very positive. This approach also often overlaps with opinion mining and the analysis of product reviews.

For TDM Studio, however, we attempt to assign an affective state or emotion to each sentence in a document. This approach is slightly different because it focuses on assigning emotions instead of a positive or negative score to text.

We use BERT-based sentence embeddings to represent each sentence in a dataset. We then train a model on the sentence embeddings to predict the probability of each sentence being assigned to each affective state (i.e. ‘Anger’, ‘Disgust’, ‘Fear’, ‘Sadness’, ‘Happiness’, ‘Love’, ‘Surprise’, ‘Neutral’, ‘Other’). Note that the emotions expressed vary between domains: the emotions that are important in a teaching and learning context are different from those in a research context. For TDM Studio, we chose primary emotions based on research by Ekman and others. The initial work for this classification system was developed as part of a pilot exploratory research project with the University of Michigan.
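The step from a sentence embedding to per-class probabilities can be sketched as follows. This is a minimal illustration only: the text does not say which classifier TDM Studio trains, so the linear layer with a softmax, the three-dimensional toy embedding, and the weights below are all assumptions made for the example.

```python
import math

# The nine affect classes named in the text.
AFFECT_CLASSES = ["Anger", "Disgust", "Fear", "Sadness", "Happiness",
                  "Love", "Surprise", "Neutral", "Other"]


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_sentence(embedding, weights, biases):
    """Return {class: probability} for one sentence embedding.

    `weights` holds one row per class; each row is as long as the embedding.
    """
    scores = [sum(w * x for w, x in zip(row, embedding)) + b
              for row, b in zip(weights, biases)]
    return dict(zip(AFFECT_CLASSES, softmax(scores)))


# Toy embedding and made-up weights (assumptions, not a trained model).
embedding = [0.2, -0.1, 0.7]
weights = [[0.1 * (i + 1), -0.05 * i, 0.02 * i]
           for i in range(len(AFFECT_CLASSES))]
biases = [0.0] * len(AFFECT_CLASSES)

probs = classify_sentence(embedding, weights, biases)
```

Whatever model is used in practice, the output has the same shape: one probability per affective state for every sentence, with the nine probabilities summing to 1.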

For training the classification model, we use a combination of newspaper and literary data. The training data used for affect classification will impact the results of the classifier, depending on the task and the time period.

In the first ten days of September 2001, we can see that the most common emotions expressed in The New York Times are Neutral, Happiness, and Surprise. We can also see that the less common emotions are Fear, Love, and Anger. Looking at different newspapers as well as different time periods, do the most common emotions change over time? Were the most common emotions for the first ten days of September 2020 the same as the most common emotions for the previous one hundred Septembers?

Sentiment Over Time

For each document in the dataset, we break up the document into sentences. Each sentence is then assigned an affect probability for each of the nine classes. These sentence probabilities are averaged at the document level to create document-level scores which are presented in the article drawer. The document-level probabilities are then averaged for all of the documents in a specific publication date range to track affect over time.
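The two averaging steps described above can be sketched in a few lines. The data layout (a dictionary of documents, each with a `date` and a list of per-sentence probability dictionaries) and the function names are assumptions for illustration, and the toy data uses only two of the nine classes to keep the example short.

```python
from collections import defaultdict
from statistics import mean


def average_probabilities(rows):
    """Average a list of {emotion: probability} dicts key by key."""
    emotions = rows[0].keys()
    return {e: mean(r[e] for r in rows) for e in emotions}


def document_scores(documents):
    """Map each document id to the mean of its sentence-level probabilities."""
    return {doc_id: average_probabilities(doc["sentences"])
            for doc_id, doc in documents.items()}


def scores_by_date(documents):
    """Average document-level scores over documents sharing a publication date."""
    doc_scores = document_scores(documents)
    grouped = defaultdict(list)
    for doc_id, doc in documents.items():
        grouped[doc["date"]].append(doc_scores[doc_id])
    return {date: average_probabilities(rows) for date, rows in grouped.items()}


# Made-up toy data: three documents, two emotion classes.
documents = {
    "a": {"date": "2001-09-12",
          "sentences": [{"Fear": 0.6, "Neutral": 0.4},
                        {"Fear": 0.2, "Neutral": 0.8}]},
    "b": {"date": "2001-09-12",
          "sentences": [{"Fear": 0.8, "Neutral": 0.2}]},
    "c": {"date": "2001-09-10",
          "sentences": [{"Fear": 0.1, "Neutral": 0.9}]},
}

per_document = document_scores(documents)
per_date = scores_by_date(documents)
```

Note that this is an average of document averages, as the text describes, so a short article counts as much as a long one when the scores are combined for a date.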

For the example of the September 11 terrorist attack, we can see that emotion changes drastically on September 12th, the day after the attack. The proportion of negative emotions, namely Sadness and Fear, increases by 50-100%. Both of these emotions remain at higher levels for the remainder of September. How long does it take for expressed Sadness and Fear to return to pre-Sept. 11 levels? Interestingly, we can also see that other negative emotions, such as Anger and Disgust, do not increase following the attack.
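One way to put numbers on these observations is to compute the percentage change in a daily emotion score and to look for the first later date on which the score falls back to its earlier level. The sketch below assumes a dictionary of daily scores keyed by ISO date strings; the values are invented for illustration and are not taken from the dataset.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100


def first_recovery_date(series, baseline, start):
    """Return the first date after `start` with a value at or below `baseline`.

    Dates are ISO strings (YYYY-MM-DD), so sorting them as text sorts them
    chronologically. Returns None if the series never recovers.
    """
    for date in sorted(series):
        if date > start and series[date] <= baseline:
            return date
    return None


# Hypothetical daily Fear scores (made-up numbers).
fear = {
    "2001-09-10": 0.08,
    "2001-09-12": 0.14,
    "2001-09-20": 0.11,
    "2001-09-30": 0.10,
}

change = relative_change(fear["2001-09-10"], fear["2001-09-12"])
recovered = first_recovery_date(fear, baseline=fear["2001-09-10"],
                                start="2001-09-12")
```

With these toy values the change is roughly 75%, and `recovered` is `None` because the score never returns to its September 10 level within the month. Answering the question above would mean extending the date range beyond September.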

List of Articles

By clicking either on a specific emotion line or on a specific date on the emotion line, we can see the articles in the dataset with high expressed emotion. The list is ordered so that the articles with the highest scores for the selected emotion appear first.
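The ordering of this list can be reproduced outside the interface with a simple sort. The record layout and the function name `top_articles` below are assumptions for illustration, and the titles and scores are placeholders.

```python
def top_articles(articles, emotion, date=None, limit=10):
    """Return articles sorted by descending score for `emotion`.

    If `date` is given, only articles published on that date are considered.
    """
    selected = [a for a in articles if date is None or a["date"] == date]
    ranked = sorted(selected, key=lambda a: a["scores"][emotion], reverse=True)
    return ranked[:limit]


# Placeholder records with made-up document-level scores.
articles = [
    {"title": "Article A", "date": "2001-09-12",
     "scores": {"Fear": 0.31, "Sadness": 0.12}},
    {"title": "Article B", "date": "2001-09-12",
     "scores": {"Fear": 0.45, "Sadness": 0.05}},
    {"title": "Article C", "date": "2001-09-13",
     "scores": {"Fear": 0.50, "Sadness": 0.40}},
]

fear_on_sept_12 = [a["title"] for a in top_articles(articles, "Fear", "2001-09-12")]
sadness_overall = [a["title"] for a in top_articles(articles, "Sadness")]
```

Passing a date mirrors clicking a point on an emotion line, while omitting it mirrors clicking the line itself.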

The emotion score next to the document title is the average score across sentences for the specific document. An interesting classroom activity is to select an article with a high score for a single emotion, read the article, and see whether you agree with the affect classification.

For example, if we click on Fear for September 12th, we can see some of the headlines following the attacks: “More on the Attacks”, “When an Open Society is Wielded as a Weapon Against Itself”, “Bush Aides Say Attacks Don’t Recast Shield Debate”.

Similarly, when we select Sadness over time, we see many obituaries at the top of the list.

Export Data

For each dataset, you can export the emotion data as well as the metadata.

The documentmetadata file includes metadata for the articles in your dataset. The emotion_docs file includes emotion probabilities at the document level. The sentences within the article which have the highest expressed emotion can be determined using this emotion_docs file.

The emotion_time file aggregates these document-level emotion probabilities over time. 
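Once exported, the files can be combined in a short script. The sketch below is hedged heavily: the export format and column names are not described here, so it assumes CSV files with a shared `doc_id` column, a `title` column in the metadata, and one column per emotion. Inline strings stand in for the real files; check your own export and adjust the names accordingly.

```python
import csv
import io

# Stand-ins for the exported files. The real format and columns may differ.
METADATA_CSV = """doc_id,title,date
1,Article A,2001-09-12
2,Article B,2001-09-12
"""
EMOTION_DOCS_CSV = """doc_id,Fear,Sadness,Neutral
1,0.50,0.20,0.30
2,0.10,0.60,0.30
"""


def read_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def dominant_emotions(metadata_rows, emotion_rows, id_column="doc_id"):
    """Map each article title to its highest-scoring emotion."""
    titles = {row[id_column]: row["title"] for row in metadata_rows}
    result = {}
    for row in emotion_rows:
        scores = {k: float(v) for k, v in row.items() if k != id_column}
        result[titles[row[id_column]]] = max(scores, key=scores.get)
    return result


dominant = dominant_emotions(read_rows(METADATA_CSV),
                             read_rows(EMOTION_DOCS_CSV))
```

To use real exports, replace `io.StringIO(text)` with a file opened via `open(path, newline="")`.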

Additional Recommended Reading

Brahma, S., 2018. Improved sentence modeling using suffix bidirectional LSTM. arXiv preprint arXiv:1805.07340.

Herzig, J., Shmueli-Scheuer, M. and Konopnicki, D., 2017, October. Emotion detection from text via ensemble classification using word embeddings. In Proceedings of the ACM SIGIR international conference on theory of information retrieval (pp. 269-272).

Mac Kim, S., Valitutti, A. and Calvo, R.A., 2010, June. Evaluation of unsupervised emotion models to textual affect recognition. In Proceedings of the NAACL HLT 2010 Workshop on Computational Approaches to Analysis and Generation of Emotion in Text (pp. 62-70).

Reimers, N. and Gurevych, I., 2019. Sentence-BERT: Sentence embeddings using Siamese BERT-networks. arXiv preprint arXiv:1908.10084.

Silge, J. and Robinson, D., 2017. Text mining with R: A tidy approach. O'Reilly Media, Inc.

GW Libraries • 2130 H Street NW • Washington DC 20052 • 202.994.6558 • AskUs@gwu.edu