
Analyzing Text Data

Choosing a Method

Choosing the right text mining method is crucial because it significantly impacts the quality of insights and information you can extract from your text data. Each method provides different insights and requires different amounts of data, training, and iteration. Before you search for data, it is essential that you:

  1. identify the goals of your analysis
  2. determine the method you will use to meet those goals
  3. identify how much data you need for that method
  4. develop a sampling plan to build a data set that accurately represents your object of study.

Starting with this information in mind will make your project go more quickly and smoothly, and help you avoid common hurdles such as incomplete data, too much or too little data, or problems with access to data.


How Much Data Do You Need?

Before you start collecting data, think about how much data you really need. New researchers in text analysis often want to collect every source mentioning their topic, but this is usually not the best approach. Collecting so much data takes a lot of time, uses many computational resources, often goes against platform terms of service, and doesn't necessarily improve analysis.

A key idea in text analysis is saturation: the point at which adding more data no longer significantly improves performance. Saturation occurs when the model has learned as much as it can from the available data, and no new patterns or themes emerge with additional data. Researchers often use experimentation and learning curves to determine when saturation occurs; you can start by analyzing a small or mid-sized dataset and see whether adding more data changes the results.
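One simple way to probe saturation is to track how many new unique words each additional document contributes: when that number flattens near zero, additional documents are adding little new vocabulary. This is only a rough proxy for a full learning-curve experiment, and the toy documents below are invented for illustration:

```python
def vocabulary_growth(documents):
    """Track how many new unique words each additional document contributes.

    When the counts flatten near zero, the corpus is approaching saturation.
    """
    seen = set()
    new_words_per_doc = []
    for doc in documents:
        words = set(doc.lower().split())
        new_words_per_doc.append(len(words - seen))
        seen |= words
    return new_words_per_doc

docs = [
    "the coffee was strong and hot",
    "the tea was weak and cold",
    "the coffee was hot",
    "the tea was cold",
]
print(vocabulary_growth(docs))  # → [6, 3, 0, 0]
```

Here the third and fourth documents add nothing new, suggesting this (tiny) corpus saturated quickly; on real data you would expect the curve to decline gradually rather than hit zero.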

Once you know your research question, the next step is to create a sampling plan. In text analysis, sampling means selecting a representative subset of data from a larger dataset for analysis. This subset, called the sample, aims to capture the diversity of the overall dataset. The goal is to analyze this smaller portion to draw conclusions about the information in the entire dataset.

For example, in a large collection of customer reviews, sampling may involve randomly selecting a subset for sentiment analysis instead of analyzing every single review. This approach saves computational resources and time while still providing insights into the overall sentiment distribution of the entire dataset. It's crucial to ensure that the sample accurately reflects the diversity of sentiments in the complete dataset for valid and reliable generalizations.

Example Sampling Plans

Sampling plans for text analysis involve selecting a subset of text data for analysis rather than analyzing the entire dataset. Here are two common sampling plans for text analysis:

  1. Random Sampling:

    • Description: Randomly select a subset of text documents from the entire dataset.
    • Process: Assign each document a unique identifier and use a random number generator to choose documents for inclusion in the sample.
  2. Stratified Sampling:

    • Description: Divide the dataset into distinct strata or categories based on certain characteristics (e.g., product types, genres, age groups, race or ethnicity). Then, randomly sample from each stratum.
    • Process: Divide the dataset into strata, and within each stratum, use random sampling to select a representative subset.

Remember, the choice of sampling plan depends on the specific goals of the analysis and the characteristics of the dataset. Random sampling is straightforward and commonly used when there's no need to account for specific characteristics in the dataset. Stratified sampling is useful when the dataset has distinct groups, and you want to ensure representation from each group in the sample.
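Both sampling plans can be sketched in a few lines of Python using only the standard library. The review records, field names, and sample sizes below are hypothetical, chosen just to illustrate the two procedures:

```python
import random
from collections import defaultdict

def random_sample(documents, n, seed=42):
    """Randomly select n documents from the full collection."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    return rng.sample(documents, n)

def stratified_sample(documents, strata_key, n_per_stratum, seed=42):
    """Group documents into strata, then randomly sample within each stratum."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for doc in documents:
        strata[strata_key(doc)].append(doc)
    sample = []
    for label, docs in strata.items():
        sample.extend(rng.sample(docs, min(n_per_stratum, len(docs))))
    return sample

reviews = [
    {"product": "coffee", "text": "Great aroma"},
    {"product": "coffee", "text": "Too bitter"},
    {"product": "tea", "text": "Very soothing"},
    {"product": "tea", "text": "Weak flavor"},
]
# One randomly chosen review per product category:
subset = stratified_sample(reviews, strata_key=lambda d: d["product"], n_per_stratum=1)
print(len(subset))  # → 2
```

Fixing the random seed, as above, is a common practice so that the same sample can be reproduced when the analysis is rerun.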

Exactly How Many Sources Do I Need?

Determining the amount of data needed for text analysis involves a balance between having enough data to train a reliable model and avoiding unnecessary computational costs. The ideal dataset size depends on several factors, including the complexity of the task, the diversity of the data, and the specific algorithms or models being used.

  • Task Complexity: If you are doing a simple task, like sentiment analysis or basic text classification, a few dozen articles might be enough. More complex tasks, like language translation or summarization, often require datasets on the scale of tens of thousands to millions.
  • Model Complexity: Simple models like Naive Bayes often perform well with smaller datasets, whereas complex models, such as deep learning models with many parameters, will require larger datasets for effective training.
  • Data Diversity: Ensure that the dataset is diverse and representative of the various scenarios the model will encounter. A more diverse dataset can lead to a more robust and generalizable model. A large dataset that is not diverse will yield worse results than a smaller, more diverse dataset.
  • Domain-Specific Considerations: Sometimes there is not a lot of data available, and it is okay to make do with what you have!
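As a concrete illustration of the model-complexity point, here is a minimal multinomial Naive Bayes classifier written with only the Python standard library. The four labeled reviews are invented for illustration, and a real project would more likely use an established implementation such as scikit-learn's MultinomialNB:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesTextClassifier:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)            # documents per class
        self.word_counts = defaultdict(Counter)        # word counts per class
        self.vocab = set()
        for text, label in zip(texts, labels):
            for word in text.lower().split():
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def predict(self, text):
        best_label, best_score = None, float("-inf")
        total_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior for the class
            score = math.log(self.class_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            # add smoothed log likelihood of each word in the input
            for word in text.lower().split():
                count = self.word_counts[label][word]
                score += math.log((count + 1) / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label

clf = NaiveBayesTextClassifier().fit(
    ["great product love it", "terrible waste of money",
     "love this great buy", "awful terrible quality"],
    ["pos", "neg", "pos", "neg"],
)
print(clf.predict("great quality love it"))  # → pos
```

Even with four training examples the model separates these toy classes, which is the point of the bullet above: simple models can be serviceable on small datasets, while deep learning models generally cannot.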

Start by taking a look at articles in your field that have done a similar analysis. What approaches did they take? You can also schedule an appointment with a Data Services Librarian to help you get started.


Word Frequency Analysis

Word frequency analysis in text mining is a technique that involves counting how often each word appears in a given collection of text data, such as documents, articles, or web pages. It helps identify the most frequently occurring words and their frequencies. This analysis is essential for understanding the importance and prevalence of words within the text, which can be used for tasks like identifying keywords, determining common themes, or detecting anomalies in a dataset. Word frequency analysis provides valuable insights into the structure and content of textual information, aiding in various text mining and natural language processing tasks.

[Figure: Word frequency analysis of "coffee" and "tea" from the HathiTrust database; coffee is more common than tea after 1907.]
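Word counting can be sketched in a few lines of Python with the standard library's Counter. The two-document corpus below is hypothetical:

```python
import re
from collections import Counter

def word_frequencies(documents):
    """Count how often each word appears across a collection of texts."""
    counts = Counter()
    for doc in documents:
        # lowercase and keep only alphabetic tokens
        counts.update(re.findall(r"[a-z]+", doc.lower()))
    return counts

docs = [
    "Coffee in the morning, coffee at noon.",
    "Tea in the afternoon; coffee after dinner.",
]
freqs = word_frequencies(docs)
print(freqs.most_common(3))  # → [('coffee', 3), ('in', 2), ('the', 2)]
```

In practice you would usually also remove stopwords (like "in" and "the" above) so that content words dominate the frequency list.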


Machine Learning/Natural Language Processing

Machine learning for text analysis is a technology that teaches computers to understand and interpret written language by exposing them to examples. There are two types of machine learning for text analysis: supervised learning, in which a human helps to train the computer to detect patterns, and unsupervised learning, which enables computers to automatically categorize, analyze, and extract information from text without needing explicit programming.

Machine learning for text analysis is closely tied to Natural Language Processing (NLP), a field of artificial intelligence that involves the development and application of algorithms to automatically process, understand, and extract meaningful information from human language in textual form. NLP techniques are used to analyze and derive insights from large volumes of text data, enabling tasks such as sentiment analysis, named entity recognition, text classification, and language translation. The aim is to equip computers with the capability to comprehend and interpret written language, making it possible to automate various aspects of text-based information processing.


Sentiment Analysis

Sentiment analysis is a method of determining whether the emotional tone or sentiment expressed in a piece of text is positive, negative, or neutral. Businesses commonly use sentiment analysis to gauge customer feedback, monitor social media, and conduct market research.
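The simplest form of sentiment analysis is lexicon-based: count how many words in a text appear in positive and negative word lists. The tiny lexicons below are invented for illustration; real work would use an established lexicon or model (for example, NLTK's VADER):

```python
# Illustrative word lists -- a real analysis would use an established lexicon.
POSITIVE = {"good", "great", "love", "excellent", "happy"}
NEGATIVE = {"bad", "terrible", "hate", "awful", "sad"}

def sentiment(text):
    """Label text positive, negative, or neutral by counting lexicon hits."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(sentiment("I love this great coffee"))  # → positive
print(sentiment("The service was terrible"))  # → negative
```

Lexicon counting misses negation and sarcasm ("not great" scores positive here), which is why machine learning approaches are often preferred for serious analysis.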


GW Libraries • 2130 H Street NW • Washington DC 20052 • 202.994.6558 • AskUs@gwu.edu