Analyzing Text Data

Sources of Text Data

If you are considering text mining and analysis for your research project, you must have text data to analyze. Your text data may be:

  • created as a part of your research, e.g. survey responses, interview transcripts
  • collated as part of your research, e.g. journal articles for literature review, writings of an author
  • collated by a third party, e.g. Senate enquiry transcripts, British National Corpus
  • collated via web scraping, e.g. news feeds, social media posts and comments, website content

Before you collect data, develop a sampling plan that identifies how much data you actually need and how you will select a representative data set.
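
As a rough illustration of one part of a sampling plan, the sketch below draws a reproducible simple random sample from a collection of documents. The function name `draw_sample`, the placeholder responses, and the sample size of 50 are assumptions made for this example only. Many projects will need a different design, such as stratified sampling, so treat this as a starting point rather than a recommended method.

```python
import random

def draw_sample(documents, sample_size, seed=42):
    """Return a simple random sample of documents, without replacement."""
    if sample_size > len(documents):
        raise ValueError("sample_size exceeds the number of available documents")
    # A fixed seed lets you (and reviewers) reproduce the same sample later.
    rng = random.Random(seed)
    return rng.sample(documents, sample_size)

# Hypothetical collection of 500 survey responses (placeholder strings).
responses = [f"response_{i:03d}" for i in range(500)]

sample = draw_sample(responses, sample_size=50)
print(len(sample))
```

Recording the seed and the sample size alongside your data makes it easier to document how the subset was chosen.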

Finding Text Data in Library Databases

 

Flow chart for deciding whether you can scrape a library resource for text data. If the resource has no dedicated API or terms of service, contact a librarian.

 

The Library's databases do not allow web scraping due to license agreements with the publishers. However, you may be able to collect data from text sources within the databases, as long as you are using the data exclusively for academic purposes. Each publisher has its own terms, conditions, and copyright provisions, which should be followed at all times. Please contact us if you have questions about using the Library's databases as a data source for your research.

If you are considering collecting data from Library databases, please keep in mind:

  • Some publishers will require you to use tools they provide to mine their content, or will do the research for you. In this way, they can manage the data being accessed and the impact on their servers.

  • Downloading large amounts of data can trigger automatic lockouts and prevent access to resources by other users. The Library or the user can also be fined for unauthorized use of the databases. (Yes, they really do follow up on this!)

 

Definitions

  • Text mining is an umbrella term for using computer programs and algorithms to dig through large amounts of text, like books, articles, websites, or social media posts, to find valuable and hidden information. As such, it refers to the methods applied to a corpus of textual data, rather than to the methods of obtaining such data.

  • APIs (application programming interfaces), in this context, are a means of obtaining data (textual or otherwise) in an efficient and automated fashion, though they typically require some programming knowledge to use.

  • Bulk downloading refers to a means of obtaining large quantities of data from the database's public user interface, either automatically through a feature of the interface specifically provided for that purpose, or manually (i.e., by downloading many separate batches of results). Individual databases vary in how much a user can download at one time, so please read the guidelines or consult with a librarian before bulk downloading.

  • Web scraping refers to automating the extraction of data from a public-facing website. Unlike bulk downloading, web scraping requires some programming knowledge. Web scraping is not allowed in most library databases.
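
To make the definition of text mining above more concrete, here is a minimal sketch of one very simple technique: counting word frequencies across a small corpus. The two example sentences, the stopword list, and the function name `word_frequencies` are invented for illustration. Real projects usually rely on more careful tokenization and a dedicated text analysis library.

```python
import re
from collections import Counter

def word_frequencies(texts, stopwords=frozenset()):
    """Count how often each word appears across a list of texts."""
    counts = Counter()
    for text in texts:
        # Lowercase the text and keep runs of letters and apostrophes as tokens.
        tokens = re.findall(r"[a-z']+", text.lower())
        counts.update(token for token in tokens if token not in stopwords)
    return counts

# Hypothetical two-document corpus, for illustration only.
corpus = [
    "The library provides access to many databases.",
    "Contact the library before downloading data from databases.",
]
stopwords = frozenset({"the", "to", "from", "before", "many"})

freqs = word_frequencies(corpus, stopwords)
print(freqs.most_common(2))
```

Note that this sketch covers only the analysis step. How the texts are obtained (API, bulk download, or otherwise) is a separate question governed by each publisher's terms.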

GW Libraries • 2130 H Street NW • Washington DC 20052 • 202.994.6558 • AskUs@gwu.edu