Analyzing Text Data

Web Scraping Best Practices

Web scraping is a powerful tool, but many websites and web content providers discourage scraping of their sites or actively deter it. Their reasons for doing so may include the following:

  • the site's owners want to maintain control over their content and its dissemination, and they don't want people to access that content in bulk;
  • the site's administrators want to prevent malicious actors from accessing the site; such actors often use techniques that are indistinguishable from web-scraping techniques;
  • the site's administrators want to conserve bandwidth and server resources for non-automated uses of the site.

Web scraping a site that does not permit it may result in your being blocked from accessing the site altogether.

When web scraping a site, always do the following:

  • Consult the robots.txt file, if it exists. Usually located in the root directory of a website, this file specifies which (if any) portions of the site are available to be scraped, and which portions the site administrators have designated as off limits.
  • Use rate limiting. Sometimes rate limits are specified in the robots.txt file; if they are not, it is best practice to insert a delay of at least one second between automated requests (see the Python sketch after this list). This delay helps prevent your script from overwhelming the site's servers, and it makes it less likely that your script will be flagged as a malicious actor.
  • If you obtain data from web scraping, be careful when sharing that data publicly. Even if the data is publicly available on the source website, it may be published under a license that limits or prohibits public re-use. Sharing your dataset publicly without explicit permission may leave you open to legal challenges.
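
The sketch below is one way to tie these practices together in Python (one of the languages covered later in this guide). It assumes a hypothetical site (example.com), hypothetical page paths, and the third-party requests library; the robots.txt handling uses Python's built-in urllib.robotparser. Adapt the URLs, user agent, and parsing logic to your own project.

    import time
    import urllib.robotparser

    import requests

    BASE_URL = "https://example.com"    # hypothetical site
    USER_AGENT = "my-research-bot"      # identify your script honestly

    # 1. Consult robots.txt before requesting anything else.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{BASE_URL}/robots.txt")
    robots.read()

    # Respect a crawl delay if the site specifies one; otherwise default to one second.
    delay = robots.crawl_delay(USER_AGENT) or 1

    pages = ["/page1.html", "/page2.html"]    # hypothetical paths to scrape

    for path in pages:
        url = BASE_URL + path
        # 2. Skip any path the site's administrators have marked off limits.
        if not robots.can_fetch(USER_AGENT, url):
            print(f"Skipping {url}: disallowed by robots.txt")
            continue
        response = requests.get(url, headers={"User-Agent": USER_AGENT})
        print(url, response.status_code)
        # 3. Rate limit: pause between requests to avoid overwhelming the server.
        time.sleep(delay)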
