Analyzing Text Data

Web Scraping Best Practices

Web scraping is a powerful tool, but many websites and web content providers discourage scraping of their sites or actively deter it. Their reasons for doing so may include the following:

  • the site's owners want to maintain control over their content and its dissemination, and they don't want people to access that content in bulk;
  • the site's administrators want to prevent malicious actors from accessing the site; such actors often use techniques that are indistinguishable from web-scraping techniques;
  • the site's administrators want to conserve bandwidth and server resources for non-automated uses of the site.

Web scraping a site that does not permit it may result in your being blocked from accessing the site altogether.

When web scraping a site, always do the following:

  • Consult the robots.txt file, if it exists. Usually located in the root directory of a website, this file specifies which (if any) portions of the site are available to be scraped, and which portions the site administrators have designated as off limits.
  • Use rate limiting. Sometimes rate limits are specified in the robots.txt file; if they are not, it is best practice to insert a delay of at least one second between automated requests (see the Python sketch after this list). This delay helps prevent your script from overwhelming the site's servers, and it makes it less likely that your script will be flagged as a malicious actor.
  • If you obtain data from web scraping, be careful when sharing that data publicly. Even if the data is publicly available on the source website, it may be published under a license that limits or prohibits public re-use. Sharing your dataset publicly without explicit permission may leave you open to legal challenges.
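
The sketch below is one way to tie these practices together in Python (one of the languages covered later in this guide). It assumes a hypothetical site (example.com), hypothetical page paths, and the third-party requests library; the robots.txt handling uses Python's built-in urllib.robotparser. Adapt the URLs, user agent, and parsing logic to your own project.

    import time
    import urllib.robotparser

    import requests

    BASE_URL = "https://example.com"    # hypothetical site
    USER_AGENT = "my-research-bot"      # identify your script honestly

    # 1. Consult robots.txt before requesting anything else.
    robots = urllib.robotparser.RobotFileParser()
    robots.set_url(f"{BASE_URL}/robots.txt")
    robots.read()

    # Respect a crawl delay if the site specifies one; otherwise default to one second.
    delay = robots.crawl_delay(USER_AGENT) or 1

    pages = ["/page1.html", "/page2.html"]    # hypothetical paths to scrape

    for path in pages:
        url = BASE_URL + path
        # 2. Skip any path the site's administrators have marked off limits.
        if not robots.can_fetch(USER_AGENT, url):
            print(f"Skipping {url}: disallowed by robots.txt")
            continue
        response = requests.get(url, headers={"User-Agent": USER_AGENT})
        print(url, response.status_code)
        # 3. Rate limit: pause between requests to avoid overwhelming the server.
        time.sleep(delay)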
