
Pragmatic Datafication: data cleaning, web scraping, Twitter gathering, and parsing


Preexisting, clean data sets such as the General Social Survey (GSS) or Census data are readily available, cover long periods of time, and have well-documented codebooks. Increasingly, however, researchers want to gather their own data from websites, which introduces a different layer of complexity. These materials show how to use easily accessible tools to impose structure upon semi-structured data.
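To make "imposing structure upon semi-structured data" concrete, here is a minimal sketch in Python. The exercises themselves use OpenRefine and TAGS, so this is only an illustration of the idea, not the method taught in the slides. The sample HTML, the `LinkCollector` class, and the `extract_links` function are all hypothetical names invented for this example; only the standard-library `html.parser` module is assumed.

```python
from html.parser import HTMLParser

# Hypothetical snippet standing in for a scraped web page.
SAMPLE_HTML = """
<ul>
  <li><a href="/reports/2019">Annual report 2019</a></li>
  <li><a href="/reports/2020">Annual report 2020</a></li>
</ul>
"""


class LinkCollector(HTMLParser):
    """Collect each <a> element as a {'href': ..., 'text': ...} record."""

    def __init__(self):
        super().__init__()
        self.records = []
        self._current = None  # record being built while inside an <a> tag

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._current = {"href": dict(attrs).get("href"), "text": ""}

    def handle_data(self, data):
        # Accumulate text only while we are inside a link.
        if self._current is not None:
            self._current["text"] += data

    def handle_endtag(self, tag):
        if tag == "a" and self._current is not None:
            self._current["text"] = self._current["text"].strip()
            self.records.append(self._current)
            self._current = None


def extract_links(html):
    """Turn loosely structured HTML into a list of uniform records."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.records


rows = extract_links(SAMPLE_HTML)
for row in rows:
    print(row["href"], "|", row["text"])
```

Each link becomes one row with the same two fields, which is the kind of tabular shape that a tool such as OpenRefine can then clean and reconcile.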



Slides are divided into the following sections:

  1. Index
  2. Introduction
  3. Web Scraping
  4. OpenRefine: Data Cleaning Basics
  5. OpenRefine: Reconciliation
  6. Capturing Twitter Data
  7. APIs & JSON Parsing
  8. More HTML Parsing


Web Scraping

  1. Web Scraping

Data Cleaning

  1. Data Cleaning – Basic Transformation with OpenRefine (Exercise 1)
  2. Data Cleaning – GREL (Exercise 2)
  3. Reconciliation with OpenRefine

Social Media

  1. Social Media – Twitter gathering with TAGS app (Exercise 1)
  2. Social Media – Twitter: TAGS visualization and tools

API & JSON Parsing

  1. APIs & JSON Parsing – OpenRefine (Exercise 1)
  2. APIs – Using API Keys (Exercise 2)

HTML Parsing

  1. Intro to HTML Parsing: Steps 1-6 (Exercise 1)
  2. More OpenRefine – Looping Control: Steps 7-end (Exercise 2 – this section introduces more advanced features of OpenRefine, using HTML parsing as the example exercise)