DSVIL 2018

DSVIL is a week-long course in which librarians immerse themselves in data science and visualization. Here’s how to apply.

I’m excited to have been invited back, with an expanded session, to teach at NCSU Libraries’ Data Science and Visualization Institute for Librarians. Last year I taught an engaged group of librarians how to scrape data from web pages using OpenRefine and webscraper.io.

This year I have the opportunity to expand my offering. The schedule is posted at the DSVIL site. I teach on Wednesday, June 6. Below is a draft of what I plan to cover.

Web Scraping

Clean, preexisting data sets such as the General Social Survey (GSS) or Census data are readily available, cover long periods of time, and have well-documented codebooks. Meanwhile, researchers increasingly want to gather their own data from websites, which introduces a different layer of complexity. Accessing content from web sources requires different tools and new techniques. In this workshop we will use webscraper.io to crawl a website and gather text from an online newsletter.
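The workshop itself uses webscraper.io, a point-and-click browser tool, so no code is required. For readers curious about what "gathering text" from a page amounts to, here is a minimal Python sketch using only the standard library. The sample HTML and the names ParagraphText and extract_paragraphs are made up for illustration; a real newsletter page would need its own selection rules.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the visible text inside each <p> element (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            # Join the text fragments, then collapse runs of whitespace.
            self.paragraphs.append(" ".join("".join(self._buf).split()))
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def extract_paragraphs(html):
    """Return a list of paragraph strings found in an HTML document."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical newsletter snippet, standing in for a downloaded page.
SAMPLE = """<html><body><h1>Library News</h1>
<p>The reading room reopens   in <b>June</b>.</p>
<p>New data workshops start soon.</p></body></html>"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE):
        print(paragraph)
```

The heading is skipped and only paragraph text is kept, which is roughly the kind of choice you make when you define selectors in a scraping tool.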

Data Cleaning

OpenRefine is a tool used to impose structure upon semi-structured data. The often-intuitive interface is a great convenience. Powerful and extensible methods for normalizing data make OpenRefine a go-to option for quick and easy data transformations. Categorical facets can be easily exposed and then leveraged for simple data cleanup. Bulk data clustering options are so easy that the process feels like good nerdy fun. Few tools are better suited for bulk data cleaning. This hands-on workshop will explore how OpenRefine can help with common data cleaning challenges.
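To give a feel for what clustering does behind the scenes, here is a rough Python sketch loosely modeled on the key-collision "fingerprint" idea: normalize each value to a key, then group values that share a key. This is an approximation for illustration, not OpenRefine's actual code, and the function names and sample values are my own.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Build a normalized key so that near-duplicate strings collide."""
    # Trim surrounding whitespace and lowercase.
    text = value.strip().lower()
    # Decompose accented characters and drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, de-duplicate, sort, and rejoin.
    return " ".join(sorted(set(text.split())))


def cluster(values):
    """Group values by fingerprint; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    # Hypothetical messy column values.
    names = ["NCSU Libraries", "ncsu libraries ", "Libraries, NCSU", "Duke University"]
    for group in cluster(names):
        print(group)
```

In OpenRefine you would review each suggested cluster and pick the spelling to keep; the sketch stops at proposing the groups.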

Parsing HTML/JSON, API Orchestration, and Twitter

As time allows, we will build on our newly developed OpenRefine knowledge to move beyond beginner web scraping techniques. Using OpenRefine, we will gather and clean data from less structured web pages. Following a discussion about Application Programming Interfaces (APIs), we will use the TAGS tool to gather Twitter data.
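As a preview of the JSON-parsing idea, here is a small Python sketch that turns a nested API-style response into flat rows suitable for a spreadsheet. The response shape, field names, and the to_rows helper are invented for this example; they are not the format of Twitter's API or of TAGS output.

```python
import json

# Hypothetical API response; real services define their own structure.
RESPONSE = '''{"results": [
  {"id": 1, "user": {"name": "ada"}, "text": "Hello   #dataviz"},
  {"id": 2, "user": {"name": "grace"}, "text": "Cleaning data"}
]}'''


def to_rows(payload):
    """Parse a JSON string and flatten each result into a simple dict."""
    data = json.loads(payload)
    rows = []
    for item in data.get("results", []):
        rows.append({
            "id": item.get("id"),
            # Reach into the nested "user" object, tolerating missing keys.
            "author": item.get("user", {}).get("name", ""),
            # Collapse repeated whitespace in the message text.
            "text": " ".join(item.get("text", "").split()),
        })
    return rows


if __name__ == "__main__":
    for row in to_rows(RESPONSE):
        print(row)
```

The same move, picking fields out of nested records and laying them out as columns, is what we will do interactively in OpenRefine.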