Invited workshop offered at the NCSU Libraries’ Data Science & Visualization Institute for Librarians. 2018.
Web Scraping: Gathering Data From Websites
Preexisting, clean data sets such as the General Social Survey (GSS) or Census data are readily available, cover long periods of time, and have well-documented codebooks. Increasingly, however, researchers want to gather their own data from websites, which introduces a different layer of complexity. Accessing content from web sources requires different tools and new techniques. In this workshop we will use webscraper.io to crawl a website and gather text from an online newsletter.
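The workshop itself uses webscraper.io, a point-and-click tool, so no code is required. For readers curious about what "gathering text" from a page involves underneath, the following is a minimal sketch in Python using only the standard library. It is not how webscraper.io works internally; the sample HTML, the `ParagraphExtractor` class, and the `extract_paragraphs` helper are hypothetical names invented for illustration, and the sketch assumes the newsletter text of interest sits inside `<p>` elements.

```python
from html.parser import HTMLParser


class ParagraphExtractor(HTMLParser):
    """Collect the text found inside <p> elements (illustrative only)."""

    def __init__(self):
        super().__init__()
        self._depth = 0        # greater than 0 while inside a <p> element
        self._buffer = []      # text fragments of the current paragraph
        self.paragraphs = []   # finished paragraphs, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "p" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                # Collapse runs of whitespace so the text is easier to analyze.
                text = " ".join("".join(self._buffer).split())
                if text:
                    self.paragraphs.append(text)
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_paragraphs(html):
    """Return the text of every <p> element in an HTML string."""
    parser = ParagraphExtractor()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical stand-in for a downloaded newsletter page.
SAMPLE_HTML = """
<html><body>
<h1>Library Newsletter</h1>
<p>First   story about <b>data</b> services.</p>
<div class="nav">Home | About</div>
<p>Second story.</p>
</body></html>
"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE_HTML):
        print(paragraph)
```

The sketch keeps the story text and skips the heading and navigation block, which is the same kind of selection a scraping tool is configured to make.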
Data Cleaning And Preparation
OpenRefine is a tool used to impose structure upon semi-structured data. The often-intuitive interface is a great convenience. Its powerful and extensible method for normalizing data makes OpenRefine a go-to option for quick and easy data transformations. Categorical facets can be exposed for simple data clean-up. Bulk data clustering options are so easy that the process looks like nerdy fun. Few tools are better suited for bulk data cleaning. This hands-on session will explore how OpenRefine can help with common data cleaning challenges.
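To give a sense of what clustering does, here is a rough sketch of the idea behind key-collision clustering: values that reduce to the same normalized key are grouped as likely variants of one another. This is a simplified approximation written in Python for illustration, not OpenRefine's own implementation; the `fingerprint` and `cluster` functions and the sample names are assumptions made for this example.

```python
import string
import unicodedata
from collections import defaultdict


def fingerprint(value):
    """Reduce a string to a normalized key (simplified approximation)."""
    # Trim surrounding whitespace and ignore case.
    text = value.strip().lower()
    # Fold accented characters to their closest ASCII equivalents.
    text = unicodedata.normalize("NFKD", text)
    text = text.encode("ascii", "ignore").decode("ascii")
    # Drop ASCII punctuation.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Split on whitespace, remove duplicate tokens, and sort them,
    # so that word order no longer matters.
    tokens = sorted(set(text.split()))
    return " ".join(tokens)


def cluster(values):
    """Group values that share a key; keep groups with more than one spelling."""
    groups = defaultdict(list)
    for value in values:
        groups[fingerprint(value)].append(value)
    return [group for group in groups.values() if len(set(group)) > 1]


if __name__ == "__main__":
    # Hypothetical messy column of names.
    names = ["Smith, John", "john smith", "Jane Doe", "Café Noir", "cafe noir"]
    for group in cluster(names):
        print(group)
```

In OpenRefine the equivalent step is done through the clustering dialog, where each suggested group can be reviewed and merged into a single preferred value.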
Parsing HTML & JSON, Orchestrating APIs, And Gathering Twitter Streams
As time allows, we will build on our newly developed OpenRefine knowledge to move beyond beginner web scraping techniques. Using OpenRefine, we will gather and clean data from less structured web pages. Then, following a discussion of Application Programming Interfaces (APIs), we will use the TAGS tool to gather Twitter data.
Image Credit: on previous page by Merrill College of Journalism