Building Text Collections with Web Scraping
As more primary and secondary source material is published on websites, scholars and librarians increasingly need to know how to collect it at scale. Participants in this workshop will learn how to build a textual corpus using web scraping techniques.
We will begin by discussing why web scraping is useful (for analysis, archiving, and other purposes) and the legal and ethical questions that scraping raises. With that discussion as our foundation, we will then write a program together, using Python’s Beautiful Soup library, that scrapes textual data from websites.
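To give a sense of what such a program might look like, here is a minimal sketch, not the workshop's actual code. It assumes the `beautifulsoup4` package is installed and that the text of interest sits in `<p>` elements; the helper names `fetch_html` and `extract_paragraphs`, the user-agent string, and the sample HTML are all hypothetical and chosen only for illustration.

```python
from urllib.request import Request, urlopen

from bs4 import BeautifulSoup


def fetch_html(url: str) -> str:
    """Download a page and return its HTML as text (hypothetical helper)."""
    # The user-agent string is an illustrative assumption.
    request = Request(url, headers={"User-Agent": "corpus-workshop-example"})
    with urlopen(request, timeout=30) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")


def extract_paragraphs(html: str) -> list[str]:
    """Return the text of every <p> element, skipping empty ones."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    for tag in soup.find_all("p"):
        # Join nested text fragments with a space and trim whitespace.
        text = tag.get_text(" ", strip=True)
        if text:
            paragraphs.append(text)
    return paragraphs


# Inline sample so the sketch runs without any network access.
SAMPLE_HTML = """
<html><body>
  <h1>Sample page</h1>
  <p>First <em>paragraph</em> of text.</p>
  <p>   </p>
  <p>Second paragraph.</p>
</body></html>
"""

if __name__ == "__main__":
    for paragraph in extract_paragraphs(SAMPLE_HTML):
        print(paragraph)
```

For a live page, the same extraction could be applied to the result of `fetch_html(url)`, provided the site's terms of use and robots.txt permit it. Real sites often need a more specific selector than every `<p>` element.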
Because websites and the data they hold vary widely, we will work through several examples so that participants learn strategies for scraping in a variety of contexts. Finally, once the data has been gathered, we will take a few additional steps to clean it and prepare it for the next phase of analysis. By the end of the workshop, participants will be comfortable using a variety of web scraping techniques and will be able to generate a corpus and prepare it for further analysis.
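The cleaning steps are not specified here, so the following is only a sketch of what such a pass might include, using Python's standard library. The choice of steps (Unicode normalization, removing soft hyphens, collapsing whitespace, dropping empty and duplicate documents) is an assumption, and the names `clean_text` and `build_corpus` are hypothetical.

```python
import re
import unicodedata


def clean_text(raw: str) -> str:
    """Normalize one scraped string (illustrative steps, not a fixed recipe)."""
    # NFKC folds compatibility characters, e.g. non-breaking spaces.
    text = unicodedata.normalize("NFKC", raw)
    # Remove soft hyphens, which web pages sometimes embed inside words.
    text = text.replace("\u00ad", "")
    # Collapse runs of whitespace (including newlines) to single spaces.
    text = re.sub(r"\s+", " ", text)
    return text.strip()


def build_corpus(documents: list[str]) -> list[str]:
    """Clean each document, dropping empty results and exact duplicates."""
    seen = set()
    corpus = []
    for document in documents:
        cleaned = clean_text(document)
        if cleaned and cleaned not in seen:
            seen.add(cleaned)
            corpus.append(cleaned)
    return corpus


if __name__ == "__main__":
    scraped = ["  Hello\u00a0 world\n\n", "Hello world", "   ", "Another text"]
    for document in build_corpus(scraped):
        print(document)
```

Which steps are appropriate depends on the analysis that follows: NFKC, for instance, also rewrites ligatures and some other characters, which may be unwanted when exact transcription matters.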
This workshop is open to all, including beginners.
- Wednesday, March 13, 2019
- 10:30am - 12:00pm
- Bobst Library, Rm. 617, 6th Floor