Practice Exercise #6 – Scraping data from HTML pages

CMN 316: Python + Data Tools

Assignment Summary

This exercise builds on the previous week’s Pandas II activity. In this scenario, we will scrape an HTML (HyperText Markup Language) page from the web using Beautiful Soup. We will freeze a “snapshot” of this page into a Python program, and then work with the data as a Pandas dataframe. While we won’t analyze the data in depth just yet, we will clean it up.
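To make the idea of a frozen “snapshot” concrete, here is a minimal sketch of the scraping step. It is not the course template: the HTML string, the show data in it, and the variable names (`snapshot_html`, `rows`, `df`) are made-up placeholders, and it assumes the `beautifulsoup4` and `pandas` packages are installed. A real Wikipedia page will be much larger and may contain several tables.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Hypothetical snapshot: a tiny hand-written stand-in for the HTML of a saved page.
# Pasting the HTML into the program means the data no longer changes if the live page does.
snapshot_html = """
<table class="wikitable">
  <tr><th>No.</th><th>Title</th><th>Air date</th></tr>
  <tr><td>1</td><td>Episode A</td><td>January 1, 2000</td></tr>
  <tr><td>2</td><td>Episode B</td><td>January 8, 2000</td></tr>
</table>
"""

# Parse the snapshot with Python's built-in HTML parser.
soup = BeautifulSoup(snapshot_html, "html.parser")

# Find the first table with the "wikitable" class (an assumption about the page layout).
table = soup.find("table", class_="wikitable")

# Collect the text of every header and data cell, one list per table row.
rows = []
for tr in table.find_all("tr"):
    cells = [cell.get_text() for cell in tr.find_all(["th", "td"])]
    rows.append(cells)

# Load the rows into a dataframe; the header row is still ordinary data at this point.
df = pd.DataFrame(rows)
print(df)
```

At this stage the column labels are just 0, 1, 2 and the header text sits in the first row, which is why a clean-up step follows.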

Today’s Assignment

Task 1: Select the Wikipedia page for a TV show of your choosing and scrape its data into a Pandas dataframe using the Beautiful Soup template below. Please be sure to ‘make a copy’ of this example (or just copy and paste it); do not edit it directly!

Task 2: Clean up the data by removing the “\n” newline characters and setting up your dataframe columns (as per the template below). Try commenting out steps in the program (turning them “off”) to see how the Python code is preparing the data.
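The clean-up in Task 2 might look something like the sketch below. This is one possible approach and not necessarily how the course template does it. The `raw_rows` data is invented for illustration and simply mimics scraped cells that end in “\n”.

```python
import pandas as pd

# Hypothetical raw rows, shaped like scraped table cells with trailing "\n" characters.
raw_rows = [
    ["No.\n", "Title\n", "Air date\n"],
    ["1\n", "Episode A\n", "January 1, 2000\n"],
    ["2\n", "Episode B\n", "January 8, 2000\n"],
]
df = pd.DataFrame(raw_rows)

# Step 1: remove every "\n" from every cell.
# regex=True makes replace() match inside the text instead of requiring a whole-cell match.
df = df.replace("\n", "", regex=True)

# Step 2: use the first row as the column names.
df.columns = df.iloc[0].tolist()

# Step 3: drop that first row now that it is the header, and renumber the remaining rows.
df = df.drop(index=0).reset_index(drop=True)

print(df)
```

Try putting a `#` in front of the Step 1 line and running the program again: the “\n” characters stay in the cells and also end up in the column names, which shows what that step contributes.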

Template

Resources
