COMP 598: Fall 2020


Data science is a cornerstone for our modern age. In the right hands, the tools of data science can transform large, unstructured piles of data into key insights that inform business decisions, automate industrial processes, or deliver nuanced understanding of opportunities or challenges. Data science is quickly becoming one of the major tools that governments, companies, and even individuals use to make important decisions.

That such important and wide-ranging decisions are based on the analysis of large and unwieldy data sets highlights the importance of doing it right. This is, at its heart, the purpose of this class: to teach you the fundamentals of how to use the powerful tools of data science responsibly.

As a result, this class will take a holistic perspective on the practice of data science. The class will be technical – we will learn a diverse array of techniques ranging from data scraping to statistical modeling to visualization. The class will also be reflexive – we will develop an awareness of how even very well-intentioned analyses can completely misrepresent the real world and lead to wrong insights. We will learn how to avoid this.

My goal is for you to leave this class both capable of applying data science in the real world and careful to do so responsibly.


Despite the course number (598), this course is intended to be introductory, suitable for undergraduates who have fundamental programming skills. If you have taken COMP 250 or have the equivalent experience, you’ll have all you need.


Given the unprecedented challenges presented by the COVID-19 pandemic, the exact plan for the course is in flux. Please treat the information below as a rough guide to the content and structure of the course.


Course content will be covered through a combination of weekly recorded video lectures, assigned reading, and small group sessions.

Small group sessions. The small group sessions will also be virtual and will focus on working as a group to complete various data science tasks. These will afford an opportunity to discuss and explore data science tools and concepts together with classmates, TAs, and the professor.

Homework. The homework assignments will center on applying data science tools. Most will involve substantial coding exercises to, for example, scrape websites, design and interact with database schemas, or perform statistical analysis of a particular dataset.

Exams. We’re still waiting to learn more about how exams will be handled in the fall. It is possible that, instead of an exam, we will have a substantial data science semester project, report, and presentation.


The schedule below captures the flow we’ll follow, though the exact week-to-topic mapping is quite tentative.

Week 1: Fundamentals (The data science process)
Week 2: Infrastructure (AWS, Unix, and Jupyter)
Week 3: Question Formulation
Week 4: Data Collection (Scraping & APIs, data organization)
Week 5: Data Annotation (Keywords & manual coding)
Week 6: Data Annotation (Crowd sourcing)
Week 7: Modeling (Natural language processing)
Week 8: Modeling (Bias & statistical significance)
Week 9: Analysis (Visualization)
Week 10: Analysis (Characterizing error)
Week 12: Communication (Presentation)
Week 13: Communication (Writing)