COMP 598: Fall 2020
Overview
Data science is a cornerstone for our modern age. In the right hands, the tools of data science can transform large, unstructured piles of data into key insights that inform business decisions, automate industrial processes, or deliver nuanced understanding of opportunities or challenges. Data science is quickly becoming one of the major tools that governments, companies, and even individuals use to make important decisions.
That such important and wide-ranging decisions are based on the analysis of large and unwieldy data sets highlights the importance of doing it right. This is, at its heart, the purpose of this class: to teach you the fundamentals of how to use the powerful tools of data science responsibly.
As a result, this class will take a holistic perspective on the practice of data science. The class will be technical – we will learn a diverse array of techniques ranging from data scraping to statistical modeling to visualization. The class will also be reflexive – we will develop an awareness of how even very well-intentioned analyses can completely misrepresent the real world and lead to wrong insights. We will learn how to avoid this.
My goal is for you to leave this class both capable of applying data science in the real world and careful to do so responsibly.
Requirements
Despite the course number (598), this course is intended to be introductory, suitable for undergraduates who have fundamental programming skills. If you have taken COMP 250 or have the equivalent experience, you’ll have all you need.
Syllabus
Given the unprecedented challenges presented by the COVID-19 pandemic, the exact plan for the course is in flux. Please treat the information below as a rough guide to the content and structure of the course.
Structure
Course content will be covered through a combination of weekly recorded video lectures, assigned reading, and small group sessions.
Small group sessions. The small group sessions will also be virtual and will focus on working as a group to complete various data science tasks. These will afford an opportunity to discuss and explore data science tools and concepts together with classmates, TAs, and the professor.
Homework. The homework assignments will center on applying data science tools. Most will involve substantial coding exercises to, for example, scrape websites, design and interact with database schemas, or perform statistical analysis of a particular dataset.
Exams. We’re still waiting to learn more about how exams will be handled in the fall. It is possible that we will not have an exam and will instead have a substantial data science semester project, report, and presentation.
Content
The schedule below captures the flow we’ll follow, though the exact week-to-topic mapping is quite tentative.
Week | Topic |
Week 1 | Fundamentals (The data science process) |
Week 2 | Infrastructure (AWS, Unix, and Jupyter) |
Week 3 | Question Formulation |
Week 4 | Data Collection (Scraping & APIs, data organization) |
Week 5 | Data Annotation (Keywords & manual coding) |
Week 6 | Data Annotation (Crowd sourcing) |
Week 7 | Modeling (Natural language processing) |
Week 8 | Modeling (Bias & statistical significance) |
Week 9 | Analysis (Visualization) |
Week 10 | Analysis (Characterizing error) |
Week 12 | Communication (Presentation) |
Week 13 | Communication (Writing) |