Professors: Hanspeter Pfister (Computer Science), Joe Blitzstein (Statistics)
Welcome to CS109! The course is also listed as STAT121 and AC209, and offered through the Harvard University Extension School as distance education course CSCI E-109. All lectures and labs will be streamed live during meeting times, and the recordings will be archived.
The requirements are the same under all four course listings, except that students registered for AC209 will receive graduate-level credit, so their homework and final project will be held to a higher standard and they may be assigned additional readings.
What is this class about?
This class is about learning from data, in order to gain useful predictions and insights. Separating signal from noise presents many computational and inferential challenges, which we approach from a perspective at the interface of computer science and statistics. Through real-world examples of wide interest, we introduce methods for five key facets of an investigation:
- data munging/scraping/sampling/cleaning in order to get an informative, manageable data set;
- data storage and management in order to be able to access data - especially big data - quickly and reliably during subsequent analysis;
- exploratory data analysis to generate hypotheses and intuition about the data;
- prediction based on statistical tools such as regression, classification, and clustering; and
- communication of results through visualization, stories, and interpretable summaries.
Why take this class?
Hal Varian, Chief Economist at Google, said that:
“The sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it—that’s going to be a hugely important skill.”
More and more applications in industry, academia, and everyday life are – or should be – based on careful analysis of data. For example, consider the book Moneyball (for sports) and the work of Nate Silver (for elections as well as sports). More and more data sets are becoming available, to the point where some companies have described themselves as “drowning in data”. This presents many opportunities, but the right tools from both computer science and statistics are needed so that you can learn from the data without drowning.
Expected Learning Outcomes
After successful completion of this course, you will be able to…
- Use Python and other tools to scrape, clean, and process data
- Use data management techniques to store data locally and in cloud infrastructures
- Use statistical methods and visualization to quickly explore data
- Apply statistics and computational analysis to make predictions based on data
- Apply basic computer science concepts such as modularity, abstraction, and encapsulation to data analysis problems
- Implement data-intensive computations on cluster and cloud infrastructures using MapReduce
- Effectively communicate the outcome of data analysis using descriptive statistics and visualizations
Who should take this class?
The prerequisite for this class is programming knowledge at the level of CS 50 (or above), and statistics knowledge at the level of Stat 100 (or above). Both undergraduates and graduate students are welcome to take the course.
What is the structure of the class?
There will be three major modules, each focusing on an important arena in which data science is playing a crucial role. Within each module, we will study how to gather, explore, and analyze relevant data, as well as how to communicate the results. The major programming language used will be Python.
For each module, there will be two problem sets, in which students will learn about learning from data actively – by doing it! There will also be a major final project, in which students can explore data they are excited about in a more open-ended setting.
The three modules are as follows:
1. Prediction and elections module: how did Nate Silver predict 50 out of 50 states correctly in the 2012 U.S. presidential election, and 49 out of 50 correctly in the 2008 election? How much of that was luck? We will discuss how to find, process, combine, visualize, simulate, and summarize election-related data and questions, especially if there are conflicting polls with different reliabilities.
2. Recommendation and business analytics module: the Netflix Prize was a famous recent example of collaborative filtering: given information about which movies various users have liked and disliked, how should Netflix make recommendations for what movies a user should watch? Many other companies are interested in closely related problems. Often there is a very large but very sparse data set (e.g., there could be millions of users and tens of thousands of movies, but very few users rate more than a few hundred movies). We will explore techniques for working with such data.
3. Sampling and social network analysis module: social, biological, and technological networks are attracting interest from many fields. They are examples of relational data, in which there are measurements on pairs of individuals, not just on individuals. But computation and visualization for a network with more than, say, 50 nodes (individuals) present many challenges in scalability and interpretability. We will study techniques for drawing a sample from a network, for analyzing network data (e.g., finding “communities” and “influential” nodes in the network), and for visualizing network data.
Prerequisites: programming knowledge at the level of CS 50 or above, and statistics knowledge at the level of Stat 100 or above (Stat 110 recommended). Extension School students are required to have taken CSCI E-26 and STAT E-150 or above. Exceptions require the permission of the instructors.
There is no required textbook. Instead, we have a list of recommended readings on the web site.
Online Discussion Forum
We'll be using Piazza (www.piazza.com) as our online forum. Piazza is your main venue to ask questions, discuss problems, and help each other out. Piazza is a question-and-answer system designed to streamline class discussion outside of the classroom. It should always be your first recourse for seeking answers to your questions about the course, lecture or reading material, or the assignments. Piazza supports LaTeX, code formatting, embedding of images, and attaching of files. We will also use Piazza for all announcements, so it is important that you are signed up.
All lectures and labs will be broadcast live via Adobe Connect, including live video from the classroom, a synchronized view of the projected slides, and a chat window to ask questions and to contact fellow online students. In case the Adobe Connect server fails, you can also connect to an alternate live video feed. Both features will only work during class time. The archived videos of the lectures and labs are available about 24 hours after meeting time from the course homepage.
The staff will hold weekly office hours, either in person or via Skype for distance education students. Office hour times and locations will be listed on Piazza. Office hours provide you with an opportunity to review and discuss course materials as well as provide further guidance for your homework in a more intimate environment, with only your teaching fellow and maybe a handful of classmates present. Online students can make special arrangements directly with their assigned Teaching Fellows to meet on Skype.
The class meets twice a week for lectures and joint class activities. The class activities are designed to help you master the relevant materials, to work on your homework in groups, and to get you started on your project. The weekly schedule of lectures is posted on the course web site.
Lectures are supplemented by weekly 60- to 90-minute labs led by the teaching fellows or guest lecturers. Labs are an important aspect of the course, as we will supplement material from lectures with examples, discuss programming environments (e.g., IPython), and teach you important skills (e.g., linear regression). Lab topics are announced in the schedule.
Towards the end of the course you will work on a month-long data science project. The goal of the project is to go through the complete data science process to answer questions you have about some topic of your own choosing. You will acquire the data, design your visualizations, run statistical analysis, and communicate the results.
You will work closely with other classmates in a 3-4 person project team. You can come up with your own teams and use Piazza to find prospective team members. If you can’t find teammates, we will team you up randomly. We recognize that individual schedules, different time zones, preferences, and other constraints might limit your ability to work in a team. If this is the case, ask us for permission to work alone.
The homework is going to provide an opportunity to learn data science skills and to test your understanding of the material. See the homework as an opportunity to learn, and not to “earn points”. The homework will also be graded to reflect this objective.
The course schedule includes required readings. The goal of the reading assignments is to prepare for class, to familiarize yourself with new terminology and definitions, and to determine which part of the subject needs more attention. The homework assignments may contain questions about the mandatory readings. When answering those please be brief and to the point!
Your final grade will be determined by the number of points you collect. You can collect various amounts of points for the different parts of the class:
- Project: 50%, assessed on meeting the project criteria.
- Homework: 40%, assessed on your individual submission.
- Participation: 10%, assessed on participation on Piazza and lecture and lab attendance.
- Best Projects: We will select the top three project submissions, which will receive extra points.
Homework, project, and participation will be graded on a 5-point scale in 0.5-point increments, as follows:
5 = Exceptional / above and beyond (we will only give out these for best projects)
4 = Solid / no mistakes (or really minor)
3 = Good / some mistakes
2 = Fair / some major conceptual errors
1 = Poor / did not finish
0 = Did not participate / did not hand in
A 4 constitutes a perfect grade, and getting all 4s is equivalent to an A. A combination of 4s and 3s ends up being A- to B, and so on. Teaching Fellows will evaluate your work holistically, looking beyond mechanical correctness to the overall quality of the work. In addition to the scores, the Teaching Fellows will give detailed written feedback.
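As a rough illustration of how the percentages and the 5-point scale could combine, here is a minimal Python sketch. It assumes the percentages act as linear weights on your average score in each component, which this syllabus does not state explicitly. The names `WEIGHTS` and `weighted_score` are hypothetical, and the extra points for best projects are left out.

```python
# Component weights taken from the grading list above.
# Assumption: they are applied as linear weights to the 0-5 component scores.
WEIGHTS = {"project": 0.5, "homework": 0.4, "participation": 0.1}


def weighted_score(scores):
    """Combine per-component scores (0-5, in 0.5 steps) into one weighted score."""
    for name in WEIGHTS:
        value = scores[name]
        if not 0 <= value <= 5:
            raise ValueError(f"{name} score must be between 0 and 5")
        if not float(value * 2).is_integer():
            raise ValueError(f"{name} score must be a multiple of 0.5")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    example = {"project": 4, "homework": 3.5, "participation": 4}
    print(round(weighted_score(example), 2))
```

Under this assumption, a student with all 4s gets a weighted score of 4, which matches the statement that all 4s is equivalent to an A. The mapping from other scores to letter grades is not specified here, so the sketch stops at the numeric score.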
Project Group Peer Assessment
In the professional world, three important features affect your productivity and success: your own effort, the effort of people you depend on, and the way you work together. For this reason we have chosen a team-based approach that values all three of those features. After the team-based project you will provide an assessment of the contributions of the members of your team, including yourself. Your teammates’ assessment of your contributions and the accuracy of your self-assessment will be considered as part of your overall course evaluation.
You are welcome to discuss the course's ideas, material, and homework with others in order to better understand them, but the work you turn in must be your own (or for the project, yours and your teammates’). For example, you must write your own code, run your own data analyses, and communicate and explain the results in your own words and with your own visualizations. You may not submit the same or similar work to this course that you have submitted or will submit to another. Nor may you provide or make available solutions to homeworks to individuals who take or may take this course in the future.
During the course of the semester, you will complete a number of questionnaires online. The purpose of these questionnaires is to evaluate how well this course works for you. Your answers will only be used to provide feedback on your learning and make adjustments to the course. They will not affect your grade in any way. Unless stated otherwise, you may neither look up any information nor consult others during these questionnaires.
You must acknowledge any source code that was not written by you by mentioning the original author(s) directly in your source code (comment or header). You can also acknowledge sources in a README.txt file if you used whole classes or libraries. Do not remove any original copyright notices and headers. However, you are encouraged to use libraries, unless explicitly stated otherwise!
You may use examples you find on the web as a starting point, provided their licenses allow you to reuse them. You must cite the source properly (author, year, title, time accessed, URL) both in the source code and in any publicly visible material. You may not use existing complex combinations or large examples as a whole. For example, you may not use a ready-to-use visualization with multiple linked views. You may, however, use parts of such examples.
Missed Activities and Assignment Deadlines
Projects and homework must be turned in on time, with the exception of late days for homeworks as stated below. It is important that everybody attends and proactively participates in class and online. We understand, however, that certain factors may occasionally interfere with your ability to participate or to hand in work on time. If that factor is an extenuating circumstance, we will ask you to provide documentation directly issued by the University, and we will try to work out an agreeable solution with you (and your teammates).
Homework Deadlines and Late Days
In the weeks when homeworks are due, they will be due on Thursdays at 11:59 pm, unless otherwise announced. Each student is given six late days for homework at the beginning of the semester. A late day extends the individual homework deadline by 24 hours without penalty. No more than two late days may be used on any one assignment. If you have already used all of your late days for the semester, we will deduct 1 point for assignments less than 24 hours late, and 2 points for assignments 24-48 hours late. We do not accept or grade any homework handed in more than 48 hours after the original deadline, under any circumstances. Late days are intended to give you flexibility: you can use them for any reason – no questions asked. You don't get any bonus points for not using your late days. Also, you can only use late days for the individual homework deadlines – all other deadlines (e.g., project milestones) are hard.
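The late-day rules above can be sketched as a small Python function. This is an illustration under stated assumptions, not the staff's actual bookkeeping: it assumes late days are applied automatically, that lateness is rounded up to whole 24-hour days, and that each late day you need but no longer have costs 1 point (which reproduces the 1-point and 2-point deductions when no late days remain). The name `apply_late_policy` is hypothetical.

```python
import math

MAX_LATE_DAYS_PER_ASSIGNMENT = 2  # "No more than two late days ... on any one assignment"
CUTOFF_HOURS = 48                 # later submissions are not accepted or graded


def apply_late_policy(hours_late, late_days_remaining):
    """Return (graded, late_days_used, points_deducted) for one homework.

    Assumptions (not stated in the syllabus): late days are applied
    automatically, lateness is rounded up to whole days, and each
    uncovered late day costs 1 point.
    """
    if hours_late <= 0:
        return True, 0, 0          # on time
    if hours_late > CUTOFF_HOURS:
        return False, 0, 0         # past the hard cutoff
    days_needed = math.ceil(hours_late / 24)
    days_used = min(days_needed, late_days_remaining, MAX_LATE_DAYS_PER_ASSIGNMENT)
    points_deducted = days_needed - days_used
    return True, days_used, points_deducted


if __name__ == "__main__":
    print(apply_late_policy(10, 6))  # some late days left
    print(apply_late_policy(30, 0))  # no late days left
    print(apply_late_policy(50, 6))  # past the 48-hour cutoff
```

How a submission exactly 24 hours late is treated, and what happens when you have only one late day left but need two, are not spelled out above. The sketch picks one reasonable reading, so ask the staff if either case applies to you.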
It is very important to us that all assignments are properly graded. If you believe there is an error in your assignment grading, please submit an explanation via email to us (the staff mailing list) within 7 days of receiving the grade. No regrade requests will be accepted orally, and no regrade requests will be accepted more than 7 days after you receive the grade for the assignment.
Guest Lecture Attendance
We are lucky to have some of the world’s best researchers take time out of their busy schedules to give guest lectures. We expect all non-distance students to attend these lectures in person and to engage the speakers with questions and comments. You must send an email to the staff at least one day before a guest lecture to be excused.
If you have a documented disability (physical or cognitive) that may impair your ability to complete assignments or otherwise participate in the course and satisfy course criteria, please meet with us at your earliest convenience to identify, discuss, and document any feasible instructional modifications or accommodations. You should also contact the Accessible Education Office to request an official letter outlining authorized accommodations.
Some of the material in this course is based on other classes. We have also heavily drawn on materials and examples found online and tried our best to give credit by linking to the original source. Please contact us if you find materials where the credit is missing or that you would rather have removed.
User Notice for Copyrighted Materials on Course Websites
This course website, and much of the text, images, graphics, audio and video clips, and other content of the site (collectively, the “Content”), are protected by copyright law. In some cases, the copyright is owned by third parties, and Harvard is making the third-party Content available to you under the fair use doctrine. Fair use permits only certain limited uses of the Content. You may use the website and its Content only for your personal, noncommercial educational and scholarly use. Some Content may be provided via streaming or other means that restrict copying; you may not circumvent those restrictions. If you wish to distribute or make any of the Content available to others, or to use any Content commercially, or to use any Content for any purpose other than your personal, noncommercial educational and scholarly use, you must obtain any required permission from the copyright holder. User notice courtesy of the Harvard University Office of General Counsel.