Chat with us on the course Piazza site if you have any questions!
Data science incorporates practices from a variety of fields including statistics, machine learning, databases, distributed systems, algorithms, data warehousing, high-performance computing, and visualization. Thus, at a minimum, today's data scientist needs familiarity with: data processing and management tools like relational databases and NoSQL systems for processing large volumes of data; scripting languages like Python for quickly writing programs to clean and transform messy raw data; basic machine learning and data mining algorithms for analyzing the data; statistical computing environments for writing analysis scripts; and visualization tools for presenting and communicating analysis results.
This class will study techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Students will learn how to model and reason about data, and how to process and manipulate it in various ways. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, MapReduce, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets.
There will be six programming assignments (PAs) that will explore Database Design and SQL, MapReduce, basic Machine Learning, Data Integration, and Data Visualization.
Course Prerequisites
CS 300 is absolutely essential. CS 400 might be helpful.
Programming Tools
For the programming assignments we will use a virtual machine running Ubuntu, which you can download here. We have already installed the required Python packages and other software in this virtual machine. You are not required to use it, but during the semester we will only provide support for the virtual machine's environment.
If you are not proficient with Python, we recommend that you use the resources described here.
Textbook
Assignments
The reading material listed below is optional, but you are highly encouraged to read it.
# | Date | Topic | Lecture Materials | Extra Reading Material | Assignments |
---|---|---|---|---|---|
Introduction to Data Science | |||||
1 | 1/23 | Intro to Data Science and Class Logistics/Overview | Lecture 1 (pdf) | Chapter 1 from "Doing Data Science" and linked material in the slides | |
2 | 1/25 | Statistical Inference and Exploratory Data Analysis | Activity Files | Chapter 2 from "Doing Data Science" | PA 0: VirtualBox installation and setup |
3 | 1/28 | Getting Started with Data Analytics: In-Class Demonstration | In-class demonstration of PA 1 | | PA 1: Twitter Sentiment Analysis. Due on February 7th. |
Relational Databases and Relational Algebra | |||||
4 | 2/1 | Principles of Data Management | Lecture 3 (pdf) | ||
5 | 2/4 | Relational Algebra | Lecture 4 (pdf) | Chapters 3, 4 from Cow book (without relational calculus) | |
6 | 2/6 | SQL for Data Science | Lecture 5 (pdf), Lecture 5 in Jupyter Notebook format, Activities, Notebook data: dataset_1.db | | PA 2: SQL for Data Science Assignment. Due on February 19th. |
7 | 2/8 | No Class | Theo out of town. | ||
8 | 2/11 | No Class | Theo out of town. | ||
9 | 2/13 | Key Principles of RDBMS | Lecture 6 (pdf) | ||
9 | 2/15 | Wrapping up SQL and Databases | Lecture 7 (continue from Lecture 6) | ||
The MapReduce Model and NoSQL Systems |||||
10 | 2/18 | Reasoning about Scale & The MapReduce Abstraction | Lecture 8 (pdf) | ||
11 | 2/20 | Algorithms in MapReduce 1 | Lecture 9 (pdf) | ||
12 | 2/22 | Algorithms in MapReduce 2 | Lecture 10 (pdf) | | PA 3: MapReduce Assignment. Due on March 7th. |
13 | 2/25 | No Class | Theo in the Bay Area. | ||
14 | 2/27 | Spark | Lecture 11 (pdf) | ||
15 | 3/1 | NoSQL Systems: Key-Value Stores and Document Stores | Lecture 12 (pdf) | ||
16 | 3/4 | Midterm Review 1 | Midterm Review 1 (pdf) | ||
17 | 3/6 | Midterm Review 2 | Midterm Review 2 (pdf) | ||
18 | 3/8 | Midterm | |||
Predictive Analytics | |||||
19 | 3/11 | Statistical Inference | Lecture 13 (pdf) | ||
20 | 3/13 | Sampling | Lecture 14 (pdf) | ||
21 | 3/15 | Bayesian Methods | Lecture 15 (pdf) | ||
- | 3/16 - 3/24 | Spring Break | |||
22 | 3/25 | Intro to Machine Learning and Decision Trees | Lecture 16 (pdf) | ||
23 | 3/27 | Wrap-up of Lecture 16; Linear Classifiers and Support Vector Machines | Lecture 17 (pdf) | ||
24 | 3/29 | Wrap up Lectures 16 and 17 | Lecture 17 (continued) | ||
25 | 4/1 | Evaluation of Machine Learning Models | Lecture 18 (pdf) | ||
26 | 4/3 | Other Learning Methods: Unsupervised Learning & Ensemble Learning | Lecture 19 (pdf) | Helpful reading for all ML lectures: Python Machine Learning 2nd Edition | PA 4: Classification Assignment. Due on April 18th. For this assignment you need to participate in the following Kaggle competition. Go here to sign up and participate in the in-class competition that corresponds to this problem. You can find a discussion of this problem in the original Kaggle competition here. You can implement your solution using machine learning methods for classification from the scikit-learn library (a minimal example sketch appears after the schedule). Please use your student ID as your team name for this competition. Please upload a zip file with the source code of your solution on Canvas. |
27 | 4/5 | Continue with Unsupervised Learning and Ensemble Learning | Continue lecture 19. | ||
28 | 4/8 | No Class | Theo out of town (at the National Science Foundation in DC) | ||
28 | 4/10 | Optimization/Gradient Descent | Lecture 20 (pdf) | ||
29 | 4/12 | Optimization Continued | Lecture 20 continued | ||
Information Extraction and Data Integration | |||||
29 | 4/15 | Information Extraction | Lecture 21 (pdf) | Information Extraction: here | |
30 | 4/17 | Data Integration and Entity Resolution | Lecture 22 (pdf) | Tutorial by Lise Getoor: here | |
31 | 4/19 | Data Cleaning | Lecture 23 (pdf) | Tutorial on Data Cleaning: here; Kaggle tutorial on data cleaning: here | |
Communicating Insights | |||||
32 | 4/22 | Intro to Visualization | Lecture 24 (pdf) | | PA 5: Data Preparation Assignment. Due on May 1st. For this assignment you need to participate in the following Kaggle competition. Go here to sign up and participate in the in-class competition that corresponds to this problem. For this task you will need to solve the problem of entity matching over the provided records of products. You can use any technique you want: rule-based entity matching, machine learning-based entity matching, etc. You need to implement your solution in Python and provide us with an IPython notebook with your code. Please use your student ID as your team name for this competition. Please upload a zip file with the source code of your solution on Canvas. You can find a tutorial on fuzzy string matching here and a convenient fuzzy string matching library here. Feel free to use more advanced entity matching tools (e.g., py_entitymatching). A minimal entity-matching sketch appears after the schedule. |
33 | 4/24 | Data Visualization/EDA | Lecture 25 (pdf) | | |
35 | 4/26 | Data Privacy | Lecture 26 (pdf) | ||
Exam Review | |||||
37 | 4/29 | Final Review | Final Review part 1 | Sample Questions Part 1 (solutions) | |
38 | 5/1 | Final Review | Final Review part 2 | ||
38 | 5/3 | Final Review | Final Review part 3 | ||
37 | 5/6 | Bonus Project | Bonus Project | | The last bonus project is open ended. We ask that you create some cool visualizations with data from the city of Milwaukee, which posts city-related data at this portal: https://data.milwaukee.gov. There you can find data about public safety, housing and property, etc. Choose one of these datasets and create a cool visualization from the raw data. You can draw inspiration from the city of New York and the different visualization projects people have been doing there: https://opendata.cityofnewyork.us/projects/. Let's prove that Wisconsin is so much cooler than NY! You will have until May 15th, 23:59 to submit an IPython notebook with your visualizations. You are free to use any data from the city of Milwaukee you want. This bonus project is worth an extra 10% of the overall grade; however, to get the full grade you will have to create something as impressive as the projects posted in the NY portal! Partial credit will also be considered. Given that this is a bonus project, you won't have any late days. Also, you really need to impress us :) A minimal visualization sketch appears after the schedule. |
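For PA 4, the sketch below shows one possible way to train a scikit-learn classifier and produce a Kaggle-style submission file. The file names (train.csv, test.csv, submission.csv) and column names (label, Id, Prediction) are placeholders, not the actual competition format; adapt them to the data you download from Kaggle.

```python
# Minimal scikit-learn classification sketch for PA 4 (placeholder file and
# column names; adapt them to the actual Kaggle competition data).
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

train = pd.read_csv("train.csv")   # assumed training file
test = pd.read_csv("test.csv")     # assumed test file

# Assume the target column is called "label"; everything else is a feature.
X = train.drop(columns=["label"])
y = train["label"]

# Baseline model plus a quick cross-validated accuracy estimate.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())

# Fit on all training data and write a Kaggle-style submission file.
clf.fit(X, y)
pd.DataFrame({"Id": test.index, "Prediction": clf.predict(test)}).to_csv(
    "submission.csv", index=False)
```

Any other scikit-learn classifier (e.g., LogisticRegression or GradientBoostingClassifier) can be swapped in the same way.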
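For PA 5, the sketch below illustrates a simple rule-based entity matching baseline using only the Python standard library. The record layout (dictionaries with "id" and "title" fields) and the 0.7 similarity threshold are assumptions for illustration; the fuzzy matching library linked in the schedule can be substituted for SequenceMatcher.

```python
# Minimal rule-based entity matching sketch for PA 5.
# Record format and threshold are illustrative assumptions, not the PA 5 spec.
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a fuzzy similarity score in [0, 1] for two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_products(left, right, threshold=0.7):
    """Pair each left record with its best-scoring right record above threshold."""
    matches = []
    for l in left:
        best = max(right, key=lambda r: similarity(l["title"], r["title"]))
        score = similarity(l["title"], best["title"])
        if score >= threshold:
            matches.append((l["id"], best["id"], round(score, 3)))
    return matches

# Tiny illustrative example with made-up product records.
left = [{"id": 1, "title": "Apple iPhone 12 64GB Black"}]
right = [{"id": "a", "title": "iPhone 12 (64 GB) - Black"},
         {"id": "b", "title": "Samsung Galaxy S21"}]
print(match_products(left, right))
```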
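For the bonus project, a minimal starting point with pandas and matplotlib is sketched below; the CSV file name and the "category" column are placeholders standing in for whichever dataset you export from https://data.milwaukee.gov.

```python
# Minimal visualization sketch for the bonus project (placeholder file and
# column names; adjust them to the dataset you download from the portal).
import pandas as pd
import matplotlib.pyplot as plt

# Load a dataset exported from the Milwaukee open data portal as CSV.
df = pd.read_csv("milwaukee_dataset.csv")

# Example: count records per category and plot the top 10 as a bar chart
# (assumes the dataset has a categorical column named "category").
counts = df["category"].value_counts().head(10)
counts.plot(kind="bar")
plt.title("Top 10 categories in the Milwaukee dataset")
plt.xlabel("Category")
plt.ylabel("Number of records")
plt.tight_layout()
plt.show()
```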
Office Hours
Theo: Monday, Friday 3:15 - 4:15 pm (after class), Wednesday 1:00 - 2:00 pm, or by appointment @ Room CS4361
Huawei Wang: Tuesday, Thursday 9:00 am - 10:00 am @ Room 1301
Frank Zou: Tuesday, Thursday 4:00 pm - 5:00 pm @ Room CS7354
Note: the schedule of office hours may change from time to time, in which case an announcement will be made on the course Piazza.
Theo (thodrek@cs.wisc.edu)
Huawei Wang (hwang665@wisc.edu)
Frank Zou (szou28@wisc.edu)
We encourage you to discuss the Programming Assignments with other students; it's fine to discuss overall strategy and collaborate with a partner or in a small group, as both giving and receiving advice will help you to learn.
However, you must write your own solutions to all of the problems, and you must cite all people you worked with.
It's not OK to share code or write code collaboratively. (This includes posting and/or sharing your code publicly, such as on GitHub!)
Failure to follow these rules will be considered a violation of the University of Wisconsin Honor Code.
If you consult any resources outside of the materials provided in class, you must cite these sources. We reserve the right to assign a penalty if your answers are substantially derivative, but, as long as you provide appropriate citations, we will not consider this an Honor Code violation.