CS639 Data Management for Data Science

Birge Hall 145, Mon/Wed/Fri 2:25-3:15pm

Chat with us on the course Piazza site if you have any questions!

Description

Data science incorporates practices from a variety of fields including statistics, machine learning, databases, distributed systems, algorithms, data warehousing, high-performance computing, and visualization. Thus, at a minimum, today's data scientist needs to have familiarity with: data processing and management tools like relational databases and NoSQL for processing large volumes of data; scripting languages like Python for quickly writing programs to clean and transform messy raw data; basic machine learning and data mining algorithms for analyzing the data; statistical computing environments for writing analysis scripts; and visualization tools for presentation and communication of analysis results. This class will study techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Students will learn how to model and reason about data, and how to process and manipulate it in various ways. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, MapReduce, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets.

There will be six programming assignments (PAs) that will explore Database Design and SQL, MapReduce, basic Machine Learning, Data Integration, and Data Visualization.

Announcements

NEW! If you want to dive deeper into ML and the theory behind it, read the notes by Percy Liang.
Do you want to see what it means to be doing data science for real? We have several opportunities for undergraduate research in my lab and we are always willing to work with highly motivated students! If you are interested in research opportunities do not hesitate to contact me. The best way to engage is by shooting me an email with your CV and then dropping by my office hours. There are many exciting projects ranging from data cleaning to knowledge base construction to training deep learning models for data imputation to front-end design for scientific applications.
You can find more information about using Jupyter notebooks here.
You can find more information about Python here.

Class Logistics

Course Prerequisites

CS 300 is absolutely essential. CS 400 might be helpful.

Programming Tools

For the programming assignments we will use a virtual machine running Ubuntu. You can download the class virtual machine here. We have already installed several required Python packages and other software in this virtual machine. You are not required to use this virtual machine, but during the semester we will only provide support for the virtual machine's environment.
If you are not proficient with Python, we recommend that you use the resources described here.

Textbook

There is no required textbook for the course. Lecture slides will be self-contained. Additional readings will be posted together with the lecture slides.
The following two books are useful references. Both are available for free online at Safari Books Online if you are on the UW network.
- Python for Data Analysis, Wes McKinney, 2012
- Doing Data Science, Cathy O'Neil and Rachel Schutt, 2013

Assignments

All programming assignments are individual assignments. Each student must send us an individual submission.
Programming assignments are due by the end of day on the indicated dates.
All assignments will be submitted via Canvas.
You are allowed 5 free late days to use throughout the semester. One late day equals one 24-hour period after the due date of the assignment. Once you have used your late days, there will be a 20% penalty for each day an assignment is late.
The honor code described below will be enforced for all assignments.

Lecture Plan

The reading material listed below is optional, but you are highly encouraged to read it.

#	Date	Topic	Lecture Materials	Extra Reading Material	Assignments
Introduction to Data Science
1	1/23	Intro to Data Science and Class Logistics/Overview	Lecture 1 (pdf)	Chapter 1 from "Doing Data Science" and linked material in the slides
2	1/25	Statistical Inference and Exploratory Data Analysis	Lecture 2 (pdf) Activity Files: Notebook Data	Chapter 2 from "Doing Data Science"	PA 0: Virtual Box installation and setup
3	1/28	Getting Started with Data Analytics: In-Class Demonstration	In-class demonstration of PA1		PA 1: Twitter Sentiment Analysis. Due on February 7th. PA1 Jupyter notebook (pdf version) PA1 files (zip format)
Relational Databases and Relational Algebra
4	2/1	Principles of Data Management	Lecture 3 (pdf)	What goes around comes around, M. Stonebraker, J. Hellerstein Chapter 1 from Cow book
5	2/4	Relational Algebra	Lecture 4 (pdf)	Chapters 3, 4 from Cow book (without relational calculus)
6	2/6	SQL for Data Science	Lecture 5 (pdf) Lecture 5 in Jupyter Notebook format Activities: 2-1 (Solutions) 2-2 (Solutions) 2-3 (Solutions) 3-1 (Solutions) Notebook data: dataset_1.db	Greenspun, SQL for Nerds SQL w3 tutorial (Exercises with Solutions) SQL Tutorial by SQL Zoo	PA 2: SQL for Data Science Assignment. Due on February 19th. PA2 Jupyter notebook PA2 database Submission Template
7	2/8	No Class	Theo out of town.
8	2/11	No Class	Theo out of town.
9	2/13	Key Principles of RDBMS	Lecture 6 (pdf)	An Overview of Query Optimization in Relational Systems The Transaction Concept: Virtues and Limitations Overview of Transaction Management
9	2/15	Wrapping up SQL and Databases	Lecture 7 (continue from Lecture 6)
The MapReduce Model and NoSQL Systems
10	2/18	Reasoning about Scale & The MapReduce Abstraction	Lecture 8 (pdf)	MapReduce: Simplified Data Processing on Large Clusters
11	2/20	Algorithms in MapReduce 1	Lecture 9 (pdf)	MapReduce Algorithms from Cloudera Basic MapReduce Algorithm Design MapReduce Algorithm Design
12	2/22	Algorithms in MapReduce 2	Lecture 10 (pdf)		PA 3: MapReduce Assignment. Due on March 7th. Programming Assignment 3 Zip File
13	2/25	No Class	Theo in the Bay Area.
14	2/27	Spark	Lecture 11 (pdf)
15	3/1	NoSQL Systems: Key-Value Stores and Document Stores	Lecture 12 (pdf)
16	3/4	Midterm Review 1	Midterm Review 1 (pdf)
17	3/6	Midterm Review 2	Midterm Review 2 (pdf)
18	3/8	Midterm
Predictive Analytics
19	3/11	Statistical Inference	Lecture 13 (pdf)
20	3/13	Sampling	Lecture 14 (pdf)
21	3/15	Bayesian Methods	Lecture 15 (pdf)
-	3/16 - 3/24	Spring Break
22	3/25	Intro to Machine Learning and Decision Trees	Lecture 16 (pdf)
23	3/27	Wrap up from Lecture 16 and Linear Classifiers and Support Vector Machines	Lecture 17 (pdf)
24	3/29	Wrap up Lectures 16 and 17	Lecture 17 (continued)
25	4/1	Evaluation of Machine Learning Models	Lecture 18 (pdf)
26	4/3	Other Learning Methods: Unsupervised Learning & Ensemble Learning	Lecture 19 (pdf)	Helpful reading for all ML lectures: Python Machine Learning 2nd Edition	PA 4: Classification Assignment. Due on April 18th. For this assignment you need to participate in the following Kaggle competition. Go here to sign up and participate in the in-class competition that corresponds to this problem. You can find a discussion of this problem in the original Kaggle competition here. You can implement your solution using machine learning methods for classification from the scikit-learn library. Please use your student ID as your team name for this competition. Please upload a zip file with the source code of your solution on Canvas.
27	4/5	Continue with Unsupervised Learning and Ensemble Learning	Continue lecture 19.
28	4/8	No Class	Theo out of town (at the National Science Foundation in DC)
28	4/10	Optimization/Gradient Descent	Lecture 20 (pdf)
29	4/12	Optimization Continued	Lecture 20 continued
Information Extraction and Data Integration
29	4/15	Information Extraction	Lecture 21 (pdf)	Information Extraction: here
30	4/17	Data Integration and Entity Resolution	Lecture 22 (pdf)	Tutorial by Lise Getoor: here
31	4/19	Data Cleaning	Lecture 23 (pdf)	Tutorial on Data Cleaning: here Kaggle tutorial on data cleaning: here
Communicating Insights
32	4/22	Intro to Visualization	Lecture 24 (pdf)		PA 5: Data Preparation Assignment. Due on May 1st. For this assignment you need to participate in the following Kaggle competition. Go here to sign up and participate in the in-class competition that corresponds to this problem. For this task you will need to solve the problem of entity matching over the provided records of products. You can use any technique you want (rule-based entity matching, machine-learning-based entity matching, etc.). You need to implement your solution in Python and provide us with an IPython notebook with your code. Please use your student ID as your team name for this competition. Please upload a zip file with the source code of your solution on Canvas. You can find a tutorial on fuzzy string matching here. You can find a convenient fuzzy string matching library here. Feel free to use more advanced entity matching tools (e.g., py_entitymatching).
33	4/24	Data Visualization/EDA	Lecture 25 (pdf)	Vartak et al. SeeDB, VLDB 2015 Enabling Data Science for the Majority (Read but no QCRs) Siddiqui et al. ZenVisage, VLDB 2017 (Optional)
35	4/26	Data Privacy	Lecture 26 (pdf)
Exam Review
37	4/29	Final Review	Final Review part 1	Sample Questions Part 1 (solutions)
38	5/1	Final Review	Final Review part 2
38	5/3	Final Review	Final Review part 3
37	5/6	Bonus Project	Bonus Project		The bonus project is open ended. We ask that you create some cool visualizations with the data of the city of Milwaukee. The city of Milwaukee has been posting city-related data in this portal: https://data.milwaukee.gov. There you can find data about public safety, housing, property, etc. We ask that you choose one of these datasets and create a cool visualization with that raw data. You can draw inspiration from the city of New York and the different visualization projects people have been doing there: https://opendata.cityofnewyork.us/projects/. Let's prove that Wisconsin is so much cooler than NY! You will have until May 15th 23:59 to submit an IPython notebook with your visualizations. You are free to use any data from the city of Milwaukee you want. This bonus project is worth an extra 10% of the overall grade. However, to get the full grade you will have to create something as impressive as the projects posted in the NY portal! Partial credit will also be considered. Given that this is a bonus project, you won't have any late days. Also, you really need to impress us :)

Midterm Exam

The midterm exam will be held in class on March 8th from 2:25pm to 3:15pm. The location will be:

Birge Hall 145

Final Exam

The date of the final exam is TBD.

Grading

Programming Assignments	45%
Midterm	20%
Final	35%
Bonus Project	10% (This is Bonus and will be awarded to impressive projects)
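
As a rough illustration of how these weights combine, here is a minimal Python sketch. It assumes every component is scored on a 0-100 scale and that the bonus is added on top of the regular 100%. The function name `overall_grade` and the dictionary keys are hypothetical, and this is not an official grade calculator.

```python
# Weights from the grading table; scores are assumed to be on a 0-100 scale.
WEIGHTS = {"programming_assignments": 0.45, "midterm": 0.20, "final": 0.35}
BONUS_WEIGHT = 0.10  # assumed to be added on top of the regular 100%


def overall_grade(scores, bonus_score=0.0):
    """Weighted course grade (hypothetical helper, for illustration only)."""
    base = sum(WEIGHTS[part] * scores[part] for part in WEIGHTS)
    return base + BONUS_WEIGHT * bonus_score


example = {"programming_assignments": 90, "midterm": 80, "final": 70}
print(overall_grade(example))                   # grade without the bonus project
print(overall_grade(example, bonus_score=100))  # grade with full bonus credit
```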

Office Hours

Theo: Monday, Friday 3:15 - 4:15 pm (after class), Wednesday 1:00 - 2:00 pm, or by appointment @ Room CS4361

Huawei Wang: Tuesday, Thursday 9:00 am - 10:00 am @ Room 1301

Frank Zou: Tuesday, Thursday 4:00 pm - 5:00 pm @ Room CS7354

Note: the schedule of office hours may change from time to time, in which case an announcement will be made on the course Piazza.

Staff


	Theo (thodrek@cs.wisc.edu)
	Huawei Wang (hwang665@wisc.edu)
	Frank Zou (szou28@wisc.edu)

Late Policy

You are allowed 5 free late days to use throughout the semester. One late day equals one 24-hour period after the due date of the assignment. Once you have used your late days, there will be a 20% penalty for each day an assignment is late.
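
The arithmetic of this policy can be sketched in Python as follows. The policy text does not spell out every detail, so this sketch makes several assumptions: any partial 24-hour period counts as a full late day, free late days are used up before any penalty applies, and the 20% penalty is taken off the earned score for each penalized day, never dropping below zero. The function names are hypothetical.

```python
import math

FREE_LATE_DAYS = 5      # from the policy above
PENALTY_PER_DAY = 0.20  # from the policy above


def late_days(hours_late):
    """Number of late days, assuming partial 24-hour periods round up."""
    return max(0, math.ceil(hours_late / 24))


def apply_late_policy(raw_score, hours_late, free_days_left=FREE_LATE_DAYS):
    """Return (adjusted score, free late days remaining).

    Assumes free days are consumed first and that each remaining late
    day removes 20% of the earned score, with a floor of zero.
    """
    days = late_days(hours_late)
    free_used = min(days, free_days_left)
    penalized_days = days - free_used
    factor = max(0.0, 1.0 - PENALTY_PER_DAY * penalized_days)
    return raw_score * factor, free_days_left - free_used


# 30 hours late with all free days available: two free days used, no penalty.
print(apply_late_policy(100, 30))
# 30 hours late with one free day left: one penalized day.
print(apply_late_policy(100, 30, free_days_left=1))
```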

Honor Code and Collaboration Policy

We encourage you to discuss the Programming Assignments with other students; it's fine to discuss overall strategy and collaborate with a partner or in a small group, as both giving and receiving advice will help you to learn.

However, you must write your own solutions to all of the problems, and you must cite all people you worked with.

It's not OK to share code or write code collaboratively. (This includes posting and/or sharing your code publicly, such as on GitHub!)

If you do not follow these rules, we will consider it a violation of the University of Wisconsin Honor Code.

If you consult any resources outside of the materials provided in class, you must cite these sources. We reserve the right to assign a penalty if your answers are substantially derivative, but, as long as you provide appropriate citations, we will not consider this an Honor Code violation.