CS639 Data Management for Data Science

Birge Hall 145, Mon/Wed/Fri 2:25-3:15pm

Chat with us on the course Piazza site if you have any questions!

Description

Data science incorporates practices from a variety of fields including statistics, machine learning, databases, distributed systems, algorithms, data warehousing, high-performance computing, and visualization. Thus, at a minimum, today's data scientist needs to have familiarity with: data processing and management tools like relational databases and NoSQL for processing large volumes of data; scripting languages like Python for quickly writing programs to clean and transform messy raw data; basic machine learning and data mining algorithms for analyzing the data; statistical computing environments for writing analysis scripts; and visualization tools for presentation and communication of analysis results. This class will study techniques and systems for ingesting, efficiently processing, analyzing, and visualizing large data sets. Students will learn how to model and reason about data, and how to process and manipulate it in various ways. Topics will include data cleaning, data integration, scalable systems (relational databases, NoSQL, MapReduce, etc.), analytics (data cubes, scalable statistics and machine learning), and scalable visualization of large data sets.

There will be six programming assignments (PAs) that will explore Database Design and SQL, MapReduce, basic Machine Learning, Data Integration, and Data Visualization.

Announcements
  • NEW! If you want to dive deeper into ML and the theory behind it, read the notes by Percy Liang.
  • Do you want to see what it means to be doing data science for real? We have several opportunities for undergraduate research in my lab and we are always willing to work with highly motivated students! If you are interested in research opportunities do not hesitate to contact me. The best way to engage is by shooting me an email with your CV and then dropping by my office hours. There are many exciting projects ranging from data cleaning to knowledge base construction to training deep learning models for data imputation to front-end design for scientific applications.
  • You can find more information about using Jupyter notebooks here.
  • You can find more information about Python here.
Class Logistics

Course Prerequisites

  • CS 300 is absolutely essential. CS 400 might be helpful.

Programming Tools

  • For the programming assignments we will utilize a virtual machine running Ubuntu. You can download the class virtual machine here. We have already installed several required Python packages and software in this virtual machine. You are not required to use this virtual machine but we will only provide support for the environment of the virtual machine during the semester.

  • If you are not proficient with Python, we recommend that you use the resources described here.

Textbook

  • There is no required textbook for the course. Lecture slides will be self-contained. Additional readings will be posted together with the lecture slides.
  • The following two books can be useful to consult. Both books are available for free online at Safari Books Online if you are on the UW network.

Assignments

  • All programming assignments are individual assignments. Each student must send us an individual submission.
  • Programming assignments are due by the end of day on the indicated dates.
  • All assignments will be submitted via Canvas.
  • You are allowed 5 free late days to use throughout the semester. One late day equals one 24-hour period after the due date of the assignment. Once you have used your late days, there will be a 20% penalty for each day an assignment is late.
  • The honor code described below will be enforced for all assignments.

Lecture Plan

The reading material listed below is optional, but you are highly encouraged to read it.


# Date Topic Lecture Materials Extra Reading Material Assignments
Introduction to Data Science
1 1/23 Intro to Data Science and Class Logistics/Overview Lecture 1 (pdf) Chapter 1 from "Doing Data Science" and linked material in the slides
2 1/25 Statistical Inference and Exploratory Data Analysis

Lecture 2 (pdf)

Activity Files:

Chapter 2 from "Doing Data Science" PA 0: Virtual Box installation and setup
3 1/28 Getting Started with Data Analytics: In-Class Demonstration In-class demonstration of PA1 PA 1: Twitter Sentiment Analysis. Due on February 7th.

PA1 Jupyter notebook (pdf version)

PA1 files (zip format)

Relational Databases and Relational Algebra
4 2/1 Principles of Data Management Lecture 3 (pdf)
5 2/4 Relational Algebra Lecture 4 (pdf) Chapters 3, 4 from Cow book (without relational calculus)
6 2/6 SQL for Data Science Lecture 5 (pdf)

Lecture 5 in Jupyter Notebook format

Activities:
Notebook data: dataset_1.db
PA 2: SQL for Data Science Assignment. Due on February 19th.

PA2 Jupyter notebook

PA2 database

Submission Template

7 2/8 No Class Theo out of town.
8 2/11 No Class Theo out of town.
9 2/13 Key Principles of RDBMS Lecture 6 (pdf)
9 2/15 Wrapping up SQL and Databases Lecture 7 (continue from Lecture 6)
The MapReduce Model and NoSQL Systems
10 2/18 Reasoning about Scale & The MapReduce Abstraction Lecture 8 (pdf)
11 2/20 Algorithms in MapReduce 1 Lecture 9 (pdf)
12 2/22 Algorithms in MapReduce 2 Lecture 10 (pdf) PA 3: MapReduce Assignment. Due on March 7th.

Programming Assignment 3 Zip File

13 2/25 No Class Theo in the Bay Area.
14 2/27 Spark Lecture 11 (pdf)
15 3/1 NoSQL Systems: Key-Value Stores and Document Stores Lecture 12 (pdf)
16 3/4 Midterm Review 1 Midterm Review 1 (pdf)
17 3/6 Midterm Review 2 Midterm Review 2 (pdf)
18 3/8 Midterm
Predictive Analytics
19 3/11 Statistical Inference Lecture 13 (pdf)
20 3/13 Sampling Lecture 14 (pdf)
21 3/15 Bayesian Methods Lecture 15 (pdf)
- 3/16 - 3/24 Spring Break
22 3/25 Intro to Machine Learning and Decision Trees Lecture 16 (pdf)
23 3/27 Wrap up from Lecture 16 and Linear Classifiers and Support Vector Machines Lecture 17 (pdf)
24 3/29 Wrap up Lectures 16 and 17 Lecture 17 (continued)
25 4/1 Evaluation of Machine Learning Models Lecture 18 (pdf)
26 4/3 Other Learning Methods: Unsupervised Learning & Ensemble Learning Lecture 19 (pdf) Helpful reading for all ML lectures: Python Machine Learning 2nd Edition

PA 4: Classification Assignment. Due on April 18th.

For this assignment you need to participate in the following Kaggle competition. Go here to sign up and participate in the in-class competition that corresponds to this problem.

You can find a discussion of this problem in the original Kaggle competition here.

You can implement your solution using Machine Learning methods for classification from the scikit-learn library.

Please use your student id as your team name for this competition.

Please upload a zip file with the source code of your solution on Canvas.

27 4/5 Continue with Unsupervised Learning and Ensemble Learning Continue lecture 19.
28 4/8 No Class Theo out of town (at the National Science Foundation in DC)
28 4/10 Optimization/Gradient Descent Lecture 20 (pdf)
29 4/12 Optimization Continued Lecture 20 continued
Information Extraction and Data Integration
29 4/15 Information Extraction Lecture 21 (pdf) Information Extraction: here
30 4/17 Data Integration and Entity Resolution Lecture 22 (pdf) Tutorial by Lise Getoor: here
31 4/19 Data Cleaning Lecture 23 (pdf) Tutorial on Data Cleaning: here

Kaggle tutorial on data cleaning: here

Communicating Insights
32 4/22 Intro to Visualization Lecture 24 (pdf) PA 5: Data Preparation Assignment. Due on May 1st.

For this assignment you need to participate in the following Kaggle competition. Go here to sign up and participate in the in-class competition that corresponds to this problem.

For this task you will need to solve the problem of entity matching over the provided records of products. You can use any technique you want: rule-based entity matching, machine learning-based entity matching, etc.

You need to implement your solution in Python and provide us with an IPython notebook with your code.

Please use your student id as your team name for this competition.

Please upload a zip file with the source code of your solution on Canvas.

You can find a tutorial on fuzzy string matching here.

You can find a convenient fuzzy string matching library here.

Feel free to use more advanced entity matching tools (e.g., py_entitymatching).
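As a rough illustration of the rule-based entity matching mentioned above, here is a minimal sketch that uses only Python's standard `difflib` module. The record format (pairs of id and product name), the helper names `normalize`, `similarity`, and `match_records`, and the 0.8 threshold are all assumptions made for illustration. They are not part of the assignment, and a real solution will likely need more careful features and tuning.

```python
from difflib import SequenceMatcher


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return " ".join(name.lower().split())


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def match_records(left, right, threshold=0.8):
    """Pair each left record with its most similar right record.

    `left` and `right` are lists of (record_id, product_name) tuples
    (a hypothetical format). A pair is kept only if its similarity
    reaches `threshold` (an assumed cutoff, to be tuned on real data).
    """
    matches = []
    for l_id, l_name in left:
        best_id, best_score = None, 0.0
        for r_id, r_name in right:
            score = similarity(l_name, r_name)
            if score > best_score:
                best_id, best_score = r_id, score
        if best_id is not None and best_score >= threshold:
            matches.append((l_id, best_id, best_score))
    return matches


if __name__ == "__main__":
    store_a = [(1, "Sony Bravia 40 inch TV")]
    store_b = [(10, "Sony Bravia 40in TV"), (11, "Canon PowerShot camera")]
    for l_id, r_id, score in match_records(store_a, store_b):
        print(l_id, r_id, round(score, 2))
```

This compares every left record against every right record, which is quadratic in the number of records. On larger inputs, a blocking step (for example, only comparing products that share a brand token) is a common way to cut down the number of comparisons.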

33 4/24 Data Visualization/EDA Lecture 25 (pdf)
35 4/26 Data Privacy Lecture 26 (pdf)
Exam Review
37 4/29 Final Review Final Review part 1 Sample Questions Part 1 (solutions)
38 5/1 Final Review Final Review part 2
38 5/3 Final Review Final Review part 3
37 5/6 Bonus Project Bonus Project

The last bonus project will be open ended. We ask that you create some cool visualizations with the data of the city of Milwaukee.

The city of Milwaukee has been posting city-related data in this portal: https://data.milwaukee.gov. There you can find data about public safety, housing, property, etc.

We ask that you choose one of these datasets and create a cool visualization with that raw data. You can draw inspiration from the city of New York and the different visualization projects people have been doing there: https://opendata.cityofnewyork.us/projects/.

Let's prove that Wisconsin is so much cooler than NY!

You will have until May 15th 23:59 to submit an IPython notebook with your visualizations. You are free to use any data from the city of Milwaukee you want.

This bonus project is worth an extra 10% of the overall grade. However, to get full credit you will have to create something as impressive as the projects posted in the NY portal! Partial credit will also be considered. Given that this is a bonus project, you won't have any late days. Also you really need to impress us :)

Midterm Exam
The midterm exam will be in class on March 8th from 2:25pm to 3:15pm. The location will be:
  • Birge Hall 145
Final Exam
The date of the final exam is TBD.
Grading
Programming Assignments: 45%
Midterm: 20%
Final: 35%
Bonus Project: 10% (This is a bonus and will be awarded to impressive projects)
Office Hours

Theo: Monday, Friday 3:15 - 4:15 pm (after class), Wednesday 1:00 - 2:00 pm, or by appointment @ Room CS4361

Huawei Wang: Tuesday, Thursday 9:00 am - 10:00 am @ Room 1301

Frank Zou: Tuesday, Thursday 4:00 pm - 5:00 pm @ Room CS7354

Note: the schedule of office hours may change from time to time, in which case an announcement will be made on the course Piazza.

Staff
Theo (thodrek@cs.wisc.edu)
Huawei Wang (hwang665@wisc.edu)
Frank Zou (szou28@wisc.edu)
Late Policy
You are allowed 5 free late days to use throughout the semester. One late day equals one 24-hour period after the due date of the assignment. Once you have used your late days, there will be a 20% penalty for each day an assignment is late.
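
The arithmetic of this policy can be sketched as follows. This is only one possible reading, not an official calculator. It assumes that free late days are consumed automatically in submission order, that days are counted in whole 24-hour periods, that the 20% is a fraction of the assignment's score, and that the penalty is capped at 100%. The function name `late_penalties` is hypothetical.

```python
def late_penalties(days_late_per_assignment, free_days=5, penalty_per_day=0.20):
    """Return the penalty fraction for each assignment, in submission order.

    Assumptions (not stated in the policy): free late days are used up
    automatically, earliest assignment first, and the penalty is capped
    at 100% of the assignment's score.
    """
    remaining = free_days
    penalties = []
    for days_late in days_late_per_assignment:
        covered = min(days_late, remaining)  # days absorbed by free late days
        remaining -= covered
        charged = days_late - covered        # days that incur the penalty
        penalties.append(min(1.0, charged * penalty_per_day))
    return penalties


if __name__ == "__main__":
    # Three assignments submitted 2, 3, and 1 days late, respectively.
    print(late_penalties([2, 3, 1]))
```

Under these assumptions, the first two assignments use up all five free days, so only the third one is penalized.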
Honor Code and Collaboration Policy

We encourage you to discuss the Programming Assignments with other students; it's fine to discuss overall strategy and collaborate with a partner or in a small group, as both giving and receiving advice will help you to learn.

However, you must write your own solutions to all of the problems, and you must cite all people you worked with.

It's not OK to share code or write code collaboratively. (This includes posting and/or sharing your code publicly, such as on GitHub!)

If you do not follow these rules, we will consider this a violation of the University of Wisconsin Honor Code.

If you consult any resources outside of the materials provided in class, you must cite these sources. We reserve the right to assign a penalty if your answers are substantially derivative, but, as long as you provide appropriate citations, we will not consider this an Honor Code violation.