Chat with us on the course Piazza site if you have any questions!
Class repository: CS839 Drive
We are seeing widespread investments in machine learning (ML) that enable computers to interpret what they see, communicate in natural language, answer complex questions, and interact with their environment. There is a hidden catch, however: all state-of-the-art ML systems rely on high-effort data management tasks like data exploration, data preparation, and data cleaning. The goal of this seminar course is to study data management challenges that arise in the context of machine learning pipelines. The focus will be on cutting-edge problems related to (1) data exploration and understanding, (2) data integration, cleaning, and validation, and (3) data preparation for ML models and serving of production ML applications. The seminar will be very interactive and collaborative. The topics covered and the depth of coverage will depend on the participants' input and interests. The course aims to give you an in-depth look at an important, emerging topic in data management research, and it will provide research opportunities in the areas of data management, human-computer interaction, and machine learning. Along the way, you will also pick up some practical experience in reading and presenting research papers, synthesizing research across disparate areas, using existing tools, and doing a course project that ideally will lead to a publishable paper.
Class Format
This is a seminar course. Each class will consist of presentations and discussion. Students will be required to do a class project for the course (60%). A significant portion of the grade will be based on class participation, which includes paper presentations, contributions to paper reviews, and paper discussions (40%). Because of the interactive nature of the course, and space limitations, auditing is discouraged.
Prerequisites
Mathematical maturity and a basic course in probability are required. Background in algorithms, databases, machine learning, and graphical models is suggested.
Assignments
Misc
# | Date | Topic | Lecture Materials | Reading Material | Assignments |
---|---|---|---|---|---|
Introduction and Class Overview | | | | | |
1 | 1/23 | Logistics and Data Management for Production ML | Lecture 1 | | |
2 | 1/25 | DB and ML integration: A systems perspective | Lecture 2 | | |
Data Exploration and Understanding | | | | | |
3 | 1/30 | What to expect from data: Data driven visualizations | Lecture 3 | | |
4 | 2/1 | From data cubes to feature-based analysis | Lecture 4 | | |
5 | 2/6 | Leave no relevant data behind: Data search | Lecture 5 | | |
6 | 2/8 | Data hubs: Version control for datasets | Lecture 6 | | |
Data Preparation: Extraction, Integration, Cleaning | | | | | |
7 | 2/13 | Knowledge Base Construction: From dark data to insights | Lecture 7 | | |
8 | 2/15 | No Class (Theo @ SysML) | | | |
9 | 2/20 | Creating Training Data | Lecture 8 | | Example class projects posted here! |
10 | 2/22 | Data Integration: Entity Resolution 1 | Lecture 9 | | |
11 | 2/27 | Data Integration: Entity Resolution 2 | Lecture 10 | | Project proposal due! |
12 | 3/1 | Data Integration: Data Fusion | Lecture 11 | | |
13 | 3/6 | Data Wrangling | Lecture 12 | | |
14 | 3/8 | Data Cleaning: Error Detection | Lecture 13 | | |
15 | 3/13 | Data Cleaning: Error Repairing | Lecture 14 | | |
16 | 3/15 | Data Cleaning and Machine Learning | Lecture 15 | | |
17 | 3/20 | Management of Data under Uncertainty | Lecture 16 | | |
18 | 3/22 | Summary of Data Preparation | Lecture 17 | No Readings. | |
RDBMS and ML Integration | | | | | |
19 | 4/3 | Relational vs Linear Algebra: One semi-ring to rule them all! | Lecture 18 | | Intermediate Report due! |
20 | 4/5 | ML and Data Systems: SQL and UDFs | Lecture 19 | | |
21 | 4/10 | ML and Data Systems: Statistical Relational Learning Engines | Lecture 20 | | |
22 | 4/12 | DB-inspired ML Systems: From Linear Algebra to Execution Plans and Rewrites | Lecture 21 | | |
23 | 4/17 | DB-inspired ML Systems: Compress, Scan, Index | Lecture 22 | | |
24 | 4/19 | ML serving: Feature engineering | Lecture 23 | | |
25 | 4/24 | ML serving: Model hubs; repositories for your models | Lecture 24 | | |
26 | 4/26 | Project presentations 1 | | | |
27 | 5/1 | Project presentations 2 | | | |
28 | 5/3 | Poster presentations | | | |