CS839 Data Management for Machine Learning Applications

COMP SCI 1257 on TuTh 2:30-3:45pm

Chat with us on the course Piazza site if you have any questions!

Class repository: CS839 Drive


We are seeing widespread investments in machine learning (ML) that enable computers to interpret what they see, communicate in natural language, answer complex questions, and interact with their environment. There is a hidden catch, however: all state-of-the-art ML systems rely on high-effort data management tasks like data exploration, data preparation, and data cleaning. The goal of this seminar course is to study data management challenges that arise in the context of machine learning pipelines. The focus will be on cutting-edge problems in the context of ML pipelines, related to (1) data exploration and understanding, (2) data integration, cleaning, and validation, and (3) data preparation for ML models and serving of production ML applications. The seminar will be very interactive and collaborative. The topics covered and the depth of coverage will depend on the participants' input and interests. The goal of the course is to give you an in-depth look at an important, emerging topic in data management research. This course will provide research opportunities in the areas of data management, human-computer interaction, and machine learning. Along the way, you will also pick up some practical experience in reading and presenting research papers, synthesizing research across disparate areas, using existing tools, and doing a course project that ideally will lead to a publishable paper.

Class Logistics

Class Format

  • This is a seminar course. Each class will consist of presentations and discussion. Students will be required to do a class project for the course (60%). A significant portion of the grade will be based on class participation, which includes paper presentations, contributions to paper reviews, and paper discussions (40%). Because of the interactive nature of the course, and space limitations, auditing is discouraged.


  • Mathematical maturity and a basic course in probability required. Background in algorithms, databases, machine learning, and graphical models suggested.


  • You will need to form groups of up to three people and work on certain assignments as described below.
  • A research class project: Each group will work on a research project and file a single submission. The project will be broken down into five assignments: (1) initial research proposal, (2) intermediate report, (3) final report, (4) final presentation (in class), and (5) poster presentation (people from the entire department will be invited).
  • Questions, Comments, and Responses (QCRs): For each class you will need to provide 3 questions and 3 comments for each (mandatory) paper. QCs will be individual assignments. For each class, one of the groups (see above) will be responsible for leading a discussion on the papers and providing answers to all posted questions during class. Comments and answers will then be summarized in a written report, which will be submitted and shared with everyone in the class.
  • There will be no midterm or final exams.


  • Class time may be adjusted to accommodate external talks related to the class.
  • Google drive for deliverables: CS839 Drive

Tentative Lecture Plan (Subject to Change)

# Date Topic Lecture Materials Reading Material Assignments
Introduction and Class Overview
1 1/23 Logistics and Data Management for Production ML Lecture 1
2 1/25 DB and ML integration: A systems perspective Lecture 2
Data Exploration and Understanding
3 1/30 What to expect from data: Data driven visualizations Lecture 3
4 2/1 From data cubes to feature-based analysis Lecture 4
5 2/6 Leave no relevant data behind: Data search Lecture 5
6 2/8 Data hubs: Version control for datasets Lecture 6
Data Preparation: Extraction, Integration, Cleaning
7 2/13 Knowledge Base Construction: From dark data to insights Lecture 7
8 2/15 No Class (Theo @ SysML)
9 2/20 Creating Training Data Lecture 8 Example class projects posted here!
10 2/22 Data Integration: Entity Resolution 1 Lecture 9
11 2/27 Data Integration: Entity Resolution 2 Lecture 10 Project proposal due!
12 3/1 Data Integration: Data Fusion Lecture 11
13 3/6 Data Wrangling Lecture 12
14 3/8 Data Cleaning: Error Detection Lecture 13
15 3/13 Data Cleaning: Error Repairing Lecture 14
16 3/15 Data Cleaning and Machine Learning Lecture 15
17 3/20 Management of Data under Uncertainty Lecture 16
18 3/22 Summary of Data Preparation Lecture 17 No Readings.
RDBMS and ML Integration
19 4/3 Relational vs Linear Algebra: One semi-ring to rule them all! Lecture 18 Intermediate Report due!
20 4/5 ML and Data Systems: SQL and UDFs Lecture 19
21 4/10 ML and Data Systems: Statistical Relational Learning Engines Lecture 20
22 4/12 DB-inspired ML Systems: From Linear Algebra to Execution Plans and Rewrites Lecture 21
23 4/17 DB-inspired ML Systems: Compress, Scan, Index Lecture 22
24 4/19 ML serving: Feature engineering Lecture 23
25 4/24 ML serving: Model hubs; repositories for your models Lecture 24
26 4/26 Project presentations 1
27 5/1 Project presentations 2
28 5/3 Poster presentations

Questions and Comments: 20%
Responses and Discussion: 20%
Project Proposal: 10%
Project Intermediate Report: 10%
Project Final Report: 30%
Project Presentation and Poster: 10%

Office Hours

Theo: by appointment @ Room CS4361

Late Policy and Deliverables
There will be no late days for the project deliverables. However, you have the option to skip up to three QCs. Additional extensions may be granted in the case of a severe medical or family emergency.
The template of this website was created by HazyResearch@Stanford.