About

I am a researcher at Apple. Before that, I co-founded Inductiv (acquired by Apple), a company that developed AI solutions for identifying and correcting errors in data. I was also a Professor of Computer Science at ETH Zürich and the University of Wisconsin-Madison.

My research focuses on scalable machine learning algorithms and systems over relational data. Specifically, it explores the fundamental connections between data preparation, data integration, and knowledge management on one side and statistical machine learning and probabilistic inference on the other:

- Generative Models for Data Quality: We are exploring the fundamental connections between data cleaning and generative machine learning. The HoloClean project introduced generative machine learning to the problem of data cleaning: we showed how to model data cleaning as a statistical learning problem and how attention-based mechanisms and self-supervised learning can automate data cleaning, and we introduced multiple theoretical results on how to deal with noisy or dirty data. More recently, we have been exploring the synergies between data cleaning and machine learning deployments in the Picket project. This talk at the Stanford MLSys Seminar provides an overview.
- Neural Relational Engines over Billion-scale Data: We are developing a new paradigm of systems that makes using deep learning models over billion-scale structured data easier, faster, and cheaper. We have started with the Marius project, which focuses on a key bottleneck in the development of machine learning systems over large-scale graph data: data movement during training. Marius addresses this bottleneck with a novel data flow architecture that maximizes resource utilization of the entire memory hierarchy (including disk, CPU, and GPU memory). Marius is under active development and available as an open-source project. You can learn more about Marius from our recent OSDI '21 and MLOpsWorld talks.

News

- March, 2023 Congratulations to my student Jason Mohoney for becoming an Apple AI/ML Scholar.
- January, 2022 Excited to talk about Data Debugging in ML at my alma mater, ECE @ NTUA.
- June, 2021 New talk about Marius and Machine Learning Over Billion-Edge Graphs at MLOpsWorld.
- March, 2021 Excited to be talking about Software 2.0 for Data Quality at the Stanford MLSys Seminar.
- February, 2021 Excited to talk about our work on Data Quality at CMU (ML with Large Datasets).