Teaching
Probabilistic Graphical Models
[Course Website]
Many problems in artificial intelligence, statistics, computer systems, computer vision, natural language processing, and computational biology, among many other fields, can be viewed as the search for a coherent global conclusion from local information. The probabilistic graphical models framework provides a unified view of this wide range of problems and enables efficient inference, decision-making, and learning in problems with very large numbers of attributes and huge datasets. This graduate-level course will provide you with a strong foundation both for applying graphical models to complex problems and for addressing core research topics in graphical models.
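As a minimal illustration of that idea (a hypothetical rain/sprinkler example, not course material), the sketch below assembles a tiny Bayesian network from local conditional probabilities and answers a global query by enumeration:

```python
# Local conditionals combine into a global joint distribution,
# from which queries are answered by inference (here, enumeration).
# Hypothetical illustration only.
from itertools import product

P_rain = {True: 0.2, False: 0.8}
P_sprinkler = {True: 0.1, False: 0.9}
# P(wet grass | rain, sprinkler)
P_wet = {(True, True): 0.99, (True, False): 0.9,
         (False, True): 0.8, (False, False): 0.05}

def joint(rain, sprinkler, wet):
    """The joint probability factorizes into the local conditionals."""
    p_w = P_wet[(rain, sprinkler)]
    return P_rain[rain] * P_sprinkler[sprinkler] * (p_w if wet else 1 - p_w)

# Query: P(rain | wet grass), marginalizing out the sprinkler.
num = sum(joint(True, s, True) for s in (True, False))
den = sum(joint(r, s, True) for r, s in product((True, False), repeat=2))
print(f"P(rain | wet) = {num / den:.3f}")
```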
Introduction to Machine Learning
[Course Website]
Machine Learning is concerned with computer programs that automatically improve their performance through experience (e.g., programs that learn to recognize human faces, recommend music and movies, and drive autonomous robots). This course covers the theory and practical algorithms for machine learning from a variety of perspectives, with topics including linear regression, SVMs, neural networks, graphical models, and clustering. Programming assignments include hands-on experiments with various learning algorithms. This course is designed to give a PhD-level student a thorough grounding in the methodologies, technologies, mathematics, and algorithms currently needed by people who do research in machine learning.
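For a flavor of the first topic, here is a minimal sketch (illustrative only, not an assignment) of fitting a linear regression model by ordinary least squares with NumPy:

```python
# Ordinary least squares on synthetic data; illustrative sketch only.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                  # 100 samples, 3 features
true_w = np.array([2.0, -1.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=100)    # noisy linear targets

# Add a bias column and solve min_w ||Xw - y||^2 in closed form.
Xb = np.hstack([X, np.ones((100, 1))])
w_hat, *_ = np.linalg.lstsq(Xb, y, rcond=None)
print("estimated weights:", w_hat.round(2))
```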
Talks
-
On the Utility of Gradient Compression in Distributed Training Systems
CIAI Colloquium, MBZUAI
2022
A rich body of prior work has highlighted the existence of communication bottlenecks in distributed training. To alleviate these bottlenecks, a long line of recent research proposes gradient compression methods. In this talk, Dr. Hongyi Wang (CMU) will first evaluate the efficacy of gradient compression methods and compare their scalability with optimized implementations of synchronous data-parallel SGD across more than 200 realistic distributed setups. The observation is that, surprisingly, gradient compression methods provide a promising speedup in only six of the 200 cases. He will then present an extensive investigation that identifies the root causes of this phenomenon, along with a performance model that can be used to identify the benefits of gradient compression for a variety of system setups. Finally, he will propose a list of desirable properties (along with two algorithmic instances) that a gradient compression method should satisfy in order to provide significant speedup in real distributed training systems.
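As background, here is a minimal sketch of one widely studied compression scheme, top-k gradient sparsification with error feedback (an illustrative example, not necessarily one of the methods evaluated in the talk):

```python
# Top-k sparsification: send only the k largest-magnitude gradient
# entries, and accumulate everything else locally as error feedback.
# Illustrative sketch only.
import numpy as np

def topk_compress(grad, residual, k):
    """Return (indices, values) of the k largest entries of grad+residual,
    plus the updated local residual (error feedback)."""
    corrected = grad + residual
    idx = np.argpartition(np.abs(corrected), -k)[-k:]  # top-k by magnitude
    values = corrected[idx]
    new_residual = corrected.copy()
    new_residual[idx] = 0.0   # entries we sent leave the residual
    return idx, values, new_residual

grad = np.random.default_rng(0).normal(size=1_000_000)
residual = np.zeros_like(grad)
idx, vals, residual = topk_compress(grad, residual, k=10_000)  # ~1% density
print(f"sent {idx.size} of {grad.size} entries")
```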
-
From Learning, to Meta-Learning, to "Lego-Learning" – theory, system, and applications
Baidu
2021
Software systems for complex tasks, such as controlling manufacturing processes in real time or writing radiological case reports within a clinical workflow, are becoming increasingly sophisticated and consist of a large number of data, model, algorithm, and system elements and modules. Traditional benchmark- and leaderboard-driven bespoke approaches in the machine learning community are not suited to meet the demanding industrial standards beyond algorithmic performance, such as cost-effectiveness, safety, scalability, and automatability, that are typically expected in production systems. In this talk, I discuss some technical issues toward addressing these challenges: 1) a theoretical framework for trustworthy and panoramic learning with all experiences; 2) optimization methods that best realize learning under such a principled framework; 3) compositional strategies for building production-grade ML programs from standard parts. I will present our recent work toward developing a standard model for Learning that unifies different machine learning paradigms and algorithms, then a Bayesian blackbox optimization approach to Meta-Learning in the space of hyperparameters, model architectures, and system configurations, and finally principles and designs of standardized software Legos that facilitate cost-effective building, training, and tuning of practical ML pipelines and systems.
-
It is time for deep learning to understand its expense bills
In the past several years, deep learning has dominated both academic and industrial R&D over a wide range of applications, with two remarkable trends: 1) developing and training ever-larger "all-purpose" monster models over all data possibly available, with an astounding 10,000x increase in parameter counts over the past three years; 2) developing and assembling end-to-end "white-box" deployments with an ever-larger number of component sub-models that need to be highly customized and interoperable. Progress reported on leaderboards or featured in news headlines highlights metrics such as saliency of content production, accuracy of labeling, or speed of convergence, but a number of key challenges impacting the cost-effectiveness of such results, and eventually the sustainability of current R&D efforts in DL, are not receiving enough attention: 1) For large models, how many lines of code outside of the DL model are needed to parallelize the computing over a computer cluster? 2) Which, and how many, hardware resources should be used to train and deploy the model? 3) How should the model, the code, and the system be tuned to achieve optimum performance? 4) Can we automate composition, parallelization, tuning, and resource sharing between many users and jobs? In this talk, I will discuss these issues as a core focus in SysML research, and I will present some preliminary results on how to build standardizable, adaptive, and automatable system support for DL based on first principles (when available) underlying DL design and implementation.
-
Learning-to-learn through Model-based Optimization: HPO, NAS, and Distributed Systems
In recent years we have seen rapid progress in developing modern NLP applications, either by building omni-purpose systems via training massive language models such as GPT-3 on big data, or by building industrial solutions for specific real-world use cases via composition from pre-made modules. In both cases, a bottleneck developers often face is the effort required to determine the best way to train the model: how to find the optimal configuration of hyper-parameters of the model(s), big or small, single or multiple; how to choose the best structure of a single large network or a pipeline of multiple model modules; or even how to dynamically pick the best learning rate and gradient-update transmission/synchronization scheme to achieve the best “Goodput” of training on a cluster. This is a special area in meta-learning that concerns the question of “learning to learn”. However, many existing methods remain rather primitive, including random search, simple line or grid (or hyper-grid) search, and genetic algorithms, and suffer limitations in optimality, efficiency, scalability, adaptability, and the ability to leverage domain knowledge.
In this talk, we present a learning-to-learn methodology based on model-based optimization (MBO), which leverages machine learning models that take actions to gather information and provide recommendations to efficiently improve performance. This approach exhibits several advantages over existing alternatives: 1) it provides adaptive/elastic algorithms that improve performance online; 2) it can incorporate domain knowledge into the models for improved recommendations; and 3) it facilitates more data-efficient automatic learning-to-learn, or Auto-ML. We show applications of Auto-ML via MBO in three main tasks: hyper-parameter tuning, neural architecture search, and Goodput optimization in distributed systems. We argue that such applications can improve the productivity and performance of NLP systems across the board.
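To make the MBO loop concrete, here is a minimal sketch (illustrative only, with a synthetic stand-in objective, not the talk's actual system) of tuning a single hypothetical learning-rate knob with a Gaussian-process surrogate and an expected-improvement acquisition:

```python
# Model-based optimization loop: fit a surrogate to past evaluations,
# pick the next configuration by expected improvement, evaluate, repeat.
# Illustrative sketch with a synthetic objective.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

def objective(lr):
    """Stand-in for an expensive training run: loss vs. log10(lr)."""
    return (lr + 2.5) ** 2 + 0.1 * np.sin(8 * lr)

candidates = np.linspace(-5, 0, 200).reshape(-1, 1)  # log10 learning rate
X = np.array([[-5.0], [0.0]])                        # initial evaluations
y = np.array([objective(x[0]) for x in X])

for _ in range(10):
    gp = GaussianProcessRegressor(normalize_y=True, alpha=1e-6).fit(X, y)
    mu, sigma = gp.predict(candidates, return_std=True)
    best = y.min()
    z = (best - mu) / np.maximum(sigma, 1e-9)
    ei = (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)  # expected improvement
    x_next = candidates[np.argmax(ei)]
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next[0]))

print(f"best log10(lr) found: {X[np.argmin(y)][0]:.2f}")
```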
-
A Data-Centric View for Composable Natural Language Processing
Empirical natural language processing (NLP) systems in application domains such as healthcare, finance, and education involve frequent manipulation of data and interoperation among multiple components, ranging from data ingestion, text retrieval, analysis, and generation to human interactions such as visualization and annotation. The diverse nature of the components in such complex systems makes it challenging to create standardized, robust, and reusable components.
In this talk, we present a data-centric view of NLP operation and tooling, which bridges different styles of software libraries, different user personas, and additional infrastructures such as those for visualization and distributed training. We propose a highly universal data representation called DataPack, which builds on a flexible type ontology that is morphable and extendable enough to subsume any commonly used data format in all known (and, hopefully, future) NLP tasks, yet remains invariant as a software data structure that can be passed across any NLP building blocks. Based on this abstraction, we develop Forte, a Data-Centric Framework for Composable NLP Workflows, with rich in-house processors, standardized third-party API wrappers, and operation logic implemented at the right level of abstraction to facilitate rapid composition of sophisticated NLP solutions from heterogeneous components.
By defining and leveraging appropriate abstractions of NLP data, Forte aims to bridge silos and divergent efforts in NLP tool development and to bring good software engineering practices into NLP development, with the goal of helping NLP practitioners build robust NLP systems more efficiently.
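To illustrate the abstraction (a hypothetical sketch of the underlying idea, not Forte's actual DataPack API), the snippet below shows a single container holding raw text plus typed span annotations, through which independent processors can interoperate:

```python
# A hypothetical, simplified DataPack-like container: raw text plus
# typed, span-based annotations any component can read or extend.
# See the Forte repository for the real API.
from dataclasses import dataclass, field

@dataclass
class Annotation:
    """A typed span over the text, e.g. a Sentence or an EntityMention."""
    type_name: str
    begin: int
    end: int
    attributes: dict = field(default_factory=dict)

@dataclass
class DataPack:
    text: str
    annotations: list = field(default_factory=list)

    def add(self, type_name, begin, end, **attributes):
        self.annotations.append(Annotation(type_name, begin, end, attributes))

    def get(self, type_name):
        return [a for a in self.annotations if a.type_name == type_name]

    def span_text(self, a):
        return self.text[a.begin:a.end]

# Two independent "processors" interoperate through the shared pack.
pack = DataPack("Forte helps compose NLP pipelines.")
pack.add("Sentence", 0, len(pack.text))              # a segmenter's output
pack.add("EntityMention", 0, 5, ner_type="PRODUCT")  # a made-up NER output
for ent in pack.get("EntityMention"):
    print(pack.span_text(ent), ent.attributes)
```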