6.7960: Deep Learning
- Hacker’s Guide
- Code Links
- Outline
- 1. Intro to Deep Learning
- 2. How to train a neural net
- 3. Approximation Theory
- 4. Architectures for Grids
- 5. Graph Neural Networks
- 6. Neural Network Generalization
- 7. Scaling Rules for Optimization
- 8. Transformers
- 9. Hacker’s guide to DL
- 10. Memory and sequence modeling
- 11. Representation Learning I
- 12. Representation Learning II: Similarity-Based
- 13. Representation Learning Theory
- 14. Generative Models: Basics
- 15. Generative Models and Representation Learning
- 16. Conditional Generative Models
- 17. Out-of-distribution Generalization
- 18. Transfer Learning: Models
- 19. Transfer Learning: Data
- 20. Scaling Laws
- 21. Large Language Models
- 22. AI for Musicians
- 23. Metrized Deep Learning
- 24. Inference Methods for Deep Learning
- 25. Efficient Policy Optimization for LLMs
- 26. Project (AGAN)
Overview of foundational and state-of-the-art deep learning methods (as of Fall 2024). This semester’s content should also be posted on OCW.
Hacker’s Guide
Presented by Phillip Isola (Lecture 9). Here is a summary:
Become friends with every pixel.
- Look at the input.
- Look at (all aspects of) the training and output.
Look at the data.
- Histogram (distribution) of input features and targets. Want diverse data.
- Inspect the data right before calling model(data), for debugging.
- Make an inspect_data(X) function that prints type, shape, requires_grad, min/max, and mean/var, for debugging. See lecture notes.
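A minimal sketch of such a helper (the exact fields printed are a choice, not the lecture's definition):

```python
import torch

def inspect_data(x: torch.Tensor) -> None:
    """Print basic diagnostics right before calling model(x)."""
    xf = x.float()  # cast so mean/var also work for integer tensors
    print(f"dtype/device:  {x.dtype} / {x.device}")
    print(f"shape:         {tuple(x.shape)}")
    print(f"requires_grad: {x.requires_grad}")
    print(f"min/max:       {xf.min().item():.4g} / {xf.max().item():.4g}")
    print(f"mean/var:      {xf.mean().item():.4g} / {xf.var().item():.4g}")

inspect_data(torch.randn(32, 3, 64, 64))
```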
Datasets
- Use data augmentation to get the invariances you want.
- Make data harder than it would be in reality with domain randomization. The goal is to have training data that mimics possible real world changes in the test data.
- Make your data hard enough to learn a lot from.
- Formula: max over data of (min over params of loss(data, params)).
- Use big data: Lots of training pairs, high-dimensional inputs and outputs.
- Prediction gets easier with longer/more inputs.
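A sketch of the augmentation/domain-randomization idea with common torchvision transforms (the specific transforms and parameters are illustrative):

```python
import torchvision.transforms as T

# Each transform encodes an invariance you want the model to learn; pick the
# ones that mimic how the test data could actually vary (domain randomization).
augment = T.Compose([
    T.RandomResizedCrop(224),        # scale / translation invariance
    T.RandomHorizontalFlip(),        # left-right invariance
    T.ColorJitter(0.4, 0.4, 0.4),    # lighting / color invariance
    T.ToTensor(),
])
# Applied to a PIL image each epoch, so the model rarely sees the exact same input twice.
```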
Preprocessing
- Standardize all data, probably. x = (x - E[x]) / sqrt(Var[x]).
- Sanity check shapes with prime dimensional data.
- Check for unintended casts.
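A sketch of the standardization formula plus the prime-dimension and dtype checks (example shapes are my own):

```python
import torch

def standardize(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """x = (x - E[x]) / sqrt(Var[x]), computed per feature over the batch dimension."""
    return (x - x.mean(dim=0)) / (x.var(dim=0) + eps).sqrt()

# Prime dimensions (batch 3, features 5, time 7) make swapped or broadcast axes obvious.
x = torch.randn(3, 5, 7)
print(standardize(x).shape)           # should still be (3, 5, 7)

# Check for unintended casts: the dtype should survive preprocessing.
print(x.dtype, standardize(x).dtype)  # both torch.float32
```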
Model
- The simplest model/explanation that fits the data is the best model.
- Start with a standard, popular model. Popularity is more reliable than benchmarks.
- Transform your problem into a “solved” problem (like polynomial time reductions).
- Every piece of complexity you add has a cost.
- Avoid low-dimensional variables (they behave weirdly). Keep every dimension >= 8.
- The easiest ways to get better performance: Scale your data, model, or compute.
- Once you get your system working, remove everything nonessential.
Defaults
- Image classification: https://pytorch.org/hub/pytorch_vision_resnet/
- Text generation: https://huggingface.co/models?pipeline_tag=text-generation
- Transform data into one-hot vectors. Use cross-entropy loss, Adam, and a transformer architecture. Use layer norm, not batch norm.
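A minimal sketch of these defaults (sizes, mean pooling, and the two-layer encoder are arbitrary choices for illustration; PyTorch's cross_entropy takes class indices, the equivalent of one-hot targets, and its transformer layers use LayerNorm internally):

```python
import torch
import torch.nn as nn

vocab_size, d_model, num_classes = 1000, 128, 10

model = nn.Sequential(
    nn.Embedding(vocab_size, d_model),                 # tokens -> vectors
    nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
        num_layers=2,
    ),
)
head = nn.Linear(d_model, num_classes)
opt = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=3e-4)

tokens = torch.randint(0, vocab_size, (8, 16))         # (batch, seq)
labels = torch.randint(0, num_classes, (8,))           # class indices ~ one-hot targets

logits = head(model(tokens).mean(dim=1))               # pool over the sequence
loss = nn.functional.cross_entropy(logits, labels)
opt.zero_grad(); loss.backward(); opt.step()
```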
Code
- A lot of your code will be reshaping tensors. Consider using einops for readability.
- Coding copilots are great for boilerplate code, visuals, and syntax. Give them the relevant documentation as context.
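For the einops point above, a sketch of a rearrange that splits an image batch into ViT-style patches (sizes are made up):

```python
import torch
from einops import rearrange

# Named axes read as a spec instead of a chain of view/permute calls.
imgs = torch.randn(8, 3, 224, 224)   # (batch, channels, H, W)
patches = rearrange(imgs, "b c (h p1) (w p2) -> b (h w) (p1 p2 c)", p1=16, p2=16)
print(patches.shape)                 # (8, 196, 768): 14x14 patches of 16x16x3 pixels
```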
Optimization
- Adam is good for prototyping. SGD is good for performance. Clip gradients for stability.
- Use a constant learning rate at first. Retune/schedule after everything else is settled.
- Don’t tie your learning rate schedule to the number of epochs (makes it hard to compare learning curves). There are no epochs in the wild.
- Use exponential moving averages (EMA) for gradients.
- Make checkpoints of your model in case your computer crashes / the model breaks.
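A sketch of that loop: Adam with a constant learning rate, gradient clipping, and periodic checkpoints (the tiny model and synthetic data are placeholders):

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 1)
opt = torch.optim.Adam(model.parameters(), lr=3e-4)   # constant LR at first

for step in range(2000):
    x, y = torch.randn(32, 10), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # clip for stability
    opt.step()

    if step % 1000 == 0:                              # checkpoint regularly
        torch.save({"step": step, "model": model.state_dict(),
                    "opt": opt.state_dict()}, f"ckpt_{step}.pt")
```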
Evaluation
- Switch to evaluation mode using model.eval(), and check with print(model.training).
- Look at the output and loss curves to diagnose issues. Compare different size models.
- WandB and Tensorboard are your friends for logging/checkpointing.
Tuning
- Deep learning is analogous to cooking. Learn what each ingredient (data/architecture) and spice (augmentation) does. Find the right combo to solve your problem.
- Try extreme settings for a bit. If the model never diverges, your learning rate is too low.
Debugging
- Script the optimization/evaluation of your models. Use config files to manage them.
- Try to train your model on one datapoint, then a few, then the full set.
- Sanity check your loss against a reference value (e.g., random guessing on binary classification gives a cross-entropy of -ln(0.5) ≈ 0.69).
- Use the Python debugger. To find NaNs, use torch.autograd.set_detect_anomaly(True).
- Common bugs: If you run out of memory, don’t store gradients at inference time (wrap it in torch.no_grad()) and clear memory with del when needed. When timing code on GPUs, call torch.cuda.synchronize() first.
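A sketch of the inference-memory and GPU-timing tips (assumes a CUDA device; the model is a placeholder):

```python
import time
import torch
import torch.nn as nn

model = nn.Linear(256, 256).cuda()
x = torch.randn(64, 256, device="cuda")

torch.autograd.set_detect_anomaly(True)   # flag the op that produces a NaN during backward

with torch.no_grad():                     # don't store gradients at inference time
    y = model(x)

torch.cuda.synchronize()                  # GPU kernels launch asynchronously,
t0 = time.time()                          # so synchronize before and after timing
y = model(x)
torch.cuda.synchronize()
print(f"forward pass: {time.time() - t0:.6f} s")
```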
Compute
- Only parallelize if your model works on one device. More hardware = more problems.
- Saturate your GPUs to 100% utilization.
- For speed, try torch.backends.cudnn.benchmark = True, automatic mixed precision, and torch.compile().
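A sketch combining those speed knobs (assumes a CUDA device and PyTorch 2.x; exact AMP APIs vary slightly across versions):

```python
import torch
import torch.nn as nn

torch.backends.cudnn.benchmark = True               # autotune conv kernels for fixed input shapes

model = torch.compile(nn.Linear(512, 512).cuda())   # JIT-compile the model
opt = torch.optim.Adam(model.parameters(), lr=3e-4)
scaler = torch.cuda.amp.GradScaler()                # scale the loss so fp16 gradients don't underflow

x = torch.randn(64, 512, device="cuda")
y = torch.randn(64, 512, device="cuda")

with torch.autocast("cuda", dtype=torch.float16):   # automatic mixed precision
    loss = nn.functional.mse_loss(model(x), y)
scaler.scale(loss).backward()
scaler.step(opt)
scaler.update()
```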
Code Links
To PyTorch code!
- PyTorch Cheatsheet
- Homework/Project Code Folder
- CIFAR10, GNN, Vision Architectures, Hyperparameter Transfer, Transformers, Representation Learning, Generative Models, Online Loss Learning
Outline
Since there’s no cheatsheet for this class, I’ll include an outline of the lectures plus a short summary of each. Refer to the lecture notes or the Fall 2024 schedule for more details.
1. Intro to Deep Learning
Review of machine learning and neural nets. Intro to tensors and deep net generalization / the double-descent curve. PyTorch primer.
2. How to train a neural net
Review of gradient descent / SGD. Loss functions and activations like GeLU. Computation graphs. Backpropagation details through chains, MLPs, and DAGs. Differentiable programming: software 2.0 built from blocks optimized against a scalar cost. (Homework 1 covered neural net basics.)
3. Approximation Theory
Universal approximation theory of neural nets, proof details (Lipschitz continuous functions, approximate with hyperrectangles). Width vs. depth scaling, depth separation results example (to show that depth can scale better). Practical considerations for generalization: More parameters is better, depth of >6 layers seems to have negligible effect.
4. Architectures for Grids
Building better architectures with good inductive biases (built-in assumptions for the right answer / generalization). Convolutional neural networks (CNNs) providing translation invariance. Multichannel CNNs. Pyramid convnet structures with pooling / strided operations. Practical CNN architecture tips (refer to existing ones). Receptive fields. Positional encoding. Popular architectures like encoder-decoder, ResNet, neural fields like SIREN and NeRF.
5. Graph Neural Networks
GNN use cases (e.g., Google Maps), processing general-shaped graphs. Motivation for permutation invariance and equivariance. Images are like graphs. Message passing details (update/aggregate loop). Learning a shortest-path algorithm. Graph embeddings (aggregate the set of node embeddings). Unrolled GNNs. GNN extensions. Approximation power: unable to distinguish some non-isomorphic graphs.
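A minimal sketch of one update/aggregate round of message passing over a dense adjacency matrix (the linear message/update maps and ReLU are illustrative choices, not the lecture's exact formulation):

```python
import torch
import torch.nn as nn

class MessagePassingLayer(nn.Module):
    """One round of message passing: aggregate neighbor messages, then update."""
    def __init__(self, dim: int):
        super().__init__()
        self.message = nn.Linear(dim, dim)
        self.update = nn.Linear(2 * dim, dim)

    def forward(self, h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # h: (num_nodes, dim) node embeddings; adj: (num_nodes, num_nodes) 0/1 matrix
        messages = adj @ self.message(h)          # sum messages from neighbors
        return torch.relu(self.update(torch.cat([h, messages], dim=-1)))

h = torch.randn(5, 16)                            # 5 nodes, 16-dim embeddings
adj = (torch.rand(5, 5) < 0.4).float()            # random adjacency
h = MessagePassingLayer(16)(h, adj)               # updated node embeddings
# A graph embedding is then an aggregation over nodes, e.g. h.mean(dim=0).
```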
6. Neural Network Generalization
Approximation vs. generalization. Deep nets do generalize (mostly). Generalization theory (Occam’s razor, bias-variance tradeoff, double descent). Vapnik-Chervonenkis theory, which basically says we need more data and fewer possible functions to fit, but this doesn’t explain deep learning. Theories for why deep nets generalize (implicit regularization; the shortest program generalizes best).
7. Scaling Rules for Optimization
The optimization problem (minimize loss efficiently/optimally), larger width = lower learning rate, larger depth = more residual blocks, Newton’s method, Gauss-Newton decomposition for optimization, proof of steepest descent under an arbitrary norm (ex: spectral norm), norms and dual norms, scaling heuristics, spectral perspective of neural nets, library of atomic modules. (Homework 2 covered steepest descent under the spectral norm and vision architectures.)
8. Transformers
A transformer architecture is what you should use today. Motivation (sparse/global), tokens, tokenization, attention details, self-attention, multi-head self-attention (MSA), the vision transformer (ViT), transformer code, positional encoding (e.g., Fourier basis), causal masking to train GPT. (Homework 3 covered transformers and other sequence models.)
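A single-head self-attention sketch with optional causal masking (real transformers use multiple heads plus residual connections, layer norm, and MLP blocks; sizes are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head self-attention over a batch of token sequences."""
    def __init__(self, dim: int):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.out = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor, causal: bool = False) -> torch.Tensor:
        q, k, v = self.qkv(x).chunk(3, dim=-1)                 # (batch, seq, dim) each
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        if causal:                                             # GPT-style masking
            seq = x.shape[1]
            mask = torch.triu(torch.ones(seq, seq, dtype=torch.bool), diagonal=1)
            scores = scores.masked_fill(mask, float("-inf"))
        return self.out(F.softmax(scores, dim=-1) @ v)

tokens = torch.randn(2, 10, 64)                                # (batch, seq, dim)
print(SelfAttention(64)(tokens, causal=True).shape)            # (2, 10, 64)
```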
9. Hacker’s guide to DL
Tons of practical tips for deep learning. Split into a separate section at the top.
10. Memory and sequence modeling
CNNs for sequences. Recurrent neural networks (RNNs). Intuition for long short-term memory (LSTMs): modeling useful cognitive concepts as differentiable code. Sequence models and long memory. Teacher forcing / beam search. Longer-context transformers (Transformer-XL, Longformer, Big Bird, RETRO). Hypernets, codebooks.
11. Representation Learning I
Neural nets learn representations. Generative modeling. Neural embeddings in representation space. CLIP visualized with mappings. Visualizing CNNs. Encoders. Transfer learning (linear adaptation or finetuning). How to learn a good representation (compact, explanatory, disentangled, interpretable, makes subsequent problems easy) with compression or prediction. Autoencoders via compression; clustering / k-means / VQ nets. Self-supervised learning via prediction (imputation), which works better than autoencoders and is the modern way to learn representations due to stronger training signals. Masked autoencoders (MAE), bidirectional transformers (BERT).
12. Representation Learning II: Similarity-Based
Review of the first part. Similarity-based representation learning through contrastive signals. Metric learning (group similar data points in representation space). Contrastive losses need a good distance metric. Triplet networks (one positive, one negative pair) and extensions. Presenting hard negative pairs. The encoder maps onto a hypersphere. Self-supervised learning can sometimes outperform supervised pre-training. SimCLR. Metrics for alignment and uniformity. Improvements via fancy data augmentation. Larger batch sizes work better. iNaturalist case study.
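A sketch of a SimCLR-style contrastive (InfoNCE) loss, simplified to one direction: embeddings are normalized onto the hypersphere, positives sit on the diagonal of the similarity matrix, and everything else in the batch acts as a negative (the temperature is an arbitrary choice):

```python
import torch
import torch.nn.functional as F

def infonce_loss(z1: torch.Tensor, z2: torch.Tensor, temperature: float = 0.1) -> torch.Tensor:
    """z1[i] and z2[i] are embeddings of two augmentations of the same example."""
    z1, z2 = F.normalize(z1, dim=-1), F.normalize(z2, dim=-1)   # map onto the hypersphere
    logits = z1 @ z2.t() / temperature          # (batch, batch) similarity matrix
    targets = torch.arange(z1.shape[0])         # positives are on the diagonal
    return F.cross_entropy(logits, targets)

z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
print(infonce_loss(z1, z2))
```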
13. Representation Learning Theory
Fast training of neural nets using steepest descent under the spectral norm, momentum, low precision, and UV^T computation via iteration for nanoGPT. Kernel methods, approximate functions with a bunch of “bumps”. Gaussian processes, sample a bunch of functions consistent with the data. Covariance functions. Details and proof that neural networks that are infinitely wide, with weights randomly sampled, are equivalent to Gaussian processes. Not really used in practice though. (Homework 4 covered contrastive/representation learning.)
14. Generative Models: Basics
Generative modeling is the inverse of representation learning. Formal problem specification. “Noise” is latent variables, like adjustable knobs. Can learn generators that make data directly, or learn a function that scores data quality, then generate data with high scores. Density functions, energy functions, and samplers. KL divergence. Low energy = High probability. Contrastive divergence algorithm. Autoregressive models. Diffusion models / Gaussian diffusion. Common strategy: Turn generative modeling into a sequence of supervised learning problems. Generative adversarial networks (GANs): generator and discriminator in a student/teacher-like game.
15. Generative Models and Representation Learning
Using decoders as generative models. Variational autoencoder (VAE) details, viewed as a mixture of Gaussians. (GPT is great at converting LaTeX screenshots to code.) Map an input to a Gaussian mean/variance. Tricks: Estimate the integral via Monte Carlo sampling, do importance sampling, and predict the optimal sampling distribution with an encoder. Evidence lower bound (ELBO). Looks like an autoencoder. Disentanglement with BigGAN. When mapping to a sphere, you will get cuts/seams in the representation.
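A minimal VAE sketch: the encoder maps an input to a Gaussian mean/log-variance, the reparameterization trick samples z, and the negative ELBO is a reconstruction term plus a KL term to N(0, I). The sum-of-squares reconstruction stands in for a Gaussian likelihood; sizes are arbitrary:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyVAE(nn.Module):
    def __init__(self, x_dim: int = 784, z_dim: int = 8):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # predicts mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)

    def forward(self, x: torch.Tensor):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()   # reparameterization trick
        return self.dec(z), mu, logvar

def negative_elbo(x, x_hat, mu, logvar):
    recon = F.mse_loss(x_hat, x, reduction="sum")                  # reconstruction term
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())   # KL to N(0, I)
    return recon + kl

vae = TinyVAE()
x = torch.randn(16, 784)
print(negative_elbo(x, *vae(x)))
```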
16. Conditional Generative Models
Structured prediction (modeling joint distributions). Make it categorical by quantizing your data. Semantic segmentation (Sat2Map). Use an objective that can model structure (ex: graphical model, GAN, etc). Conditional GAN for image-to-image (ViT is the modern version of a CNN). Patch discriminator (used in previous versions of stable diffusion). Conditional VAE. Control your data via explicit inputs or latent variables. Autoregressive models are conditional generative models. Cross-attention. Text-to-image diffusion models, LLaVA image-to-text encoder. Unpaired translation, CycleGAN with cycle consistency loss. Paired data isn’t always necessary. (Homework 5 covered various forms of autoencoders.)
17. Out-of-distribution Generalization
Modern ML is only used in high precision, human verification systems. The world is changing; datasets do not. Data poisoning. Adversarial examples from small perturbations (even 3D-printed or in speech recognition) transfer across models trained on the same dataset. Gradient ascent to find one. Defend with adaptive data augmentation (adversarial training) by intentionally training on worst-case perturbations. Models will find the easiest signal to follow. Predictive features, some robust, some non-robust (e.g., fish tend to come with a water background). Not all shortcuts are desirable. Distribution shifts; the WILDS benchmark to test robustness. Distributionally robust optimization (DRO) and generalization. Class imbalance fixes. Domain adaptation.
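A sketch of finding an adversarial perturbation with one step of gradient ascent on the loss (FGSM-style); the tiny classifier, label, and epsilon budget are placeholders:

```python
import torch
import torch.nn as nn

model = nn.Linear(32, 10)
x = torch.randn(1, 32, requires_grad=True)
label = torch.tensor([3])

loss = nn.functional.cross_entropy(model(x), label)
loss.backward()                                    # gradient of the loss w.r.t. the input

epsilon = 0.05                                     # perturbation budget
x_adv = x + epsilon * x.grad.sign()                # step in the direction that increases the loss
print(model(x_adv).argmax(), "vs original", model(x).argmax())

# Adversarial training then adds such worst-case perturbed examples back into
# the training set (adaptive data augmentation).
```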
18. Transfer Learning: Models
Pretraining for prior knowledge. Finetune some or all of the original net. Contrastive pre-training, using “glue” adapters to handle different dimensions. Domain adaptation: Learn a projection from the target domain to the source domain. Domain Adversarial Neural Networks (DANN). Unsupervised Domain Adaptation (UDA), exponential moving average. Align and Distill (ALDI). Knowledge distillation: Train a small student model to mimic a teacher model. Cross-model distillation (SoundNet), distilling an ensemble. Compression vs. distillation. Foundation models. Large cost of training models. Deep learning with no data: Just ask through prompting and few-shot learning. CLIP. Visual prompting via image inpainting works too. DALL-E 2.
19. Transfer Learning: Data
Data++: Data as a first class object to transfer knowledge about the inputs. Interpolation from a middle layer, manipulation, composition, optimization for counterfactual reasoning. Generative models as Data++ with DatasetGAN. Explaining a classifier with StylEx (GAN). Contrastive learning + generative modeling by making positive pairs through transformations in latent space. Model collapse / manifold collapse. Model-Agnostic Meta-Learning (MAML), or by sequence modeling.
20. Scaling Laws
Compute doubles every 3 months. Simple algorithms that scale well work. Power laws tend to fit scaling patterns well. Choosing the right model size is the primary concern; the exact architecture doesn’t have a big impact (for vanilla nets). Chinchilla (DeepMind) found scaling laws to be heavily dependent on optimization methods. Guideline: For every data point you have, add one new parameter. The data manifold hypothesis as an explanation of scaling laws. Carefully choosing non-redundant data can beat scaling laws.
21. Large Language Models
Selected concepts from 6.8611: Natural Language Processing. See the full NLP post for content.
22. AI for Musicians
Presented by Anna Huang. Diversity of musical instruments/styles. Mawf, which uses differentiable digital signal processing (DDSP). Coconet (Counterpoint by Convolution), using masked training to generate musical scores, with annealed random sampling. Orderless Neural Autoregressive Distribution Estimator (Orderless NADE). MIDI-DDSP for fine-grained control over generated music: model note expression controls as scalar features, use an RNN to generate pitch (a sequence), and use a GAN to check for realistic spectrograms (images). Generate pitch, amplitude, harmonic distribution, and noise magnitude separately. The Music Transformer represents each note as an event (note on, note off, note velocity, clock advance amount) and gives the transformer relative distances (positional encoding). Reinforcement learning leads to creativity; how you explore is important. Creative Adversarial Networks (CANs). Musicians use AI tools to help with brainstorming.
23. Metrized Deep Learning
A deep learning library should be like a Lego set. Steepest descent refresher. Modules made of atoms (Linear, Conv2d, Embed), bonds (ReLU, FunctionalAttention), and compounds (MLP). The goal is to predict the sensitivity of the output to small changes in the inputs/weights. Combination/concatenation rules for modules. Naive scaling can hurt: performance degrades and the optimal learning rate drifts. The Modula library for auto-normalization. Faster training with modular dualization (project the gradient to the weight space). This plus Newton-Schulz helped set a training speed record.
24. Inference Methods for Deep Learning
Inference here means predicting answers for new datapoints (the ML usage vs. the statistics usage). Scaling of both search and learning time. Search in board games like chess and Go (AlphaGo). Best-of-N, beam search, chain-of-thought, verification (of code). Steering model outputs toward human preferences. New paradigm: “Just ask”. Ex: “Make it more”. In-context learning; test-time training (on few-shot examples, or via self-supervised learning) works well on ARC. STaR (Self-Taught Reasoner): train on examples that are verified to be correct. o1 possibility: Search for CoTs that solve a problem (randomly sample), finetune on the ones that worked, and repeat.
25. Efficient Policy Optimization for LLMs
RLHF: Supervised fine tuning, preference reward model training, reinforcement learning (PPO). Reset with reference policy, constant rollin and temporary rollout. PPO++ for a better initial state distribution. Dataset reset policy optimization (DR-PO). Reduce computation/memory overhead by rewriting PPO as a regression problem (REBEL). Multi-turn interactions with a semi-realistic simulator (between two chatbots, REFUEL) to prevent performance degradation in later conversation turns.
26. Project (AGAN)
Meta learning with an online loss function (represented by a neural net with adaptable weights), applied only to the discriminator. It didn’t work, due to the low expressivity of the generator. GANs are very unstable, and the online loss unfortunately makes them even more unstable. Adam is much better than SGD. Training times are long, so a good batch size and torch.backends.cudnn.benchmark matter. Good-looking figures matter a lot for papers and blog posts. Don’t be afraid to reach out / cold email people.
Last updated: 31 December 2024