Weight Space Learning

Seeing Models as Geometry

crafted by GPT-5 Codex, 2025

Why weight space?

  • Every model parameter vector w is a point in a high-dimensional space.
  • Learning traces a path through that space via optimization.
  • Geometry reveals expressivity, robustness, and generalization.

Visualizing a slice

[Figure: 2D projection of a high-dimensional weight space, marking the minimizer w*, loss contours, and the gradient direction.]
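
A minimal sketch of how such a slice can be computed, assuming a flat parameter vector w_star and a hypothetical loss_fn that maps parameters to a scalar:

import numpy as np

def loss_slice(w_star, loss_fn, extent=1.0, steps=25, seed=0):
    """Evaluate loss_fn on a 2D plane through w_star spanned by two random directions."""
    rng = np.random.default_rng(seed)
    d1 = rng.standard_normal(w_star.shape)
    d2 = rng.standard_normal(w_star.shape)
    d1 /= np.linalg.norm(d1)
    d2 /= np.linalg.norm(d2)
    alphas = np.linspace(-extent, extent, steps)
    betas = np.linspace(-extent, extent, steps)
    grid = np.empty((steps, steps))
    for i, a in enumerate(alphas):
        for j, b in enumerate(betas):
            grid[i, j] = loss_fn(w_star + a * d1 + b * d2)
    return alphas, betas, grid   # grid for a contour plot of the slice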

Weight space inhabitants

Manifold of solutions

Over-parameterized networks admit entire flat valleys of low loss.

Basins & barriers

In high dimensions saddle points vastly outnumber minima; wide, flat basins usually generalize better.

Null directions

Symmetries (e.g., neuron permutations) produce equivalent points.
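
A tiny numpy sketch of that permutation symmetry: shuffling the hidden units of a two-layer MLP (and the matching rows and columns of its weight matrices) gives a different point in weight space but an identical function. All shapes and names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal(4)   # hidden layer
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal(2)   # output layer

def mlp(x, W1, b1, W2, b2):
    h = np.maximum(0.0, W1 @ x + b1)   # ReLU hidden layer
    return W2 @ h + b2

perm = rng.permutation(4)              # shuffle the hidden units
x = rng.standard_normal(3)
y_original = mlp(x, W1, b1, W2, b2)
y_permuted = mlp(x, W1[perm], b1[perm], W2[:, perm], b2)
assert np.allclose(y_original, y_permuted)   # different weights, identical function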

Loss landscapes as terrain

Gradient descent is a hiker with noisy senses; momentum is its compass.

Optimization loop

data batch → forward pass → loss & gradients → parameter update

Data → gradient → update → new point in weight space
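
A minimal numpy sketch of that loop on a toy linear-regression task (the task, batch size, and learning rate are stand-ins): each iteration turns a data batch into a gradient and moves w to a new point in weight space.

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((256, 5))
y = X @ rng.standard_normal(5) + 0.1 * rng.standard_normal(256)   # toy targets
w = np.zeros(5)                                                   # starting point in weight space

for _ in range(200):
    batch = rng.choice(len(X), size=32, replace=False)            # data batch
    pred = X[batch] @ w                                           # forward pass
    grad = 2 * X[batch].T @ (pred - y[batch]) / len(batch)        # loss & gradients (MSE)
    w -= 0.05 * grad                                              # parameter update: new point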

Regularization reshapes the landscape

  • L2 adds a quadratic bowl centered at the origin, pulling weights inward.
  • Dropout forces exploration of wider valleys.
  • Sharpness-aware minimization penalizes the worst-case loss in a small neighborhood (see the sketch after this list).
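
A hedged sketch combining the L2 and SAM bullets in one update rule, assuming a hypothetical grad_fn(w) that returns the loss gradient: the L2 term pulls toward the origin, and the SAM-style perturbation takes the gradient at a nearby worst-case point.

import numpy as np

def sam_step(w, grad_fn, lr=0.01, rho=0.05, weight_decay=1e-4):
    """One SAM-style update: step against the gradient at a worst-case nearby point."""
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)   # climb toward higher loss nearby
    g_sharp = grad_fn(w + eps)                    # gradient at the perturbed point
    g_sharp += weight_decay * w                   # L2 pull toward the origin
    return w - lr * g_sharp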

Dimensionality tricks

Linear mode connectivity

Interpolate between checkpoints to reveal connected minima.
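
A sketch of that interpolation, assuming two checkpoints flattened into equal-length vectors and a hypothetical loss_fn; a low, flat loss curve along the line suggests the two minima sit in one connected region.

import numpy as np

def interpolation_losses(w_a, w_b, loss_fn, steps=11):
    """Loss along the straight line between two checkpoints in weight space."""
    alphas = np.linspace(0.0, 1.0, steps)
    return [loss_fn((1 - a) * w_a + a * w_b) for a in alphas]   # flat curve => connected minima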

Curvature probes

Project onto dominant Hessian eigenvectors to study sharpness.
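
One way to find those eigenvectors without forming the Hessian is power iteration on Hessian-vector products; the finite-difference approximation below is a rough stand-in for autodiff, and grad_fn is a hypothetical gradient oracle.

import numpy as np

def top_hessian_eigenpair(w, grad_fn, iters=20, eps=1e-3, seed=0):
    """Power iteration with finite-difference Hessian-vector products."""
    v = np.random.default_rng(seed).standard_normal(w.shape)
    v /= np.linalg.norm(v)
    eigenvalue = 0.0
    for _ in range(iters):
        hv = (grad_fn(w + eps * v) - grad_fn(w - eps * v)) / (2 * eps)   # ~ H @ v
        eigenvalue = float(v @ hv)                                       # Rayleigh quotient
        v = hv / (np.linalg.norm(hv) + 1e-12)
    return eigenvalue, v   # sharpness estimate and its direction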

Low-rank adapters

Fine-tune in subspaces without leaving pre-trained basins.
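
A minimal sketch of the low-rank idea in the spirit of LoRA: the frozen matrix is only ever modified by a rank-r product, so fine-tuning moves within a small subspace around the pre-trained point. Dimensions and names are illustrative.

import numpy as np

rng = np.random.default_rng(0)
d_out, d_in, rank = 64, 64, 4
W_frozen = rng.standard_normal((d_out, d_in))        # pre-trained weight, kept fixed
A = rng.standard_normal((rank, d_in)) * 0.01         # trainable low-rank factor
B = np.zeros((d_out, rank))                          # zero init => adapter starts as a no-op

def adapted_forward(x):
    return (W_frozen + B @ A) @ x                    # fine-tuning moves only B and A

x = rng.standard_normal(d_in)
assert np.allclose(adapted_forward(x), W_frozen @ x) # initially identical to the base model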

Algorithms as trajectories


def step(params, grad, state, lr=1e-2, beta=0.9):
    # Momentum: accumulate an exponential moving average of past gradients.
    state["velocity"] = beta * state.get("velocity", 0.0) + grad
    direction = state["velocity"]
    scaled = lr * direction           # adaptive methods rescale this per coordinate
    return params - scaled            # constrained methods would project the result back

Optimizers differ in how they choose the direction, how they scale it, and whether they project the update back onto a constraint set.

Designing with weight space intuition

  • Visual diagnostics (PCA, CCA) expose training dynamics (see the sketch after this list).
  • Curriculum learning chooses paths, not just endpoints.
  • Hyperparameter sweeps sample the landscape statistically.
  • Ensembles average across neighboring minima.
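
As an example of the first bullet, a PCA sketch over checkpoints saved during training (assumed here to be flattened into equal-length vectors) projects the whole trajectory into two dimensions for plotting.

import numpy as np

def trajectory_pca(checkpoints):
    """Project flattened checkpoints onto their top two principal components."""
    W = np.stack(checkpoints)                 # shape: (num_checkpoints, num_params)
    W_centered = W - W.mean(axis=0)
    _, _, Vt = np.linalg.svd(W_centered, full_matrices=False)
    return W_centered @ Vt[:2].T              # 2D coordinates of the training path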

Key takeaways

  1. Treat parameters as geometry to build intuition.
  2. Structure creates friendly landscapes; noise explores them.
  3. Generalization lives in wide, connected valleys.

Further reading

Visualizing the Loss Landscape of Neural Nets

Mode Connectivity and the Landscape of Neural Computation

Sharpness-Aware Minimization

Thank you!

Slides written by GPT-5 Codex.

Ping me if you remix these decks or explore new basins.