Weight Space Learning
Seeing Models as Geometry
crafted by GPT-5 Codex, 2025
Why weight space?
- Every model's parameter vector w is a single point in a high-dimensional space (sketched below).
- Learning traces a path through that space via optimization.
- Geometry reveals expressivity, robustness, and generalization.
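A minimal sketch of the first bullet, assuming a small PyTorch model: any checkpoint flattens into one long vector, i.e., a single point in R^d.
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector
# A toy two-layer network; any nn.Module flattens the same way.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
# Concatenate every parameter tensor into one vector: the model as a point in R^d.
w = parameters_to_vector(model.parameters())
print(w.shape)  # torch.Size([385]) -> a point in 385-dimensional weight space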
Visualizing a slice
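One common way to produce such a slice (a hedged sketch, assuming a PyTorch model and an illustrative loss_fn(model) closure): evaluate the loss on a plane through the current weights spanned by two random directions.
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters
def loss_slice(model, loss_fn, radius=1.0, steps=25):
    # Sample the loss on a 2-D plane through the current weights,
    # spanned by two random directions d1 and d2.
    w0 = parameters_to_vector(model.parameters()).detach().clone()
    d1, d2 = torch.randn_like(w0), torch.randn_like(w0)
    alphas = torch.linspace(-radius, radius, steps)
    grid = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                vector_to_parameters(w0 + a * d1 + b * d2, model.parameters())
                grid[i, j] = loss_fn(model)
        vector_to_parameters(w0, model.parameters())  # restore the original weights
    return grid  # contour-plot the grid to get the classic landscape picture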
Weight space inhabitants
Manifold of solutions
Over-parameterized networks admit entire flat valleys of low loss.
Basins & barriers
Minima sit in basins separated by barriers and saddle points; wide, flat basins usually generalize better.
Null directions
Symmetries (e.g., neuron permutations) produce equivalent points.
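A quick check of the permutation symmetry, assuming a small PyTorch MLP: shuffling hidden units (with the matching rows and columns) gives a different point in weight space that computes exactly the same function.
import torch
import torch.nn as nn
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))
x = torch.randn(5, 4)
y_before = net(x)
perm = torch.randperm(8)  # reorder the 8 hidden units
with torch.no_grad():
    net[0].weight.copy_(net[0].weight[perm])     # permute rows of the first layer
    net[0].bias.copy_(net[0].bias[perm])
    net[2].weight.copy_(net[2].weight[:, perm])  # permute matching columns of the second
print(torch.allclose(y_before, net(x), atol=1e-6))  # True: same function, new coordinates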
Loss landscapes as terrain
Gradient descent is a hiker with noisy senses; momentum is its compass.
Optimization loop
Data → gradient → update → new point in weight space
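The same loop as a minimal hedged sketch in PyTorch; model, loader, and loss_fn are assumed to exist and are purely illustrative names.
import torch
def train(model, loader, loss_fn, lr=1e-2, epochs=1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)  # data defines the local landscape
            loss.backward()              # gradient: the downhill direction here
            opt.step()                   # update: step to a nearby point in weight space
    return model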
Regularization reshapes the landscape
- L2 adds a quadratic bowl centered at the origin, pulling weights inward (sketched after this list).
- Dropout injects noise that favors wide, noise-tolerant valleys.
- Sharpness-aware minimization (SAM) penalizes sharp minima by minimizing the worst-case loss in a small neighborhood of the weights.
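A hedged sketch of the L2 bullet, assuming PyTorch parameters: the penalty superimposes a quadratic bowl centered at w = 0 on top of the task loss. For plain SGD, the optimizer's weight_decay argument has the same effect.
def l2_regularized_loss(task_loss, params, lam=1e-4):
    # 0.5 * lam * ||w||^2 adds a bowl centered at the origin,
    # tilting every point of the landscape toward w = 0.
    penalty = sum((p ** 2).sum() for p in params)
    return task_loss + 0.5 * lam * penalty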
Dimensionality tricks
Linear mode connectivity
Interpolate between checkpoints to reveal connected minima (sketched below).
Hessian outliers
Project onto the dominant (outlier) Hessian eigenvectors to study sharpness.
Low-rank adapters
Fine-tune in subspaces without leaving pre-trained basins.
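A hedged sketch of the linear mode connectivity probe, assuming two lists of checkpoint tensors for the same architecture and an illustrative loss_fn(model) closure:
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters
def interpolation_curve(model, ckpt_a, ckpt_b, loss_fn, steps=11):
    # Loss along the straight line between two solutions; a low, flat curve
    # suggests both minima live in one connected basin.
    w_a = parameters_to_vector(ckpt_a).detach()
    w_b = parameters_to_vector(ckpt_b).detach()
    losses = []
    with torch.no_grad():
        for alpha in torch.linspace(0.0, 1.0, steps):
            vector_to_parameters((1 - alpha) * w_a + alpha * w_b, model.parameters())
            losses.append(float(loss_fn(model)))
    return losses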
Algorithms as trajectories
def step(params, grad, state):
    direction = momentum(state, grad)  # e.g., heavy-ball velocity or Adam's first moment
    scaled = adapt(direction, state)   # e.g., per-parameter learning rates (Adam, RMSProp)
    return project(params - scaled)    # e.g., weight decay, constraints, or the identity
Optimizers differ in how they compute the direction (momentum), rescale it (adaptivity), and project the update (constraints or weight decay).
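For example, classic SGD with momentum fills in the template roughly like this (a hedged sketch; the dict-based state and hyperparameter names are assumptions):
def sgd_momentum_step(params, grad, state, lr=0.01, beta=0.9):
    # momentum(): blend the new gradient into a running velocity.
    state["velocity"] = beta * state.get("velocity", 0.0) + grad
    direction = state["velocity"]
    # adapt(): plain SGD just applies a global learning rate.
    scaled = lr * direction
    # project(): the identity here; constrained or decayed variants would clip or shrink.
    return params - scaled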
Designing with weight space intuition
- Visual diagnostics (PCA of checkpoints, CCA of representations) expose training dynamics; see the sketch after this list.
- Curriculum learning chooses paths, not just endpoints.
- Hyperparameter sweeps sample the landscape statistically.
- Ensembles average predictions from neighboring minima.
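A hedged sketch of the PCA diagnostic from the first bullet, assuming a list of flattened checkpoints (NumPy arrays) collected during training; scikit-learn's PCA is one convenient choice.
import numpy as np
from sklearn.decomposition import PCA
def trajectory_pca(checkpoints):
    # checkpoints: one flattened weight vector per saved training step.
    X = np.stack(checkpoints)  # shape: (num_checkpoints, num_weights)
    coords = PCA(n_components=2).fit_transform(X)
    return coords              # plot coords[:, 0] vs coords[:, 1] to see the path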
Key takeaways
- Treat parameters as geometry to build intuition.
- Structure creates friendly landscapes; noise explores them.
- Generalization lives in wide, connected valleys.
Thank you!
Slides written by GPT-5 Codex.
Ping me if you remix these decks or explore new basins.