Weight Space Learning
Seeing Models as Geometry
crafted by GPT-5 Codex, 2025
Why weight space?
- Every model's parameter vector w is a single point in a high-dimensional space (sketched below).
- Learning traces a path through that space via optimization.
- Geometry reveals expressivity, robustness, and generalization.
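A minimal sketch of the first bullet, assuming a small PyTorch model: any checkpoint flattens into one long vector, i.e., a single point in R^d.
import torch
import torch.nn as nn
from torch.nn.utils import parameters_to_vector
# A toy two-layer network; any nn.Module flattens the same way.
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
# Concatenate every parameter tensor into one vector: the model as a point in R^d.
w = parameters_to_vector(model.parameters())
print(w.shape)  # torch.Size([385]) -> a point in 385-dimensional weight space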
Visualizing a slice
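One common way to produce such a slice (a hedged sketch, assuming a PyTorch model and an illustrative loss_fn(model) closure): evaluate the loss on a plane through the current weights spanned by two random directions.
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters
def loss_slice(model, loss_fn, radius=1.0, steps=25):
    # Sample the loss on a 2-D plane through the current weights,
    # spanned by two random directions d1 and d2.
    w0 = parameters_to_vector(model.parameters()).detach().clone()
    d1, d2 = torch.randn_like(w0), torch.randn_like(w0)
    alphas = torch.linspace(-radius, radius, steps)
    grid = torch.zeros(steps, steps)
    with torch.no_grad():
        for i, a in enumerate(alphas):
            for j, b in enumerate(alphas):
                vector_to_parameters(w0 + a * d1 + b * d2, model.parameters())
                grid[i, j] = loss_fn(model)
        vector_to_parameters(w0, model.parameters())  # restore the original weights
    return grid  # contour-plot the grid to get the classic landscape picture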
Weight space inhabitants
Manifold of solutions
Over-parameterized networks admit entire flat valleys of low loss.
Basins & barriers
Minima sit in basins separated by barriers and saddle points; wide, flat basins usually generalize better.
Null directions
Symmetries (e.g., neuron permutations) produce equivalent points.
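A quick check of the permutation symmetry, assuming a small PyTorch MLP: shuffling hidden units (with the matching rows and columns) gives a different point in weight space that computes exactly the same function.
import torch
import torch.nn as nn
torch.manual_seed(0)
net = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 2))
x = torch.randn(5, 4)
y_before = net(x)
perm = torch.randperm(8)  # reorder the 8 hidden units
with torch.no_grad():
    net[0].weight.copy_(net[0].weight[perm])     # permute rows of the first layer
    net[0].bias.copy_(net[0].bias[perm])
    net[2].weight.copy_(net[2].weight[:, perm])  # permute matching columns of the second
print(torch.allclose(y_before, net(x), atol=1e-6))  # True: same function, new coordinates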
Loss landscapes as terrain
Gradient descent is a hiker with noisy senses; momentum is its compass.
Optimization loop
Data → gradient → update → new point in weight space
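The same loop as a minimal hedged sketch in PyTorch; model, loader, and loss_fn are assumed to exist and are purely illustrative names.
import torch
def train(model, loader, loss_fn, lr=1e-2, epochs=1):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            loss = loss_fn(model(x), y)  # data defines the local landscape
            loss.backward()              # gradient: the downhill direction here
            opt.step()                   # update: step to a nearby point in weight space
    return model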
Regularization reshapes the landscape
- L2 adds a quadratic bowl centered at the origin, pulling weights inward (sketched after this list).
- Dropout injects noise that favors wide, noise-tolerant valleys.
- Sharpness-aware minimization (SAM) penalizes sharp minima by minimizing the worst-case loss in a small neighborhood of the weights.
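A hedged sketch of the L2 bullet, assuming PyTorch parameters: the penalty superimposes a quadratic bowl centered at w = 0 on top of the task loss. For plain SGD, the optimizer's weight_decay argument has the same effect.
def l2_regularized_loss(task_loss, params, lam=1e-4):
    # 0.5 * lam * ||w||^2 adds a bowl centered at the origin,
    # tilting every point of the landscape toward w = 0.
    penalty = sum((p ** 2).sum() for p in params)
    return task_loss + 0.5 * lam * penalty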
Dimensionality tricks
Linear mode connectivity
Interpolate between checkpoints to reveal connected minima (sketched below).
Hessian outliers
Project onto the dominant (outlier) Hessian eigenvectors to study sharpness.
Low-rank adapters
Fine-tune in subspaces without leaving pre-trained basins.
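A hedged sketch of the linear mode connectivity probe, assuming two lists of checkpoint tensors for the same architecture and an illustrative loss_fn(model) closure:
import torch
from torch.nn.utils import parameters_to_vector, vector_to_parameters
def interpolation_curve(model, ckpt_a, ckpt_b, loss_fn, steps=11):
    # Loss along the straight line between two solutions; a low, flat curve
    # suggests both minima live in one connected basin.
    w_a = parameters_to_vector(ckpt_a).detach()
    w_b = parameters_to_vector(ckpt_b).detach()
    losses = []
    with torch.no_grad():
        for alpha in torch.linspace(0.0, 1.0, steps):
            vector_to_parameters((1 - alpha) * w_a + alpha * w_b, model.parameters())
            losses.append(float(loss_fn(model)))
    return losses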
Algorithms as trajectories
def step(params, grad, state):
    direction = momentum(state, grad)  # e.g., heavy-ball velocity or Adam's first moment
    scaled = adapt(direction, state)   # e.g., per-parameter learning rates (Adam, RMSProp)
    return project(params - scaled)    # e.g., weight decay, constraints, or the identity
Optimizers differ in how they compute the direction (momentum), rescale it (adaptivity), and project the update (constraints or weight decay).
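For example, classic SGD with momentum fills in the template roughly like this (a hedged sketch; the dict-based state and hyperparameter names are assumptions):
def sgd_momentum_step(params, grad, state, lr=0.01, beta=0.9):
    # momentum(): blend the new gradient into a running velocity.
    state["velocity"] = beta * state.get("velocity", 0.0) + grad
    direction = state["velocity"]
    # adapt(): plain SGD just applies a global learning rate.
    scaled = lr * direction
    # project(): the identity here; constrained or decayed variants would clip or shrink.
    return params - scaled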
Designing with weight space intuition
- Visual diagnostics (PCA of checkpoints, CCA of representations) expose training dynamics; see the sketch after this list.
- Curriculum learning chooses paths, not just endpoints.
- Hyperparameter sweeps sample the landscape statistically.
- Ensembles average predictions from neighboring minima.
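A hedged sketch of the PCA diagnostic from the first bullet, assuming a list of flattened checkpoints (NumPy arrays) collected during training; scikit-learn's PCA is one convenient choice.
import numpy as np
from sklearn.decomposition import PCA
def trajectory_pca(checkpoints):
    # checkpoints: one flattened weight vector per saved training step.
    X = np.stack(checkpoints)  # shape: (num_checkpoints, num_weights)
    coords = PCA(n_components=2).fit_transform(X)
    return coords              # plot coords[:, 0] vs coords[:, 1] to see the path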
Key takeaways
- Treat parameters as geometry to build intuition.
- Structure creates friendly landscapes; noise explores them.
- Generalization lives in wide, connected valleys.
Thank you!
Slides written by GPT-5 Codex.
Ping me if you remix these decks or explore new basins.