Neural Network Design
Capacity Heuristics

Network Design

Network design means choosing how many layers you stack (depth), how wide each layer is (width), how tensors connect (topology), and which building blocks to use (dense, convolutional, attention, …). These choices control capacity—roughly, how many different input–output mappings the family of models can represent—and inductive bias—which patterns the architecture prefers a priori. Good design matches data scale, task structure, and compute budget.


Capacity: What Can the Model Represent?

Roughly, more parameters and more nonlinear layers increase the set of functions you can approximate—up to limits imposed by architecture (e.g. a linear model stays linear). But bigger capacity without enough data or regularization invites overfitting: excellent training loss, poor generalization. Too little capacity yields underfitting: the model cannot reduce training error enough.

Validation curves guide you: if train and validation errors are both high, increase capacity or train longer; if train is low but validation is high, add data, regularization, or reduce capacity. Modern practice often starts with an established baseline architecture for the domain (ResNet-style CNNs, Transformer blocks) and adjusts width/depth to match GPU memory and dataset size.
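
As a minimal sketch of that decision rule (the diagnose helper and its tol threshold are illustrative placeholders, not a standard API, and the right threshold depends entirely on your task's error scale):

def diagnose(train_err, val_err, tol=0.05):
    # Both errors high: the model cannot even fit the training set.
    if train_err > tol and val_err > tol:
        return "underfit: increase capacity or train longer"
    # Low training error but a large generalization gap.
    if val_err - train_err > tol:
        return "overfit: add data or regularization, or reduce capacity"
    return "reasonable fit: iterate or stop"

print(diagnose(train_err=0.30, val_err=0.32))  # both high -> underfit
print(diagnose(train_err=0.02, val_err=0.15))  # large gap -> overfit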

Depth vs Width

Width (many neurons per layer) increases representational power in a single “slice” of computation; very wide shallow nets can approximate many functions. Depth (many layers) enables hierarchical composition: early layers can build simple features, later layers combine them. Empirically, depth helps in vision and language when paired with skip connections and normalization.

There is no universal formula. Rules of thumb for MLPs on tabular data might start with one or two hidden layers of moderate width and grow until validation metrics plateau. For images, convolutional locality and weight sharing are usually more parameter-efficient than gigantic fully connected stacks.
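
A small sketch of growing an MLP in this way; the make_mlp helper and its default sizes are illustrative assumptions, not tuned recommendations:

import torch.nn as nn

def make_mlp(d_in, d_out, hidden=64, n_hidden=2):
    # Stack n_hidden blocks of Linear + ReLU, then a final output layer.
    layers, d = [], d_in
    for _ in range(n_hidden):
        layers += [nn.Linear(d, hidden), nn.ReLU()]
        d = hidden
    layers.append(nn.Linear(d, d_out))
    return nn.Sequential(*layers)

# Start modest, then widen or deepen only while validation metrics keep improving.
baseline = make_mlp(d_in=20, d_out=3, hidden=64, n_hidden=2)
wider    = make_mlp(d_in=20, d_out=3, hidden=256, n_hidden=2)
deeper   = make_mlp(d_in=20, d_out=3, hidden=64, n_hidden=4)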

Scaling laws. Research on large models shows predictable relationships between data, parameters, and compute—but the fitted exponents depend on modality and training recipe. Treat scaling as empirical science, not pure theory.
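
A sketch of that empirical stance: fit a power law loss ≈ a · N^b (with b < 0) to measured (parameter count, validation loss) pairs. The numbers below are synthetic placeholders, not results from any real run:

import numpy as np

n_params = np.array([1e6, 3e6, 1e7, 3e7, 1e8])        # model sizes you actually trained
val_loss = np.array([3.10, 2.74, 2.41, 2.15, 1.92])   # hypothetical measurements

# Linear fit in log-log space: log(loss) = b * log(N) + log(a).
b, log_a = np.polyfit(np.log(n_params), np.log(val_loss), 1)
print(f"exponent b = {b:.3f}, prefactor a = {np.exp(log_a):.2f}")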

Universal Approximation (What It Does and Does Not Say)

The universal approximation theorem (informally) says that an MLP with a single hidden layer and enough nonlinear units can approximate continuous functions on compact domains arbitrarily well. This is an existence result: it does not tell you how many units you need, how to train them, or that one hidden layer is optimal in practice.

Deep networks often achieve the same accuracy with fewer parameters than a single enormous hidden layer would require. So UAT motivates why neural nets are reasonable function approximators—not why your specific 3-layer net will converge on Tuesday.
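
A toy illustration of that expressive power, fitting sin(x) on a compact interval with a single hidden layer; the width, optimizer, and step count are arbitrary choices, not tuned values:

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.linspace(-math.pi, math.pi, 256).unsqueeze(1)   # inputs on a compact interval
y = torch.sin(x)                                          # continuous target function

net = nn.Sequential(nn.Linear(1, 64), nn.Tanh(), nn.Linear(64, 1))
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

for _ in range(2000):
    opt.zero_grad()
    loss = F.mse_loss(net(x), y)
    loss.backward()
    opt.step()

print(f"final MSE: {loss.item():.6f}")   # typically small here; UAT alone guarantees none of this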

Inductive Bias: Architecture Encodes Assumptions

CNNs assume translation equivariance and local structure; they share weights across space. RNNs assume sequential structure with a hidden state. Transformers assume flexible pairwise interactions mediated by attention. A plain MLP encodes no such structure: its first layer mixes every input dimension with every other, which is fine for some tabular data but wasteful for images, where nearby pixels correlate.
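
A rough parameter-count comparison that makes the convolutional bias concrete; the layer sizes (a 32x32 RGB input, 16 output channels) are arbitrary example choices:

import torch.nn as nn

# Convolution: one 3x3 kernel per input/output channel pair, shared across all positions.
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3)   # 3*3*3*16 + 16 = 448

# Fully connected map from a flattened 32x32 RGB image to 16 full-size feature maps.
dense = nn.Linear(3 * 32 * 32, 16 * 32 * 32)                      # ~50 million parameters

count = lambda m: sum(p.numel() for p in m.parameters())
print("conv params: ", count(conv))
print("dense params:", count(dense))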

Choosing the right inductive bias reduces sample complexity: the model searches a smaller, more relevant hypothesis space. When in doubt, copy a proven blueprint for your modality and modify incrementally.

Practical Heuristics

  • Match output dimension and activation to the loss (e.g. K logits + cross-entropy).
  • Increase width before extreme depth if optimization is unstable; add normalization (BatchNorm, LayerNorm) for deep stacks.
  • Use skip connections (ResNet) when adding depth to help gradient flow.
  • Profile memory: parameter count × bytes × optimizer states matters for large models.
  • Prefer reproducible baselines and ablations over one-off heroic architectures.

Parameter count sketch. A dense layer has d_in * d_out weights plus d_out biases; sum across layers for a ballpark. A convolutional layer has k * k * c_in * c_out kernel weights plus c_out biases, independent of spatial resolution. A minimal calculation, tied to the memory point above, follows.
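
A back-of-the-envelope sketch of both points, assuming fp32 training with Adam-style state (parameters, gradients, and two moment buffers, 4 bytes each) and ignoring activation memory:

def dense_params(d_in, d_out):
    return d_in * d_out + d_out                       # weights + biases

def conv2d_params(c_in, c_out, k):
    return k * k * c_in * c_out + c_out               # kernel weights + biases

print(conv2d_params(3, 16, 3))                        # 448, matching the conv example above

n = dense_params(100, 256) + dense_params(256, 10)    # the "shallow" head from the next example
bytes_per_param = 4 * (1 + 1 + 2)                     # param + grad + two Adam moments, fp32
print(f"params: {n}, rough training memory: {n * bytes_per_param / 1e6:.2f} MB")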

Example: Compare Two MLP Heads

Shallow wide vs narrower deep
import torch.nn as nn

# Shallow & wide
shallow = nn.Sequential(
    nn.Linear(100, 256), nn.ReLU(),
    nn.Linear(256, 10),
)

# Deeper & narrower
deep = nn.Sequential(
    nn.Linear(100, 128), nn.ReLU(),
    nn.Linear(128, 128), nn.ReLU(),
    nn.Linear(128, 10),
)

def nparams(m):
    return sum(p.numel() for p in m.parameters())

print("shallow params:", nparams(shallow))
print("deep params:  ", nparams(deep))

Summary

  • Design = depth, width, topology, and block choice; together they set capacity and bias.
  • More parameters help only if optimization, data, and regularization align.
  • UAT explains expressive power of MLPs; depth often improves parameter efficiency.
  • Match architecture structure to data (CNN/RNN/Transformer) before brute-forcing width.