Neural Networks Basics
Understand the neuron: from the perceptron algorithm to multi-layer networks and backpropagation — with clean Python implementations.
- Perceptron: building block
- Forward/backward pass: chain rule
- Activation functions: sigmoid, ReLU
- NumPy: from scratch
The Perceptron — First Neural Model
Invented by Frank Rosenblatt in 1958, the perceptron is the simplest neural network: a single neuron that classifies linearly separable patterns.
How it works
- 1. Weighted sum: z = w·x + b
- 2. Step activation: ŷ = 1 if z ≥ 0, else 0
- 3. Update: w ← w + lr·(y − ŷ)·x and b ← b + lr·(y − ŷ)
Limitation
It can only learn linearly separable functions (such as AND and OR) – it cannot learn XOR. This limitation contributed to the first AI winter and motivated multi-layer networks.
Key insight: depth matters.

```python
import numpy as np

class Perceptron:
    def __init__(self, lr=0.01, epochs=15):
        self.lr = lr
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def activation(self, z):
        # Heaviside step function
        return 1 if z >= 0 else 0

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.epochs):
            for idx, x_i in enumerate(X):
                linear = np.dot(x_i, self.weights) + self.bias
                y_pred = self.activation(linear)
                update = self.lr * (y[idx] - y_pred)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear = np.dot(X, self.weights) + self.bias
        return np.array([self.activation(z) for z in linear])
```
Try it on the AND gate – it converges in fewer than 10 epochs.
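As a quick sanity check, here is the same training loop condensed into a standalone sketch on the AND gate (the learning rate and epoch count are illustrative):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND gate

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(10):  # converges well within 10 epochs
    for x_i, y_i in zip(X, y):
        y_pred = 1 if x_i @ w + b >= 0 else 0  # step activation
        w += lr * (y_i - y_pred) * x_i          # perceptron update rule
        b += lr * (y_i - y_pred)

preds = [1 if x @ w + b >= 0 else 0 for x in X]
print(preds)  # [0, 0, 0, 1]
```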
Activation Functions: Non-linearity is key
Without activation functions, stacked linear layers collapse into one linear transformation. Non-linear activations enable deep networks to approximate any function.
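A quick numerical illustration of this collapse (shapes here are arbitrary): two stacked linear layers are exactly one linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

# Two stacked linear layers with no activation in between...
two_layers = (X @ W1) @ W2
# ...are identical to one linear layer with the combined weights
one_layer = X @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```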
Sigmoid

```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

Range (0, 1); great for binary outputs, but saturates, causing vanishing gradients.

Tanh

```python
def tanh(x):
    return np.tanh(x)
```

Range (-1, 1); zero-centered, with stronger gradients than sigmoid.

ReLU

```python
def relu(x):
    return np.maximum(0, x)
```

No saturation for positive inputs and sparse activations, but neurons can "die" if they only ever receive negative inputs.

Leaky ReLU

```python
def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)
```

A small slope for negative inputs avoids dead neurons.

Softmax

```python
def softmax(x):
    ex = np.exp(x - np.max(x))  # subtract max for numerical stability
    return ex / ex.sum()
```

Turns a score vector into a multi-class probability distribution.
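A quick numerical check of two of these functions (restated here so the snippet runs standalone):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    ex = np.exp(x - np.max(x))  # subtract max for numerical stability
    return ex / ex.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))      # [0. 0. 3.]  negatives are clipped to zero
p = softmax(x)
print(p.sum())      # 1.0  softmax outputs form a probability distribution
```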
Forward Propagation & Backpropagation
Forward pass
Compute activations layer by layer, caching intermediate values for the backward pass.
Backward pass (chain rule)
∂L/∂W = (∂L/∂a) · (∂a/∂z) · (∂z/∂W)
```python
# sigmoid output layer with binary cross-entropy loss:
# the gradient dL/dz2 simplifies to (a2 - y)
def backward(self, X, y, a1, a2):
    m = X.shape[0]
    # output layer gradient
    dz2 = a2 - y.reshape(-1, 1)                       # dL/dz2
    dW2 = (1/m) * a1.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
    # hidden layer gradient
    da1 = dz2 @ self.W2.T
    dz1 = da1 * (a1 * (1 - a1))                       # sigmoid derivative
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
```
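A standard way to trust a hand-derived backward pass is a finite-difference gradient check. Here is a minimal sketch on a single sigmoid neuron with cross-entropy loss (names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, x, y):
    a = sigmoid(x @ w)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(1)
w = rng.normal(size=3)
x = rng.normal(size=3)
y = 1.0

# analytic gradient: for sigmoid + cross-entropy, dL/dw = (a - y) * x
analytic = (sigmoid(x @ w) - y) * x

# numerical gradient via central differences
eps = 1e-6
numerical = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numerical[i] = (loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps)

print(np.max(np.abs(analytic - numerical)))  # tiny: the derivation checks out
```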
Multi-Layer Perceptron (MLP) from Scratch
A complete implementation of a neural network with one hidden layer using only NumPy – the foundation of modern deep learning.
```python
import numpy as np

class MLP:
    def __init__(self, input_size, hidden_size, output_size, lr=0.1):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_deriv(self, x):
        # expects the activation a = sigmoid(z), not the pre-activation z
        return x * (1 - x)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, output):
        m = X.shape[0]
        # simplified gradient for a sigmoid output with cross-entropy loss
        self.dz2 = output - y.reshape(-1, 1)
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)

    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2

    def fit(self, X, y, epochs=1000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 200 == 0:
                # MSE as a monitoring metric; reshape avoids broadcasting to (m, m)
                loss = np.mean((output - y.reshape(-1, 1)) ** 2)
                print(f"epoch {i}, loss: {loss:.6f}")
```
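Put to the test on XOR, the very problem a single perceptron cannot solve. Below is a condensed standalone version of the same forward/backward passes; the seed, hidden width, and learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR truth table

hidden, lr = 8, 0.5
W1 = rng.normal(size=(2, hidden)) * 0.5
b1 = np.zeros((1, hidden))
W2 = rng.normal(size=(hidden, 1)) * 0.5
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m = X.shape[0]
for _ in range(20000):
    # forward pass
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # backward pass (same gradients as the class above)
    dz2 = a2 - y
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)
    W2 -= lr * (a1.T @ dz2) / m
    b2 -= lr * dz2.sum(axis=0, keepdims=True) / m
    W1 -= lr * (X.T @ dz1) / m
    b1 -= lr * dz1.sum(axis=0, keepdims=True) / m

preds = (a2 > 0.5).astype(int).ravel()
print(preds)  # once converged, this matches the XOR targets 0 1 1 0
```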
Neural Nets in Keras & PyTorch
TensorFlow/Keras
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='mse')
```
PyTorch
```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 4)
        self.out = nn.Linear(4, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.sigmoid(self.out(x))
```
Both frameworks provide automatic differentiation, GPU acceleration, and support for transfer learning.
Weight Initialization & Optimizers
Initialization
- Zero init → symmetry: every neuron computes the same gradient, so nothing is learned
- Small random values (e.g. scaled by 0.01): workable for shallow networks
- Xavier/Glorot for sigmoid/tanh: scale weights by sqrt(1/fan_in)
- He init for ReLU: scale weights by sqrt(2/fan_in) to compensate for the zeroed half
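A quick sketch of the last two schemes and the variance they target (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512

# He initialization: standard normal scaled by sqrt(2 / fan_in)
W_he = rng.normal(size=(fan_in, 256)) * np.sqrt(2.0 / fan_in)
# Xavier/Glorot (for sigmoid/tanh): scaled by sqrt(1 / fan_in)
W_xavier = rng.normal(size=(fan_in, 256)) * np.sqrt(1.0 / fan_in)

print(W_he.std() ** 2)      # close to 2 / fan_in ≈ 0.0039
print(W_xavier.std() ** 2)  # close to 1 / fan_in ≈ 0.0020
```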
Optimizers
Gradient descent comes in batch, stochastic (SGD), and mini-batch flavors. Momentum smooths and accelerates updates; RMSprop and Adam additionally adapt per-parameter learning rates.
```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
```
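As a sketch of what these optimizers do internally, here is a single SGD-with-momentum loop in NumPy minimizing a toy quadratic (the decay factor 0.9 is the common default; all values here are illustrative):

```python
import numpy as np

# minimize f(w) = w^2 with SGD + momentum
w = np.array([5.0])
velocity = np.zeros_like(w)
lr, beta = 0.1, 0.9

for _ in range(100):
    grad = 2 * w                       # gradient of w^2
    velocity = beta * velocity + grad  # accumulate a moving average of gradients
    w -= lr * velocity                 # step along the smoothed direction

print(w)  # approaches the minimum at 0
```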
Why do neural networks work?
Universal Approximation Theorem: A feedforward network with a single hidden layer can approximate any continuous function, given sufficient neurons and non-linear activation.
Real‑world usage
Regression & Forecasting
Housing prices, stock trends, energy load.
Classification
Spam detection, credit risk, medical diagnosis.
Feature learning
Autoencoders, embeddings, representation learning.