Neural Networks Basics
Understand the neuron: from the perceptron algorithm to multi-layer networks and backpropagation — with clean Python implementations.
- Perceptron: building block
- Forward/backward pass: chain rule
- Activation functions: sigmoid, ReLU
- NumPy: from scratch
The Perceptron — First Neural Model
Invented by Frank Rosenblatt in 1958, the perceptron is the simplest neural network: a single neuron that classifies linearly separable patterns.
How it works
- 1. Weighted sum: z = w·x + b
- 2. Step activation: ŷ = 1 if z ≥ 0, else 0
- 3. Update: w ← w + lr·(y − ŷ)·x and b ← b + lr·(y − ŷ)
Limitation
It can only learn linearly separable functions (such as AND and OR) – it cannot learn XOR. This limitation contributed to the first AI winter and motivated multi-layer networks.
Key insight: depth matters.

```python
import numpy as np

class Perceptron:
    def __init__(self, lr=0.01, epochs=15):
        self.lr = lr
        self.epochs = epochs
        self.weights = None
        self.bias = None

    def activation(self, z):
        # Heaviside step function
        return 1 if z >= 0 else 0

    def fit(self, X, y):
        n_samples, n_features = X.shape
        self.weights = np.zeros(n_features)
        self.bias = 0
        for _ in range(self.epochs):
            for idx, x_i in enumerate(X):
                linear = np.dot(x_i, self.weights) + self.bias
                y_pred = self.activation(linear)
                update = self.lr * (y[idx] - y_pred)
                self.weights += update * x_i
                self.bias += update

    def predict(self, X):
        linear = np.dot(X, self.weights) + self.bias
        return np.array([self.activation(z) for z in linear])
```
Try it on the AND gate – it converges in fewer than 10 epochs.
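As a quick sanity check, here is the same training loop condensed into a standalone sketch on the AND gate (the learning rate and epoch count are illustrative):

```python
import numpy as np

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])
y = np.array([0, 0, 0, 1])  # AND gate

w, b, lr = np.zeros(2), 0.0, 0.1
for _ in range(10):  # converges well within 10 epochs
    for x_i, y_i in zip(X, y):
        y_pred = 1 if x_i @ w + b >= 0 else 0  # step activation
        w += lr * (y_i - y_pred) * x_i          # perceptron update rule
        b += lr * (y_i - y_pred)

preds = [1 if x @ w + b >= 0 else 0 for x in X]
print(preds)  # [0, 0, 0, 1]
```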
Activation Functions: Non-linearity is key
Without activation functions, stacked linear layers collapse into one linear transformation. Non-linear activations enable deep networks to approximate any function.
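A quick numerical illustration of this collapse (shapes here are arbitrary): two stacked linear layers are exactly one linear layer whose weight matrix is the product of the two.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 4))
W1 = rng.normal(size=(4, 8))
W2 = rng.normal(size=(8, 3))

# Two stacked linear layers with no activation in between...
two_layers = (X @ W1) @ W2
# ...are identical to one linear layer with the combined weights
one_layer = X @ (W1 @ W2)

print(np.allclose(two_layers, one_layer))  # True
```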
Sigmoid

```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))
```

Range (0, 1); great for binary outputs, but saturates, causing vanishing gradients.

Tanh

```python
def tanh(x):
    return np.tanh(x)
```

Range (-1, 1); zero-centered, with stronger gradients than sigmoid.

ReLU

```python
def relu(x):
    return np.maximum(0, x)
```

No saturation for positive inputs and sparse activations, but neurons can "die" if they only ever receive negative inputs.

Leaky ReLU

```python
def leaky_relu(x, alpha=0.1):
    return np.where(x > 0, x, alpha * x)
```

A small slope for negative inputs avoids dead neurons.

Softmax

```python
def softmax(x):
    ex = np.exp(x - np.max(x))  # subtract max for numerical stability
    return ex / ex.sum()
```

Turns a score vector into a multi-class probability distribution.
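A quick numerical check of two of these functions (restated here so the snippet runs standalone):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    ex = np.exp(x - np.max(x))  # subtract max for numerical stability
    return ex / ex.sum()

x = np.array([-2.0, 0.0, 3.0])
print(relu(x))      # [0. 0. 3.]  negatives are clipped to zero
p = softmax(x)
print(p.sum())      # 1.0  softmax outputs form a probability distribution
```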
Forward Propagation & Backpropagation
Forward pass
Compute activations layer by layer, caching intermediate values for the backward pass.
Backward pass (chain rule)
∂L/∂W = (∂L/∂a) · (∂a/∂z) · (∂z/∂W)
```python
# sigmoid output layer with binary cross-entropy loss:
# the gradient dL/dz2 simplifies to (a2 - y)
def backward(self, X, y, a1, a2):
    m = X.shape[0]
    # output layer gradient
    dz2 = a2 - y.reshape(-1, 1)                       # dL/dz2
    dW2 = (1/m) * a1.T @ dz2
    db2 = (1/m) * np.sum(dz2, axis=0, keepdims=True)
    # hidden layer gradient
    da1 = dz2 @ self.W2.T
    dz1 = da1 * (a1 * (1 - a1))                       # sigmoid derivative
    dW1 = (1/m) * X.T @ dz1
    db1 = (1/m) * np.sum(dz1, axis=0, keepdims=True)
```
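A standard way to trust a hand-derived backward pass is a finite-difference gradient check. Here is a minimal sketch on a single sigmoid neuron with cross-entropy loss (names and shapes are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def loss(w, x, y):
    a = sigmoid(x @ w)
    return -(y * np.log(a) + (1 - y) * np.log(1 - a))

rng = np.random.default_rng(1)
w = rng.normal(size=3)
x = rng.normal(size=3)
y = 1.0

# analytic gradient: for sigmoid + cross-entropy, dL/dw = (a - y) * x
analytic = (sigmoid(x @ w) - y) * x

# numerical gradient via central differences
eps = 1e-6
numerical = np.zeros_like(w)
for i in range(len(w)):
    w_plus, w_minus = w.copy(), w.copy()
    w_plus[i] += eps
    w_minus[i] -= eps
    numerical[i] = (loss(w_plus, x, y) - loss(w_minus, x, y)) / (2 * eps)

print(np.max(np.abs(analytic - numerical)))  # tiny: the derivation checks out
```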
Multi-Layer Perceptron (MLP) from Scratch
A complete implementation of a neural network with one hidden layer using only NumPy – the foundation of modern deep learning.
```python
import numpy as np

class MLP:
    def __init__(self, input_size, hidden_size, output_size, lr=0.1):
        self.lr = lr
        self.W1 = np.random.randn(input_size, hidden_size) * 0.5
        self.b1 = np.zeros((1, hidden_size))
        self.W2 = np.random.randn(hidden_size, output_size) * 0.5
        self.b2 = np.zeros((1, output_size))

    def sigmoid(self, x):
        return 1 / (1 + np.exp(-x))

    def sigmoid_deriv(self, x):
        # expects the activation a = sigmoid(z), not the pre-activation z
        return x * (1 - x)

    def forward(self, X):
        self.z1 = X @ self.W1 + self.b1
        self.a1 = self.sigmoid(self.z1)
        self.z2 = self.a1 @ self.W2 + self.b2
        self.a2 = self.sigmoid(self.z2)
        return self.a2

    def backward(self, X, y, output):
        m = X.shape[0]
        # simplified gradient for a sigmoid output with cross-entropy loss
        self.dz2 = output - y.reshape(-1, 1)
        self.dW2 = (1/m) * self.a1.T @ self.dz2
        self.db2 = (1/m) * np.sum(self.dz2, axis=0, keepdims=True)
        self.da1 = self.dz2 @ self.W2.T
        self.dz1 = self.da1 * self.sigmoid_deriv(self.a1)
        self.dW1 = (1/m) * X.T @ self.dz1
        self.db1 = (1/m) * np.sum(self.dz1, axis=0, keepdims=True)

    def update(self):
        self.W1 -= self.lr * self.dW1
        self.b1 -= self.lr * self.db1
        self.W2 -= self.lr * self.dW2
        self.b2 -= self.lr * self.db2

    def fit(self, X, y, epochs=1000):
        for i in range(epochs):
            output = self.forward(X)
            self.backward(X, y, output)
            self.update()
            if i % 200 == 0:
                # MSE as a monitoring metric; reshape avoids broadcasting to (m, m)
                loss = np.mean((output - y.reshape(-1, 1)) ** 2)
                print(f"epoch {i}, loss: {loss:.6f}")
```
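Put to the test on XOR, the very problem a single perceptron cannot solve. Below is a condensed standalone version of the same forward/backward passes; the seed, hidden width, and learning rate are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(42)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)  # XOR truth table

hidden, lr = 8, 0.5
W1 = rng.normal(size=(2, hidden)) * 0.5
b1 = np.zeros((1, hidden))
W2 = rng.normal(size=(hidden, 1)) * 0.5
b2 = np.zeros((1, 1))

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

m = X.shape[0]
for _ in range(20000):
    # forward pass
    a1 = sigmoid(X @ W1 + b1)
    a2 = sigmoid(a1 @ W2 + b2)
    # backward pass (same gradients as the class above)
    dz2 = a2 - y
    dz1 = (dz2 @ W2.T) * a1 * (1 - a1)
    W2 -= lr * (a1.T @ dz2) / m
    b2 -= lr * dz2.sum(axis=0, keepdims=True) / m
    W1 -= lr * (X.T @ dz1) / m
    b1 -= lr * dz1.sum(axis=0, keepdims=True) / m

preds = (a2 > 0.5).astype(int).ravel()
print(preds)  # once converged, this matches the XOR targets 0 1 1 0
```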
Neural Nets in Keras & PyTorch
TensorFlow/Keras
```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation='relu', input_shape=(4,)),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(optimizer='adam', loss='mse')
```
PyTorch
```python
import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.fc2 = nn.Linear(8, 4)
        self.out = nn.Linear(4, 1)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        return torch.sigmoid(self.out(x))
```
Both frameworks provide automatic differentiation, GPU acceleration, and support for transfer learning.
Weight Initialization & Optimizers
Initialization
- Zero init → symmetry: every neuron computes the same gradient, so nothing is learned
- Small random values (e.g. scaled by 0.01): workable for shallow networks
- Xavier/Glorot for sigmoid/tanh: scale weights by sqrt(1/fan_in)
- He init for ReLU: scale weights by sqrt(2/fan_in) to compensate for the zeroed half
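A quick sketch of the last two schemes and the variance they target (the layer sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
fan_in = 512

# He initialization: standard normal scaled by sqrt(2 / fan_in)
W_he = rng.normal(size=(fan_in, 256)) * np.sqrt(2.0 / fan_in)
# Xavier/Glorot (for sigmoid/tanh): scaled by sqrt(1 / fan_in)
W_xavier = rng.normal(size=(fan_in, 256)) * np.sqrt(1.0 / fan_in)

print(W_he.std() ** 2)      # close to 2 / fan_in ≈ 0.0039
print(W_xavier.std() ** 2)  # close to 1 / fan_in ≈ 0.0020
```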
Optimizers
Gradient descent comes in batch, stochastic (SGD), and mini-batch flavors. Momentum smooths and accelerates updates; RMSprop and Adam additionally adapt per-parameter learning rates.
```python
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)
```
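As a sketch of what these optimizers do internally, here is a single SGD-with-momentum loop in NumPy minimizing a toy quadratic (the decay factor 0.9 is the common default; all values here are illustrative):

```python
import numpy as np

# minimize f(w) = w^2 with SGD + momentum
w = np.array([5.0])
velocity = np.zeros_like(w)
lr, beta = 0.1, 0.9

for _ in range(100):
    grad = 2 * w                       # gradient of w^2
    velocity = beta * velocity + grad  # accumulate a moving average of gradients
    w -= lr * velocity                 # step along the smoothed direction

print(w)  # approaches the minimum at 0
```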
Why do neural networks work?
Universal Approximation Theorem: A feedforward network with a single hidden layer can approximate any continuous function, given sufficient neurons and non-linear activation.
Real‑world usage
Regression & Forecasting
Housing prices, stock trends, energy load.
Classification
Spam detection, credit risk, medical diagnosis.
Feature learning
Autoencoders, embeddings, representation learning.