MIT 6.S191: Deep Learning Fundamentals

Based on MIT’s Introduction to Deep Learning course taught by Alexander Amini & Ava Amini

Introduction: The Revolution of Deep Learning

Deep learning has fundamentally transformed artificial intelligence by teaching computers to learn directly from raw data. This revolution stems from three key convergent factors:

The Three Pillars of Deep Learning Success

1. Big Data Revolution

  • Massive datasets like ImageNet (millions of labeled images)
  • Wikipedia’s vast text corpus enabling language models
  • Easy data collection and storage infrastructure
  • Democratized access to high-quality training data

2. Hardware Acceleration

  • GPU computing enabling parallel processing
  • Specialized hardware like TPUs for AI workloads
  • Cloud computing providing scalable computational resources
  • Moore’s Law improvements in processing power

3. Software Innovation

  • Accessible frameworks: TensorFlow, PyTorch, Keras, JAX
  • Improved algorithms and architectures
  • Open-source ecosystem fostering rapid development
  • Standardized tools reducing implementation complexity

Understanding Intelligence and AI

Before diving into technical details, let’s establish a clear conceptual hierarchy:

Intelligence: The ability to process information to inform future decisions. This fundamental capacity underlies all cognitive processes.

Artificial Intelligence (AI): Building algorithms that mimic human intelligence, encompassing everything from rule-based systems to advanced machine learning.

Machine Learning (ML): A subset of AI where computers learn patterns from data without explicit programming for each specific task.

Deep Learning (DL): A subset of ML using neural networks with multiple layers to automatically extract hierarchical patterns from data.

The Perceptron: Building Block of Intelligence

Fundamental Architecture

The perceptron represents the atomic unit of neural computation - a mathematical model inspired by biological neurons. Understanding its mechanics is crucial for grasping more complex architectures.

Core Components

Inputs (x₁ to xₘ): Numerical features representing the data

  • Could be pixel values in an image
  • Word embeddings in text processing
  • Sensor readings in robotics applications

Weights (w₁ to wₘ): Learnable parameters determining input importance

  • Higher weights amplify corresponding inputs
  • Negative weights suppress inputs
  • Learned through training process

Bias (w₀): Constant offset enabling flexible decision boundaries

  • Allows activation function to shift
  • Critical for learning complex patterns
  • Independent of input values

Weighted Sum (z): Linear combination of all inputs

z = w₀ + Σ(xᵢ × wᵢ) for i=1 to m

Activation Function (g): Non-linear transformation introducing computational flexibility

Common Activation Functions

Sigmoid Function

σ(z) = 1 / (1 + e^(-z))
  • Output range: (0, 1)
  • Smooth gradient enabling backpropagation
  • Useful for probability interpretation
  • Can suffer from vanishing gradients

ReLU (Rectified Linear Unit)

ReLU(z) = max(0, z)
  • Computational efficiency
  • Addresses vanishing gradient problem
  • Introduces sparsity in activations
  • Most commonly used in practice

Hyperbolic Tangent

tanh(z) = (e^z - e^(-z)) / (e^z + e^(-z))
  • Output range: (-1, 1)
  • Zero-centered output
  • Stronger gradients than sigmoid

Mathematical Formulation

The complete perceptron equation:

ŷ = g(w₀ + X^T W)

Where:

  • ŷ: predicted output
  • g: activation function
  • X: input vector
  • W: weight vector
  • w₀: bias term

Practical Example: Binary Classification

Consider a perceptron classifying student pass/fail with:

  • Weights: W = [3, -2]
  • Bias: w₀ = 1
  • Inputs: x₁ = lectures attended, x₂ = hours procrastinated

Forward Propagation Process:

  1. Input: Student attended 4 lectures, procrastinated 2 hours
  2. Weighted Sum: z = 1 + (3 × 4) + (-2 × 2) = 1 + 12 - 4 = 9
  3. Activation: ŷ = σ(9) ≈ 0.9999 (very high probability of passing)

Geometric Interpretation: The equation w₀ + w₁x₁ + w₂x₂ = 0 defines a decision boundary in the input space. Points on different sides correspond to different classifications.

TensorFlow Implementation

import tensorflow as tf
 
# Define a single perceptron layer
perceptron = tf.keras.layers.Dense(
    units=1,           # Single output neuron
    activation='sigmoid',  # Sigmoid activation
    input_shape=(2,)   # Two input features
)
 
# Manual weight initialization
class CustomPerceptron(tf.keras.layers.Layer):
    def __init__(self, output_dim):
        super(CustomPerceptron, self).__init__()
        self.output_dim = output_dim
        
    def build(self, input_shape):
        # Initialize weights and bias
        self.W = self.add_weight(
            shape=(input_shape[-1], self.output_dim),
            initializer='random_normal',
            trainable=True
        )
        self.b = self.add_weight(
            shape=(self.output_dim,),
            initializer='zeros',
            trainable=True
        )
    
    def call(self, inputs):
        # Forward propagation
        z = tf.matmul(inputs, self.W) + self.b
        return tf.nn.sigmoid(z)

From Perceptrons to Deep Networks

Multi-Layer Architecture

While single perceptrons can only learn linearly separable patterns, combining multiple perceptrons creates powerful universal approximators.

Dense (Fully Connected) Layers: Every input connects to every output, enabling complete information flow.

Hidden Layers: Intermediate representations learned automatically, extracting increasingly abstract features.

Deep Networks: Multiple hidden layers creating hierarchical feature extraction:

  • Layer 1: Detect edges and simple patterns
  • Layer 2: Combine edges into shapes
  • Layer 3: Recognize objects from shapes
  • Output Layer: Final classification or regression

Sequential Model Architecture

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dense(32, activation='relu'),
    tf.keras.layers.Dense(10, activation='softmax')  # 10-class classification
])

This architecture progressively reduces dimensionality while extracting increasingly complex features.

Training Neural Networks: The Learning Process

Loss Functions: Quantifying Error

Training requires measuring prediction quality through loss functions:

Binary Cross-Entropy (Binary Classification):

L = -[y log(ŷ) + (1-y) log(1-ŷ)]

Categorical Cross-Entropy (Multi-class Classification):

L = -Σ yᵢ log(ŷᵢ)

Mean Squared Error (Regression):

L = (y - ŷ)²

Empirical Risk Minimization

The training objective seeks weights minimizing average loss across all training data:

W* = argmin (1/n × Σ L(f(x⁽ⁱ⁾; W), y⁽ⁱ⁾))

Where:

  • W*: optimal weights
  • f(x⁽ⁱ⁾; W): model prediction with weights W
  • y⁽ⁱ⁾: true label
  • n: number of training examples

Gradient Descent: The Optimization Engine

Gradient descent iteratively improves weights by following the negative gradient direction:

Core Algorithm:

  1. Initialize: Start with random weights
  2. Forward Pass: Compute predictions and loss
  3. Backward Pass: Calculate gradients using backpropagation
  4. Update: Adjust weights: W ← W - η × ∇L(W)
  5. Repeat: Until convergence

Learning Rate (η): Critical hyperparameter controlling step size

  • Too large: Overshooting, instability, divergence
  • Too small: Slow convergence, local minima trapping
  • Optimal: Smooth, efficient convergence

Backpropagation: Efficient Gradient Computation

Backpropagation applies the chain rule to efficiently compute gradients in multi-layer networks:

# TensorFlow automatic differentiation
with tf.GradientTape() as tape:
    predictions = model(x_train)
    loss = loss_function(y_train, predictions)
 
# Compute gradients
gradients = tape.gradient(loss, model.trainable_variables)
 
# Apply gradients
optimizer.apply_gradients(zip(gradients, model.trainable_variables))

Advanced Optimization Techniques

Stochastic Gradient Descent (SGD): Updates using single examples or mini-batches

  • Faster iteration
  • Introduces beneficial noise
  • Better generalization

Adaptive Learning Rates: Algorithms that adjust learning rates automatically

  • Adam: Combines momentum and adaptive learning rates
  • RMSprop: Adapts to gradient magnitudes
  • AdaGrad: Accumulates squared gradients
# Adam optimizer with adaptive learning rate
optimizer = tf.keras.optimizers.Adam(
    learning_rate=0.001,
    beta_1=0.9,
    beta_2=0.999,
    epsilon=1e-7
)

Addressing Overfitting: Regularization Strategies

Understanding Overfitting

Overfitting occurs when models memorize training data rather than learning generalizable patterns, resulting in poor performance on new data.

Symptoms:

  • High training accuracy, low validation accuracy
  • Model complexity exceeding data complexity
  • Sensitivity to training data variations

Regularization Techniques

Dropout: Randomly deactivates neurons during training

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu'),
    tf.keras.layers.Dropout(0.3),  # 30% dropout rate
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation='softmax')
])

Early Stopping: Monitors validation performance and halts training when overfitting begins

early_stopping = tf.keras.callbacks.EarlyStopping(
    monitor='val_loss',
    patience=5,
    restore_best_weights=True
)

L2 Regularization: Penalizes large weights

tf.keras.layers.Dense(
    64, 
    activation='relu',
    kernel_regularizer=tf.keras.regularizers.l2(0.001)
)

Complete Training Example

import tensorflow as tf
import numpy as np
 
# Prepare data
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.mnist.load_data()
x_train = x_train.reshape(-1, 784).astype('float32') / 255.0
x_test = x_test.reshape(-1, 784).astype('float32') / 255.0
 
# Build model
model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation='relu', input_shape=(784,)),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(64, activation='relu'),
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(10, activation='softmax')
])
 
# Compile model
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy']
)
 
# Train model
history = model.fit(
    x_train, y_train,
    batch_size=128,
    epochs=100,
    validation_split=0.2,
    callbacks=[
        tf.keras.callbacks.EarlyStopping(patience=5),
        tf.keras.callbacks.ReduceLROnPlateau(factor=0.5, patience=3)
    ]
)
 
# Evaluate model
test_loss, test_accuracy = model.evaluate(x_test, y_test)
print(f'Test accuracy: {test_accuracy:.4f}')

Key Insights and Takeaways

Foundation Principles

  1. Universal Approximation: Multi-layer networks can approximate any continuous function
  2. Feature Learning: Deep networks automatically discover relevant features
  3. Hierarchical Representation: Layers learn increasingly abstract concepts
  4. End-to-End Learning: Direct optimization from input to output

Practical Considerations

  1. Data Quality: Models are only as good as their training data
  2. Architecture Selection: Network design significantly impacts performance
  3. Hyperparameter Tuning: Learning rates, batch sizes, and regularization require careful selection
  4. Computational Resources: Deep learning demands significant computational power

Modern Applications

  • Computer Vision: Image classification, object detection, segmentation
  • Natural Language Processing: Language models, translation, sentiment analysis
  • Robotics: Control systems, perception, planning
  • Scientific Computing: Drug discovery, climate modeling, astronomy

Next Steps in the Deep Learning Journey

This foundational understanding of perceptrons, neural networks, and training processes prepares you for advanced topics:

  1. Deep Sequence Modeling: RNNs, LSTMs, and Transformers
  2. Convolutional Neural Networks: Computer vision applications
  3. Deep Generative Modeling: GANs and Variational Autoencoders
  4. Deep Reinforcement Learning: Agent-based learning
  5. Limitations and New Frontiers: Current challenges and future directions

The journey from simple perceptrons to sophisticated deep learning systems represents one of the most significant advances in computational intelligence. Armed with these fundamentals, you’re ready to explore the vast landscape of modern AI applications and contribute to the ongoing revolution in machine learning.


This article is part of the MIT 6.S191 Deep Learning Series. For hands-on practice, visit the course labs at introtodeeplearning.com.