MIT 6.S191: Deep Sequence Modeling
From Recurrent Neural Networks to Transformers - Understanding Sequential Data Processing
Introduction: The Sequential Data Challenge
While feedforward networks excel at processing fixed-size inputs independently, the real world is fundamentally sequential. Language, speech, music, financial time series, sensor data, and video all contain temporal dependencies where context and order matter critically.
Consider these examples:
- Language: “The bank was steep” vs “The bank was closed” - the same word “bank” takes on different meanings (riverbank vs financial institution) depending on context
- Music: A sequence of notes creates melody; order determines harmonic progression
- Finance: Stock prices depend on historical trends and temporal patterns
- Video: Object motion requires understanding across multiple frames
The core limitation of standard feedforward networks is their inability to maintain memory - information from previous inputs that influences current processing.
Recurrent Neural Networks: Adding Memory to Neural Networks
The Fundamental Innovation
Recurrent Neural Networks (RNNs) introduce the concept of internal state - a memory mechanism that allows networks to maintain information across time steps. This transforms neural networks from stateless functions to stateful processors.
Key Innovation: Instead of processing inputs independently, RNNs maintain a hidden state that captures information from all previous inputs in the sequence.
Mathematical Foundation
The RNN computation involves two key equations:
Hidden State Update:
h_t = tanh(W_hh × h_{t-1} + W_xh × x_t + b_h)
Output Generation:
ŷ_t = W_hy × h_t + b_y
Where:
- h_t: hidden state at time t
- x_t: input at time t
- ŷ_t: output at time t
- W_hh: hidden-to-hidden weights
- W_xh: input-to-hidden weights
- W_hy: hidden-to-output weights
- b_h, b_y: bias terms
Parameter Sharing: The Key to Scalability
A crucial insight in RNN design is parameter sharing - the same weight matrices are used at every time step. This provides several advantages:
- Variable Length Processing: Handle sequences of any length with fixed parameters
- Computational Efficiency: Fewer parameters to learn
- Translation Invariance: Patterns learned at one time step apply to all time steps
- Generalization: Better performance on unseen sequence lengths
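To make parameter sharing concrete, here is a minimal sketch (illustrative only, not the Keras API): the same three weight tensors, created once, are reused at every time step, so one loop handles sequences of any length.

import tensorflow as tf

# Illustrative sketch: the same weights serve every time step and any sequence length.
hidden_units, input_dim = 4, 3
W_xh = tf.random.normal((hidden_units, input_dim))     # input-to-hidden weights
W_hh = tf.random.normal((hidden_units, hidden_units))  # hidden-to-hidden weights
b_h = tf.zeros((hidden_units,))

def run_rnn(sequence):
    """sequence: tensor of shape (time_steps, input_dim); returns the final hidden state."""
    h = tf.zeros((hidden_units,))
    for x_t in sequence:  # one update per time step, reusing the same W_xh, W_hh, b_h
        h = tf.tanh(tf.linalg.matvec(W_xh, x_t) + tf.linalg.matvec(W_hh, h) + b_h)
    return h

print(run_rnn(tf.random.normal((5, input_dim))).shape)   # 5-step sequence
print(run_rnn(tf.random.normal((50, input_dim))).shape)  # 50-step sequence, same parameters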
TensorFlow Implementation
import tensorflow as tf

# Simple RNN layer
rnn_layer = tf.keras.layers.SimpleRNN(
    units=128,              # Hidden state dimensionality
    return_sequences=True,  # Return full sequence or just final output
    return_state=True       # Also return final hidden state
)

# Using RNN in a model (vocab_size and embedding_dim are assumed to be set for your dataset)
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, embedding_dim),
    tf.keras.layers.SimpleRNN(128, return_sequences=True),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])
# Custom RNN cell implementation
class CustomRNNCell(tf.keras.layers.Layer):
    def __init__(self, units):
        super(CustomRNNCell, self).__init__()
        self.units = units
        self.state_size = units

    def build(self, input_shape):
        # Weight matrices
        self.W_xh = self.add_weight(
            shape=(input_shape[-1], self.units),
            initializer='glorot_uniform'
        )
        self.W_hh = self.add_weight(
            shape=(self.units, self.units),
            initializer='orthogonal'
        )
        self.b_h = self.add_weight(
            shape=(self.units,),
            initializer='zeros'
        )

    def call(self, inputs, states):
        prev_h = states[0]
        # RNN computation
        h = tf.nn.tanh(tf.matmul(inputs, self.W_xh) +
                       tf.matmul(prev_h, self.W_hh) + self.b_h)
        return h, [h]  # output, new_states

# Wrap the cell to process whole sequences, e.g. tf.keras.layers.RNN(CustomRNNCell(128), return_sequences=True)

Language Representation for Neural Networks
The Embedding Challenge
Neural networks process numerical vectors, but language consists of discrete symbols. Converting words to numbers requires careful consideration of semantic relationships.
Word Embeddings: From Symbols to Vectors
One-Hot Encoding (Traditional Approach):
- Each word represented by a sparse vector
- Dimension equals vocabulary size
- Single ‘1’ at word’s index, zeros elsewhere
- Limitations: High dimensional, no semantic similarity
Learned Embeddings (Modern Approach):
- Dense, low-dimensional vectors (typically 50-300 dimensions)
- Learned during training to capture semantic relationships
- Similar words have similar vector representations
- Enables arithmetic operations: “king” - “man” + “woman” ≈ “queen”
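As a quick check on the "similar words, similar vectors" claim, the sketch below (hypothetical helper names; it assumes the trained embedding_layer defined in the next snippet and a word_index dictionary from a tokenizer) compares learned vectors with cosine similarity:

import numpy as np

def embedding_vector(word):
    weights = embedding_layer.get_weights()[0]  # (vocab_size, embedding_dim) matrix
    return weights[word_index[word]]            # row for this word's integer id

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# After training, related words tend to score higher than unrelated ones:
print(cosine_similarity(embedding_vector('king'), embedding_vector('queen')))
print(cosine_similarity(embedding_vector('king'), embedding_vector('carrot')))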
# Embedding layer in TensorFlow
embedding_layer = tf.keras.layers.Embedding(
    input_dim=vocab_size,          # Vocabulary size
    output_dim=embedding_dim,      # Embedding dimension
    input_length=sequence_length   # Fixed sequence length (optional)
)

# Usage in model
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(10000, 128, input_length=50),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])

Training RNNs: Backpropagation Through Time
Unrolling the Recurrence
Training RNNs requires “unrolling” the recurrent computation across time, creating a deep feedforward network where each layer corresponds to a time step.
Forward Pass Process:
- Initialize hidden state h₀
- For each time step t:
  - Compute hidden state h_t
  - Generate output ŷ_t
  - Calculate loss L_t
- Sum losses across all time steps
Backward Pass Process:
- Compute gradients for each time step
- Backpropagate errors through time
- Accumulate gradients for shared parameters
- Update weights using accumulated gradients
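The Keras training loop shown next hides the unrolling inside the model call; the following sketch (illustrative shapes, using tf.keras.layers.SimpleRNNCell) makes the per-time-step forward pass and the summed loss explicit, so the gradient tape backpropagates through every step:

import tensorflow as tf

# Illustrative unrolled forward/backward pass over toy data (not a full model).
batch, time_steps, input_dim, hidden_units, num_classes = 8, 20, 16, 32, 10
cell = tf.keras.layers.SimpleRNNCell(hidden_units)
readout = tf.keras.layers.Dense(num_classes)
loss_fn = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)

x = tf.random.normal((batch, time_steps, input_dim))
y = tf.random.uniform((batch, time_steps), maxval=num_classes, dtype=tf.int32)

with tf.GradientTape() as tape:
    h = [tf.zeros((batch, hidden_units))]       # initial hidden state h_0
    total_loss = 0.0
    for t in range(time_steps):                 # unroll across time
        out, h = cell(x[:, t, :], h)            # same cell (shared weights) at every step
        logits = readout(out)                   # ŷ_t
        total_loss += loss_fn(y[:, t], logits)  # accumulate per-step loss L_t

# Gradients flow back through every time step (backpropagation through time)
variables = cell.trainable_variables + readout.trainable_variables
grads = tape.gradient(total_loss, variables)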
# Training loop with gradient tape
optimizer = tf.keras.optimizers.Adam(learning_rate=0.001)

for batch in dataset:
    with tf.GradientTape() as tape:
        # Forward pass
        predictions = model(batch['inputs'])
        loss = loss_function(batch['targets'], predictions)

    # Backward pass
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))

The Vanishing Gradient Problem
Standard RNNs suffer from the vanishing gradient problem when processing long sequences:
- Root Cause: Gradients decay exponentially as they propagate backward through time
- Mathematical Explanation: Gradients are repeatedly multiplied by the weight matrix W_hh
- Consequence: Network cannot learn long-term dependencies effectively
Gradient Decay:
∂L_t/∂h_0 ∝ (W_hh)^t × ∂L_t/∂h_t
When the eigenvalues of W_hh have magnitude less than 1, gradients vanish exponentially with sequence length (and can explode when they are greater than 1).
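A quick numerical sketch (illustrative values only) makes the decay visible: repeatedly multiplying a gradient by a matrix whose spectral radius is below 1 drives its norm toward zero.

import numpy as np

# Illustrative only: repeated multiplication by a W_hh with spectral radius 0.9
rng = np.random.default_rng(0)
hidden_units = 32
W_hh = rng.normal(size=(hidden_units, hidden_units))
W_hh *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_hh)))  # rescale spectral radius to 0.9

grad = rng.normal(size=hidden_units)  # stand-in for ∂L/∂h_t
for t in (1, 10, 50, 100):
    g = np.linalg.matrix_power(W_hh.T, t) @ grad
    print(f"t = {t:3d}   |gradient| = {np.linalg.norm(g):.2e}")
# The norm shrinks roughly like 0.9^t, so contributions from distant time steps vanish.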
Long Short-Term Memory Networks (LSTMs)
Architectural Innovation
LSTMs solve the vanishing gradient problem through a sophisticated gating mechanism that controls information flow.
Core Components:
- Cell State (C_t): Long-term memory pathway
- Hidden State (h_t): Short-term memory and output
- Forget Gate: Decides what information to discard
- Input Gate: Controls new information storage
- Output Gate: Manages information retrieval
LSTM Mathematical Formulation
Forget Gate:
f_t = σ(W_f × [h_{t-1}, x_t] + b_f)
Input Gate:
i_t = σ(W_i × [h_{t-1}, x_t] + b_i)
C̃_t = tanh(W_C × [h_{t-1}, x_t] + b_C)
Cell State Update:
C_t = f_t ∗ C_{t-1} + i_t ∗ C̃_t
Output Gate:
o_t = σ(W_o × [h_{t-1}, x_t] + b_o)
h_t = o_t ∗ tanh(C_t)
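The Keras layer used below hides these equations; as a bridge from the math to code, here is a minimal single-step sketch (illustrative function and weight names, not the Keras implementation), where each W has shape (hidden_units + input_dim, hidden_units):

import tensorflow as tf

# One LSTM step following the gate equations above (illustrative only).
def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_C, W_o, b_f, b_i, b_C, b_o):
    concat = tf.concat([h_prev, x_t], axis=-1)       # [h_{t-1}, x_t]
    f_t = tf.sigmoid(tf.matmul(concat, W_f) + b_f)   # forget gate
    i_t = tf.sigmoid(tf.matmul(concat, W_i) + b_i)   # input gate
    c_tilde = tf.tanh(tf.matmul(concat, W_C) + b_C)  # candidate cell state C̃_t
    c_t = f_t * c_prev + i_t * c_tilde               # cell state update
    o_t = tf.sigmoid(tf.matmul(concat, W_o) + b_o)   # output gate
    h_t = o_t * tf.tanh(c_t)                         # new hidden state
    return h_t, c_t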
TensorFlow LSTM Implementation
# LSTM layer
lstm_layer = tf.keras.layers.LSTM(
    units=128,
    return_sequences=True,
    dropout=0.2,
    recurrent_dropout=0.2
)

# Bidirectional LSTM for capturing both forward and backward dependencies
bidirectional_lstm = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(64, return_sequences=True)
)

# Complete model for sequence-to-sequence tasks
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size, 256),
    tf.keras.layers.LSTM(512, return_sequences=True, dropout=0.3),
    tf.keras.layers.LSTM(512, dropout=0.3),
    tf.keras.layers.Dense(vocab_size, activation='softmax')
])

Attention Mechanisms: The Breakthrough Innovation
Limitations of RNN-based Architectures
Even LSTMs face fundamental constraints:
- Sequential Processing: Cannot parallelize across time steps
- Information Bottleneck: Final hidden state must encode entire sequence
- Long-range Dependencies: Still difficult for very long sequences
- Computational Efficiency: Training is inherently slow
The Attention Revolution
Core Insight: Instead of compressing entire sequences into fixed-size representations, allow the model to attend to different parts of the input as needed.
Key Innovations:
- Selective Focus: Attend to relevant information dynamically
- Parallelization: Process all positions simultaneously
- Long-range Dependencies: Direct connections between distant positions
- Interpretability: Attention weights provide insight into model decisions
Self-Attention Mechanism
Self-attention allows each position to attend to all positions in the sequence, including itself.
Mathematical Framework:
Attention(Q, K, V) = softmax(QK^T / √d_k)V
Where:
- Q (Query): What information are we looking for?
- K (Key): What information is available?
- V (Value): The actual information content
- d_k: Key dimension; attention scores are scaled by √d_k
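The multi-head attention class below relies on a scaled_dot_product_attention helper; a minimal version of that helper, implementing the formula above (the optional mask is assumed to contain 1s at positions to block), might look like this:

import tensorflow as tf

def scaled_dot_product_attention(q, k, v, mask=None):
    """Compute softmax(QK^T / sqrt(d_k))V for batched, multi-headed inputs."""
    matmul_qk = tf.matmul(q, k, transpose_b=True)              # (..., seq_len_q, seq_len_k)
    d_k = tf.cast(tf.shape(k)[-1], tf.float32)
    scaled_logits = matmul_qk / tf.math.sqrt(d_k)              # scale by sqrt(d_k)
    if mask is not None:
        scaled_logits += (mask * -1e9)                         # suppress masked positions
    attention_weights = tf.nn.softmax(scaled_logits, axis=-1)  # attention distribution
    return tf.matmul(attention_weights, v)                     # weighted sum of values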
Multi-Head Attention:
class MultiHeadAttention(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads):
        super(MultiHeadAttention, self).__init__()
        self.num_heads = num_heads
        self.d_model = d_model
        self.depth = d_model // self.num_heads

        self.wq = tf.keras.layers.Dense(d_model)
        self.wk = tf.keras.layers.Dense(d_model)
        self.wv = tf.keras.layers.Dense(d_model)
        self.dense = tf.keras.layers.Dense(d_model)

    def split_heads(self, x, batch_size):
        x = tf.reshape(x, (batch_size, -1, self.num_heads, self.depth))
        return tf.transpose(x, perm=[0, 2, 1, 3])

    def call(self, v, k, q, mask=None):
        batch_size = tf.shape(q)[0]

        # Generate Q, K, V
        q = self.wq(q)
        k = self.wk(k)
        v = self.wv(v)

        # Split into multiple heads
        q = self.split_heads(q, batch_size)
        k = self.split_heads(k, batch_size)
        v = self.split_heads(v, batch_size)

        # Scaled dot-product attention (helper defined above)
        scaled_attention = scaled_dot_product_attention(q, k, v, mask)

        # Concatenate heads
        scaled_attention = tf.transpose(scaled_attention, perm=[0, 2, 1, 3])
        concat_attention = tf.reshape(scaled_attention,
                                      (batch_size, -1, self.d_model))

        # Final linear transformation
        output = self.dense(concat_attention)
        return output

The Transformer Architecture
Complete Architecture Overview
Transformers combine self-attention with feedforward networks and normalization to create powerful sequence models.
Core Components:
- Multi-Head Self-Attention: Parallel attention mechanisms
- Position Encoding: Inject sequence order information
- Layer Normalization: Stabilize training
- Residual Connections: Enable deep architectures
- Feedforward Networks: Non-linear transformations
Positional Encoding
Since attention mechanisms have no inherent notion of order, transformers inject positional information:
import numpy as np

def positional_encoding(position, d_model):
    angle_rads = np.arange(position)[:, np.newaxis] / np.power(
        10000,
        (2 * (np.arange(d_model)[np.newaxis, :] // 2)) / np.float32(d_model))
    # Apply sin to even indices
    angle_rads[:, 0::2] = np.sin(angle_rads[:, 0::2])
    # Apply cos to odd indices
    angle_rads[:, 1::2] = np.cos(angle_rads[:, 1::2])
    pos_encoding = angle_rads[np.newaxis, ...]
    return tf.cast(pos_encoding, dtype=tf.float32)

Complete Transformer Implementation
class TransformerBlock(tf.keras.layers.Layer):
    def __init__(self, d_model, num_heads, dff, rate=0.1):
        super(TransformerBlock, self).__init__()
        self.att = MultiHeadAttention(d_model, num_heads)
        self.ffn = tf.keras.Sequential([
            tf.keras.layers.Dense(dff, activation='relu'),
            tf.keras.layers.Dense(d_model)
        ])
        self.layernorm1 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.layernorm2 = tf.keras.layers.LayerNormalization(epsilon=1e-6)
        self.dropout1 = tf.keras.layers.Dropout(rate)
        self.dropout2 = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        # Multi-head attention + residual connection + layer norm
        attn_output = self.att(x, x, x, mask)
        attn_output = self.dropout1(attn_output, training=training)
        out1 = self.layernorm1(x + attn_output)

        # Feedforward + residual connection + layer norm
        ffn_output = self.ffn(out1)
        ffn_output = self.dropout2(ffn_output, training=training)
        out2 = self.layernorm2(out1 + ffn_output)
        return out2

# Complete Transformer (encoder-only) model
class Transformer(tf.keras.Model):
    def __init__(self, num_layers, d_model, num_heads, dff,
                 input_vocab_size, maximum_position_encoding, rate=0.1):
        super(Transformer, self).__init__()
        self.d_model = d_model
        self.embedding = tf.keras.layers.Embedding(input_vocab_size, d_model)
        self.pos_encoding = positional_encoding(maximum_position_encoding, d_model)
        self.enc_layers = [TransformerBlock(d_model, num_heads, dff, rate)
                           for _ in range(num_layers)]
        self.dropout = tf.keras.layers.Dropout(rate)

    def call(self, x, training, mask):
        seq_len = tf.shape(x)[1]

        # Embedding + positional encoding
        x = self.embedding(x)
        x *= tf.math.sqrt(tf.cast(self.d_model, tf.float32))
        x += self.pos_encoding[:, :seq_len, :]
        x = self.dropout(x, training=training)

        # Pass through encoder layers
        for i in range(len(self.enc_layers)):
            x = self.enc_layers[i](x, training, mask)
        return x

Practical Applications and Examples
Language Modeling Example
# Prepare text data
def create_sequences(text, seq_length):
    sequences = []
    for i in range(len(text) - seq_length):
        sequences.append(text[i:i + seq_length + 1])
    return sequences

# Build dataset (sequences are assumed to already be encoded as integer token ids)
def build_dataset(sequences, tokenizer):
    dataset = tf.data.Dataset.from_generator(
        lambda: sequences,
        output_signature=tf.TensorSpec(shape=(None,), dtype=tf.int32)
    )
    dataset = dataset.map(lambda x: (x[:-1], x[1:]))  # Input/target pairs
    dataset = dataset.batch(32).prefetch(tf.data.AUTOTUNE)
    return dataset

# Training configuration
model = Transformer(
    num_layers=6,
    d_model=512,
    num_heads=8,
    dff=2048,
    input_vocab_size=vocab_size,
    maximum_position_encoding=1000
)

# CustomSchedule refers to the warmup learning-rate schedule from the original
# Transformer paper ("Attention Is All You Need"); its definition is not shown here.
optimizer = tf.keras.optimizers.Adam(
    learning_rate=CustomSchedule(d_model),
    beta_1=0.9,
    beta_2=0.98,
    epsilon=1e-9
)

Sequence Classification Example
# Sentiment analysis with LSTM
def build_sentiment_model(vocab_size, max_length):
    model = tf.keras.Sequential([
        tf.keras.layers.Embedding(vocab_size, 128, input_length=max_length),
        tf.keras.layers.LSTM(64, dropout=0.3, recurrent_dropout=0.3),
        tf.keras.layers.Dense(32, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(1, activation='sigmoid')
    ])
    model.compile(
        optimizer='adam',
        loss='binary_crossentropy',
        metrics=['accuracy']
    )
    return model

Comparing Architectures: RNNs vs LSTMs vs Transformers
| Aspect | RNNs | LSTMs | Transformers |
|---|---|---|---|
| Long-term Dependencies | Poor | Good | Excellent |
| Training Parallelization | Sequential | Sequential | Parallel |
| Memory Efficiency | Low | Medium | High (with optimizations) |
| Computational Speed | Slow | Medium | Fast (parallel) |
| Interpretability | Limited | Limited | High (attention weights) |
| Parameter Efficiency | High | Medium | Lower |
Modern Applications and Impact
Language Models and AI Applications
Pre-trained Language Models:
- BERT: Bidirectional understanding for many NLP tasks
- GPT Series: Autoregressive generation and instruction following
- T5: Text-to-text unified framework
- Modern LLMs: ChatGPT, Claude, Gemini built on transformer architectures
Real-world Applications:
- Machine translation (Google Translate, DeepL)
- Question answering systems
- Code generation (GitHub Copilot, CodeT5)
- Content creation and summarization
- Conversational AI assistants
Beyond Natural Language Processing
- Computer Vision: Vision Transformers (ViTs) for image classification
- Protein Folding: AlphaFold’s attention mechanisms
- Reinforcement Learning: Decision transformers for sequential decision making
- Time Series Analysis: Financial forecasting, sensor data processing
- Speech Recognition: End-to-end speech-to-text systems
Key Insights and Future Directions
Fundamental Principles
- Memory is Essential: Sequential data requires maintaining context across time
- Attention > Recurrence: Direct connections outperform sequential processing
- Parallelization Enables Scale: Training speed determines practical applicability
- Architecture Matters: Design choices significantly impact performance
Current Challenges and Research Directions
Efficiency Improvements:
- Linear attention mechanisms
- Sparse attention patterns
- Memory-efficient architectures
- Hardware-optimized implementations
Long Context Understanding:
- Extended context windows (100K+ tokens)
- Hierarchical attention mechanisms
- Memory-augmented transformers
- Retrieval-augmented generation
Multimodal Integration:
- Vision-language models
- Audio-text processing
- Cross-modal attention mechanisms
- Unified multimodal architectures
Summary and Next Steps
Sequence modeling represents one of the most impactful areas in modern deep learning. The evolution from RNNs to LSTMs to Transformers demonstrates the field’s rapid progress in solving fundamental computational challenges.
Key Takeaways:
- RNNs introduced the concept of memory in neural networks
- LSTMs solved the vanishing gradient problem through gating mechanisms
- Attention eliminated the need for sequential processing
- Transformers became the foundation for modern AI systems
The principles learned in sequence modeling extend far beyond natural language processing, influencing computer vision, reinforcement learning, and scientific computing.
Continue Your Deep Learning Journey
- Convolutional Neural Networks: Specialized architectures for spatial data
- Deep Generative Modeling: Creating new data with neural networks
- Deep Reinforcement Learning: Learning through interaction
- NVIDIA CUDA: From History to AI Revolution: Optimizing sequence model training
Related Resources
- Dopamine and Motivation Systems: Biological inspiration for sequential processing
- NVIDIA CUDA: From History to AI Revolution: The infrastructure enabling modern sequence models
This article is part of the MIT 6.S191 Deep Learning Series. Practice implementing these concepts with the course labs at introtodeeplearning.com.