Transformers: Language, Vision, and Diffusion

January 31, 2026

Introduction

The Transformer architecture revolutionized machine learning. Introduced in 2017 for machine translation, it quickly became the foundation for nearly all state-of-the-art models across NLP, computer vision, speech, and multimodal AI.

This article provides a comprehensive overview of transformers: the original architecture, how it was adapted for vision, and how it powers modern diffusion models. We'll see how the same core principles apply across wildly different domains.

The Original Transformer

Before Transformers: The RNN Era

Before 2017, sequence modeling relied on recurrent neural networks (RNNs). RNNs process sequences one step at a time, maintaining a hidden state:

$$h_t = f(h_{t-1}, x_t)$$

This sequential nature created problems: computation cannot be parallelized across time steps, gradients must propagate through every step (leading to vanishing or exploding gradients), and information from distant tokens is easily lost.

LSTMs and GRUs improved gradient flow but didn't solve the fundamental sequential bottleneck.

Attention Is All You Need

Vaswani et al. (2017) proposed a radical idea: remove recurrence entirely and rely solely on attention mechanisms. The result was the Transformer, named for its ability to transform input sequences to output sequences through attention.

The Encoder-Decoder Architecture

The original Transformer follows an encoder-decoder structure for sequence-to-sequence tasks (e.g., translation).

Encoder: Processes the input sequence and creates contextualized representations.

Decoder: Generates the output sequence autoregressively, conditioned on encoder outputs.

Input Embeddings and Position Encoding

First, convert input tokens to embeddings:

$$E = \text{Embedding}(x)$$

Add positional encodings to inject position information:

$$X^{(0)} = E + PE$$

Without positional encoding, attention is permutation-invariant and loses all order information.
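
As a concrete illustration, here is a minimal sketch of the sinusoidal positional encoding from the original paper added to token embeddings. The vocabulary size, sequence length, and `d_model` below are illustrative, not values from the text:

```python
import torch

def sinusoidal_positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    """Build the (max_len, d_model) sinusoidal position encoding table."""
    position = torch.arange(max_len).unsqueeze(1)                      # (max_len, 1)
    div_term = torch.exp(
        torch.arange(0, d_model, 2) * (-torch.log(torch.tensor(10000.0)) / d_model)
    )
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(position * div_term)                       # even dimensions
    pe[:, 1::2] = torch.cos(position * div_term)                       # odd dimensions
    return pe

# X^(0) = E + PE: inject position information into the token embeddings
embedding = torch.nn.Embedding(num_embeddings=32000, embedding_dim=512)
tokens = torch.randint(0, 32000, (2, 10))           # (batch, seq_len)
E = embedding(tokens)                                # (2, 10, 512)
X0 = E + sinusoidal_positional_encoding(10, 512)     # broadcast over the batch
```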

The Encoder Block

The encoder consists of $N$ identical layers (typically $N=6$). Each layer has two sub-layers:

1. Multi-Head Self-Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

with queries, keys, and values all derived from the same input:

$$Q = XW^Q, \quad K = XW^K, \quad V = XW^V$$

Multiple heads allow attending to different representation subspaces:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h) W^O$$

2. Position-wise Feed-Forward Network:

$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$

Applied independently to each position. Typically, the hidden dimension is $4 \times d_{\text{model}}$.

Residual Connections and Layer Normalization:

Each sub-layer uses a residual connection followed by layer normalization:

$$\text{LayerNorm}(x + \text{Sublayer}(x))$$

This helps gradient flow and stabilizes training.
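
Putting the two sub-layers together, here is a minimal post-norm encoder layer sketch in PyTorch. It mirrors the formulas above rather than any particular library implementation, and the dimensions are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One encoder layer: multi-head self-attention + position-wise FFN,
    each wrapped in a residual connection followed by LayerNorm (post-norm)."""
    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),   # W1, b1 (hidden dim = 4 x d_model)
            nn.ReLU(),
            nn.Linear(d_ff, d_model),   # W2, b2
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Sub-layer 1: LayerNorm(x + SelfAttention(x)), with Q = K = V = x
        attn_out, _ = self.self_attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Sub-layer 2: LayerNorm(x + FFN(x))
        x = self.norm2(x + self.ffn(x))
        return x

layer = EncoderLayer()
x = torch.randn(2, 16, 512)        # (batch, seq_len, d_model)
print(layer(x).shape)              # torch.Size([2, 16, 512])
```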

The Decoder Block

The decoder also has $N$ identical layers, but with three sub-layers:

1. Masked Multi-Head Self-Attention: Prevents positions from attending to future positions. Essential for autoregressive generation.

2. Cross-Attention: Attends to encoder outputs. Queries come from decoder, keys and values from encoder:

$$Q = X_{\text{dec}}W^Q, \quad K = X_{\text{enc}}W^K, \quad V = X_{\text{enc}}W^V$$

This allows the decoder to access relevant parts of the input.

3. Position-wise Feed-Forward: Same as in encoder.
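
A sketch of how the masked self-attention and cross-attention above can be wired up with PyTorch's built-in attention module. Shapes are illustrative, and a full decoder layer would also add residuals, layer normalization, and the feed-forward network:

```python
import torch
import torch.nn as nn

d_model, n_heads = 512, 8
self_attn  = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

dec = torch.randn(2, 7, d_model)    # decoder states so far (batch, tgt_len, d_model)
enc = torch.randn(2, 12, d_model)   # encoder outputs       (batch, src_len, d_model)

# Causal mask: position i may only attend to positions <= i
tgt_len = dec.size(1)
causal_mask = torch.triu(torch.ones(tgt_len, tgt_len, dtype=torch.bool), diagonal=1)

# 1) Masked self-attention over previously generated tokens
h, _ = self_attn(dec, dec, dec, attn_mask=causal_mask)

# 2) Cross-attention: queries from the decoder, keys/values from the encoder
h, _ = cross_attn(h, enc, enc)
```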

Output Layer

A final linear layer projects to vocabulary size, followed by softmax:

$$P(y_t | y_{<t}, x) = \text{softmax}(h_t W_{\text{out}} + b)$$

Training

Trained with teacher forcing: use ground-truth previous tokens as decoder input, predict next token.

Loss is cross-entropy over the vocabulary:

$$\mathcal{L} = -\sum_{t=1}^T \log P(y_t | y_{<t}, x)$$
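
A minimal sketch of this loss for a batch of target sequences, assuming a model that returns per-position logits over the vocabulary; the random logits, `vocab_size`, and the shift-by-one indexing are the standard teacher-forcing setup rather than code from the paper:

```python
import torch
import torch.nn.functional as F

vocab_size = 32000
batch, seq_len = 2, 10
logits = torch.randn(batch, seq_len, vocab_size)        # stand-in for model(x, y) outputs
targets = torch.randint(0, vocab_size, (batch, seq_len))

# Teacher forcing: at step t the model sees the ground-truth tokens y_{<t}
# and is trained to predict y_t. Cross-entropy over the vocabulary:
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),   # predictions at positions 0..T-2
    targets[:, 1:].reshape(-1),               # next-token targets at positions 1..T-1
)
```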

Why Transformers Work

The Transformer's success comes from several factors:

Parallelization: All positions processed simultaneously, unlike RNNs. This enables scaling to large datasets and models.

Long-range dependencies: Direct connections between any two positions via attention. Information doesn't need to flow through many recurrent steps.

Flexible receptive field: Attention dynamically determines what to focus on, unlike fixed convolutional kernels.

Compositionality: Multiple layers allow hierarchical representations. Early layers capture local patterns, deeper layers capture global structure.

Variants: BERT and GPT

BERT: Encoder-Only Architecture

Devlin et al. (2019) introduced BERT (Bidirectional Encoder Representations from Transformers), using only the encoder stack.

Key innovation: Masked language modeling (MLM). Randomly mask 15% of input tokens, train model to predict them:

$$\mathcal{L}_{\text{MLM}} = -\mathbb{E}_{x \sim D} \sum_{i \in \mathcal{M}} \log P(x_i | x_{\setminus \mathcal{M}})$$

where $\mathcal{M}$ is the set of masked positions.
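
A simplified sketch of the MLM objective: select roughly 15% of positions, replace them with a [MASK] id, and compute cross-entropy only on those positions. The token ids are illustrative, and BERT's 80/10/10 replacement rule (mask / random token / keep) is omitted for brevity:

```python
import torch
import torch.nn.functional as F

vocab_size, mask_token_id = 30522, 103          # BERT-style ids, illustrative
tokens = torch.randint(1000, vocab_size, (2, 16))    # (batch, seq_len)

# Choose ~15% of positions to predict
mask = torch.rand(tokens.shape) < 0.15
inputs = tokens.clone()
inputs[mask] = mask_token_id                    # replace chosen positions with [MASK]

logits = torch.randn(2, 16, vocab_size)         # stand-in for model(inputs)

# Loss only over masked positions (others are ignored via ignore_index)
labels = tokens.clone()
labels[~mask] = -100
loss = F.cross_entropy(logits.view(-1, vocab_size), labels.view(-1), ignore_index=-100)
```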

Bidirectional context: Each token attends to all other tokens (no causal masking). This creates rich contextual representations.

BERT revolutionized NLP by showing that pre-training on unlabeled text followed by fine-tuning on downstream tasks achieves state-of-the-art across many tasks.

GPT: Decoder-Only Architecture

OpenAI's GPT series uses only the decoder stack, trained autoregressively to predict the next token:

$$\mathcal{L} = -\sum_{t=1}^T \log P(x_t | x_{<t})$$

Causal masking: Position $i$ only attends to positions $\leq i$. This enables left-to-right generation.

Scaling laws: GPT-3 (175B parameters) showed that simply scaling model size, data, and compute leads to emergent capabilities like few-shot learning.

The GPT architecture became the foundation for large language models including GPT-4, Claude, and LLaMA.

Encoder-Decoder vs Decoder-Only

Different architectures suit different tasks:

Encoder-only (BERT): Understanding tasks (classification, NER, question answering).

Decoder-only (GPT): Generation tasks (text completion, creative writing, code generation).

Encoder-decoder (T5): Sequence-to-sequence tasks (translation, summarization). Treats all tasks as text-to-text.

Vision Transformers (ViT)

Adapting Transformers to Images

Dosovitskiy et al. (2021) showed that transformers can work directly on images, without convolutions. This was surprising: CNNs had dominated vision for years, thanks to strong inductive biases (locality, translation equivariance).

Image Patches as Tokens

The key idea: split an image into fixed-size patches and treat them as tokens.

For an image $x \in \mathbb{R}^{H \times W \times C}$ and patch size $P$, divide into $N = HW/P^2$ patches:

$$x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$$

Each patch is a vector of length $P^2 \cdot C$ (flattened pixel values).

Patch Embedding

Linearly project each patch to the model dimension $D$:

$$z_0 = [x_p^1 E; x_p^2 E; \ldots; x_p^N E]$$

where $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is a learned projection matrix.
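
In practice, the patchify-and-project step is usually implemented as a single convolution whose kernel size and stride both equal the patch size. A minimal sketch, with illustrative image size, patch size, and model dimension:

```python
import torch
import torch.nn as nn

P, C, D = 16, 3, 768                          # patch size, channels, model dimension
patch_embed = nn.Conv2d(C, D, kernel_size=P, stride=P)   # equivalent to flatten + project by E

images = torch.randn(2, C, 224, 224)          # (batch, C, H, W)
z = patch_embed(images)                       # (2, D, 14, 14): one D-dim vector per patch
z = z.flatten(2).transpose(1, 2)              # (2, N, D) with N = HW / P^2 = 196
print(z.shape)                                # torch.Size([2, 196, 768])
```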

Class Token

ViT prepends a learnable class token to the sequence:

$$z_0 = [x_{\text{class}}; x_p^1 E; x_p^2 E; \ldots; x_p^N E]$$

The class token aggregates information from all patches through self-attention. Its final representation is used for classification:

$$y = \text{MLP}(z_L^0)$$

where $z_L^0$ is the class token output from the final layer.

Position Embeddings

Add 1D learnable position embeddings to patch embeddings:

$$z_0 = z_0 + E_{\text{pos}}$$

Interestingly, 2D position embeddings don't significantly outperform 1D. The model learns 2D spatial relationships through attention.
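
Continuing the patch-embedding sketch above, the class token and learnable 1D position embeddings are prepended and added before the encoder stack. The parameter names, zero initialization, and the single linear classification head are illustrative simplifications:

```python
import torch
import torch.nn as nn

N, D, num_classes = 196, 768, 1000
cls_token = nn.Parameter(torch.zeros(1, 1, D))        # learnable [class] token
pos_embed = nn.Parameter(torch.zeros(1, N + 1, D))    # learnable 1D position embeddings
head = nn.Linear(D, num_classes)                      # classification head

z = torch.randn(2, N, D)                              # patch embeddings from the previous sketch
z = torch.cat([cls_token.expand(2, -1, -1), z], dim=1)   # prepend class token -> (2, N+1, D)
z = z + pos_embed                                     # add position information

# ... z would now pass through L standard Transformer encoder layers ...
y = head(z[:, 0])                                     # classify from the class token z_L^0
print(y.shape)                                        # torch.Size([2, 1000])
```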

Architecture

After embedding, ViT applies standard Transformer encoder layers (self-attention + FFN with residual connections and layer normalization).

Typical configurations:

ViT-Base: 12 layers, hidden size 768, 12 heads, ~86M parameters.

ViT-Large: 24 layers, hidden size 1024, 16 heads, ~307M parameters.

ViT-Huge: 32 layers, hidden size 1280, 16 heads, ~632M parameters.

The notation ViT-L/16 denotes the Large variant with a 16x16 patch size.

Training

ViT requires large-scale pre-training to match CNNs. When trained on ImageNet-1K (1.3M images), ViT underperforms ResNets. But on larger datasets like ImageNet-21K (14M images) or JFT-300M, ViT excels.

Pre-train on large dataset with image classification, then fine-tune on target task. The lack of inductive bias means more data is needed but also more flexibility.

What ViT Learns

Analysis reveals:

Early layers: Attention tends to be local, similar to CNN receptive fields.

Middle layers: Attention becomes more global, integrating information across the image.

Late layers: Attention focuses on semantically relevant regions for the task.

The model learns CNN-like behaviors (locality, hierarchy) from data rather than having them built-in.

Advantages Over CNNs

Global receptive field: Every layer can attend to the entire image, unlike CNNs which build receptive fields gradually.

Scalability: Transformers scale better with model size and data than CNNs.

Interpretability: Attention maps show what the model focuses on.

Transfer learning: Pre-trained ViTs transfer extremely well to diverse downstream tasks.

Disadvantages

Data hungry: Need large datasets to match CNNs.

Computational cost: Quadratic complexity in number of patches. For high-resolution images, this becomes expensive.

Less sample efficient: CNNs work better with small datasets due to inductive biases.

Hybrid and Hierarchical Vision Transformers

Hybrid ViT

Combine CNN feature extraction with transformer processing. Use a CNN to extract features, then apply ViT to the feature maps instead of raw patches.

This provides CNN inductive biases while retaining transformer benefits.

Swin Transformer

Liu et al. (2021) introduced hierarchical vision transformers with shifted windows.

Window attention: Divide the image into non-overlapping windows and apply self-attention within each window. This reduces complexity from $O(N^2)$ (quadratic in the number of patches $N$) to $O(N \cdot M^2)$, which is linear in $N$, where $M$ is the window size in patches per side.

Shifted windows: Alternate between regular and shifted window partitioning across layers. This allows cross-window connections while maintaining efficiency.

Hierarchical structure: Gradually merge patches to create multi-scale representations, similar to CNN feature pyramids.
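
A sketch of the window-partitioning step: the feature map is reshaped so that self-attention runs independently inside each $M \times M$ window. A real Swin block also handles the cyclic shift and relative position bias, which are omitted here, and the feature-map size below is illustrative:

```python
import torch

def window_partition(x: torch.Tensor, M: int) -> torch.Tensor:
    """(B, H, W, C) -> (B * num_windows, M*M, C): tokens grouped per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

x = torch.randn(2, 56, 56, 96)           # stage-1 feature map (illustrative)
windows = window_partition(x, M=7)       # (2 * 64, 49, 96): attention runs per window
print(windows.shape)
```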

Swin became the backbone for many vision tasks, achieving state-of-the-art on object detection, segmentation, and dense prediction.

Other Variants

DeiT: Data-efficient image transformers with improved training recipes and distillation.

BEiT: BERT-style pre-training for images using masked patch prediction.

MAE: Masked autoencoders that mask large portions (75%) of patches and reconstruct them.

Diffusion Transformers (DiT)

From U-Nets to Transformers

Diffusion models initially used U-Net architectures borrowed from image segmentation. U-Nets have a convolutional encoder-decoder structure: a downsampling path, an upsampling path, and skip connections between matching resolutions, which provide strong spatial inductive biases.

Peebles & Xie (2023) asked: can we replace U-Nets entirely with transformers?

The DiT Architecture

DiT adapts ViT for diffusion models. Key differences from standard ViT:

Input: Noisy images at timestep $t$, conditioned on class label $c$.

Patchify: Same as ViT - divide noisy image into patches and linearly embed.

Conditioning: Inject timestep and class information into the transformer.

Conditioning Mechanisms

DiT explores several ways to inject $t$ and $c$:

In-context conditioning: Append $t$ and $c$ as tokens to the sequence.

Cross-attention: Use $t$ and $c$ as keys/values in cross-attention layers.

Adaptive layer norm (AdaLN): Modulate layer norm parameters based on $t$ and $c$:

$$\text{AdaLN}(h, c) = c_s \cdot \text{LayerNorm}(h) + c_b$$

where $c_s$ (scale) and $c_b$ (bias) are computed from conditioning via an MLP.

AdaLN-Zero: Initialize the final layer of the conditioning MLP to zero, so that each residual block initially acts as the identity function. This improves training stability.

AdaLN-Zero performs best and has become the standard choice.
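
A sketch of AdaLN-Zero modulation for a single residual branch: an MLP maps the conditioning embedding (timestep plus class) to shift, scale, and gate parameters, and its final linear layer is zero-initialized so the branch starts as the identity. The names are illustrative, and DiT's actual block modulates both the attention and MLP branches rather than the single branch shown here:

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """One residual branch modulated by a conditioning vector (sketch)."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        # Produces (shift, scale, gate); zero-init makes the block start as the identity
        self.ada = nn.Sequential(nn.SiLU(), nn.Linear(d_model, 3 * d_model))
        nn.init.zeros_(self.ada[-1].weight)
        nn.init.zeros_(self.ada[-1].bias)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        shift, scale, gate = self.ada(cond).chunk(3, dim=-1)          # (B, d_model) each
        h = self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        h, _ = self.attn(h, h, h)
        return x + gate.unsqueeze(1) * h           # gate = 0 at init -> identity mapping

block = AdaLNZeroBlock(384)
x = torch.randn(2, 256, 384)        # noisy-image patch tokens
cond = torch.randn(2, 384)          # embedding of (t, c)
print(block(x, cond).shape)
```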

Training

Train to predict noise added to images:

$$\mathcal{L} = \mathbb{E}_{t, x_0, \epsilon, c} \left[ \| \epsilon - \epsilon_\theta(x_t, t, c) \|^2 \right]$$

Same as standard diffusion models, but using a transformer backbone instead of U-Net.
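
A sketch of one training step under these assumptions: a standard DDPM-style linear noise schedule and a model `eps_model(x_t, t, c)` that returns a predicted noise tensor. Both are stand-ins for illustration, not the paper's code:

```python
import torch
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)                  # linear noise schedule
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative product \bar{alpha}_t

def training_step(eps_model, x0, c):
    """One noise-prediction step: predict the noise added at a random timestep t.
    x0 is a batch of clean images (B, C, H, W); c is the class conditioning."""
    B = x0.size(0)
    t = torch.randint(0, T, (B,))                      # random timestep per sample
    eps = torch.randn_like(x0)                         # target noise
    ab = alpha_bar[t].view(B, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * eps       # forward diffusion q(x_t | x_0)
    eps_pred = eps_model(x_t, t, c)                    # transformer backbone (DiT)
    return F.mse_loss(eps_pred, eps)                   # || eps - eps_theta(x_t, t, c) ||^2
```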

Scaling Properties

DiT reveals that transformers scale better than U-Nets for diffusion: increasing model depth and width, or increasing the number of tokens by using smaller patches, consistently improves sample quality, and FID correlates strongly with model compute (Gflops).

DiT-XL/2 (675M parameters) achieved state-of-the-art FID on class-conditional ImageNet generation at the time of publication.

Why Transformers for Diffusion?

Scalability: Clear path to scaling - add more layers, more parameters.

Unification: Same architecture across modalities. Can leverage pre-training and transfer learning.

Long-range dependencies: Self-attention captures global coherence better than convolutions.

Flexibility: Easy to condition on various inputs (text, images, other modalities) via cross-attention.

Extensions

Text-to-image: Add text encoder and cross-attention to condition on captions.

Latent diffusion with DiT: Combine with VAE latent spaces for efficiency (used in some versions of Stable Diffusion).

Video DiT: Extend to 3D with temporal attention layers.

Common Patterns Across Domains

The Transformer Recipe

Despite different domains, successful transformers share common patterns:

1. Tokenization: Convert input to sequence of tokens (words, patches, latent codes).

2. Embedding: Project tokens to model dimension.

3. Position encoding: Add positional information (learned, sinusoidal, or RoPE).

4. Transformer layers: Stack of self-attention + FFN with residuals and normalization.

5. Output projection: Task-specific head (classification, generation, denoising).

Conditioning Strategies

Different tasks require different conditioning:

Language: Causal masking for generation, bidirectional for understanding.

Vision: Class token for classification, no masking for full image context.

Diffusion: AdaLN for timestep/class, cross-attention for text.

Scaling Principles

Across domains, transformers benefit from more parameters (deeper and wider models), more training data, and more compute.

Performance scales predictably with these factors, following power laws.

Computational Considerations

Complexity Analysis

For sequence length $n$ and model dimension $d$, self-attention costs $O(n^2 d)$ time (and $O(n^2)$ memory for the attention matrix), while the position-wise feed-forward network costs $O(n d^2)$.

For typical models ($d = 768$, $n = 512$), the feed-forward layers account for most of the compute; as sequences grow longer, the quadratic attention term takes over and eventually becomes prohibitive.
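
A quick back-of-the-envelope comparison of the two terms per layer. The constant factors (for example, the 4x FFN expansion) follow the conventions above; the absolute numbers are rough estimates, not measurements:

```python
# Rough per-layer FLOP counts (ignoring the factor of 2 for multiply-add)
def attention_flops(n, d):
    return 4 * n * d * d + 2 * n * n * d       # Q/K/V/output projections + QK^T and AV

def ffn_flops(n, d):
    return 2 * n * d * (4 * d)                 # two linear layers with 4x hidden dimension

d = 768
for n in (512, 2048, 16384):
    a, f = attention_flops(n, d), ffn_flops(n, d)
    print(f"n={n:6d}: attention {a/1e9:7.2f} GFLOPs, FFN {f/1e9:7.2f} GFLOPs")
# At n=512 the FFN term is larger; by n=16384 the quadratic attention term dominates.
```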

Optimization Techniques

Flash Attention: IO-aware attention implementation, 2-4x speedup with reduced memory.

Gradient checkpointing: Trade compute for memory by recomputing activations during backward pass.

Mixed precision: Use FP16 or BF16 for most operations, FP32 for critical parts.

Efficient attention variants: Linear attention, sparse attention, windowed attention for longer sequences.
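
A sketch of how some of these techniques appear in PyTorch code, assuming PyTorch 2.x. The fused `scaled_dot_product_attention` kernel dispatches to FlashAttention-style implementations when they are available; the tensor shapes and the tiny linear layers are illustrative:

```python
import torch
import torch.nn.functional as F
from torch.utils.checkpoint import checkpoint

# Fused, memory-efficient attention (FlashAttention-style kernels when available)
q = k = v = torch.randn(2, 8, 1024, 64)          # (batch, heads, seq_len, head_dim)
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)

# Mixed precision: run matmul-heavy ops in bfloat16 where supported
lin = torch.nn.Linear(64, 64)
x = torch.randn(2, 1024, 64)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    h = lin(x)

# Gradient checkpointing: recompute activations during the backward pass instead of storing them
x = torch.randn(2, 64, requires_grad=True)
y = checkpoint(torch.nn.Linear(64, 64), x, use_reentrant=False)
```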

Hardware Considerations

Transformers are well-suited to modern accelerators: their core operations are large dense matrix multiplications that map directly onto GPU/TPU tensor cores, and computation parallelizes across both the batch and sequence dimensions.

However, memory bandwidth can become a bottleneck for attention.

Future Directions

Long Context

Extending to longer sequences remains a key challenge. Approaches include sparse and linear attention, position encodings that extrapolate beyond the training length, retrieval and external memory, and hybrids with recurrent or state-space layers.

Multi-Modal Transformers

Unified architectures for vision, language, and audio are emerging: a shared transformer backbone consumes tokens from several modalities, with modality-specific tokenizers or encoders feeding a common sequence.

Efficient Architectures

Alternatives to standard transformers are being explored, including state-space models, linear-attention variants, and mixture-of-experts layers that decouple parameter count from per-token compute.

Conclusion

The Transformer architecture, introduced for machine translation in 2017, has become the foundation of modern AI. Its success stems from a simple but powerful idea: attention-based information routing combined with massive scale.

Key principles: tokenize the input, let attention route information globally, stabilize deep stacks with residual connections and normalization, and scale model, data, and compute together.

From language (BERT, GPT) to vision (ViT, Swin) to generation (DiT), transformers have conquered AI. The architecture's flexibility and scalability make it the default choice for new domains and tasks.

While challenges remain - especially computational cost for long sequences - transformers are likely to remain central to AI for years to come. Understanding their architecture and design principles is essential for anyone working in modern machine learning.