Transformers: Language, Vision, and Diffusion
January 31, 2026
Introduction
The Transformer architecture revolutionized machine learning. Introduced in 2017 for machine translation, it quickly became the foundation for nearly all state-of-the-art models across NLP, computer vision, speech, and multimodal AI.
This article builds up a working understanding of transformers: the original architecture, how it was adapted for vision, and how it powers modern diffusion models. We'll see how the same core principles apply across wildly different domains.
The Original Transformer
Before Transformers: The RNN Era
Before 2017, sequence modeling relied on recurrent neural networks (RNNs). RNNs process sequences one step at a time, maintaining a hidden state that summarizes everything seen so far:

$$h_t = f(h_{t-1}, x_t)$$
This sequential nature created problems:
- No parallelization - must process tokens one by one
- Vanishing gradients - information degrades over long sequences
- Fixed-size hidden state creates an information bottleneck
LSTMs and GRUs improved gradient flow but didn't solve the fundamental sequential bottleneck.
Attention Is All You Need
Vaswani et al. (2017) proposed a radical idea: remove recurrence entirely, rely solely on attention mechanisms. The result was the Transformer, named for its ability to transform input sequences to output sequences through attention.
The Encoder-Decoder Architecture
The original Transformer follows an encoder-decoder structure for sequence-to-sequence tasks (e.g., translation).
Encoder: Processes the input sequence and creates contextualized representations.
Decoder: Generates the output sequence autoregressively, conditioned on encoder outputs.
Input Embeddings and Position Encoding
First, convert input tokens to embeddings of dimension $d_{\text{model}}$ (in the original paper, the embeddings are also scaled by $\sqrt{d_{\text{model}}}$).
Add positional encodings to inject position information; the original Transformer uses fixed sinusoids:

$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{\text{model}}}}\right)$$
Without positional encoding, attention is permutation-invariant and loses all order information.
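As a concrete reference, here is a minimal PyTorch sketch of the sinusoidal encodings (assuming an even $d_{\text{model}}$; many modern models instead use learned or rotary embeddings):

```python
import torch

def sinusoidal_positions(max_len: int, d_model: int) -> torch.Tensor:
    """Fixed sinusoidal position encodings, one row per position."""
    pos = torch.arange(max_len, dtype=torch.float32).unsqueeze(1)   # (max_len, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dimensions
    angles = pos / torch.pow(10000.0, i / d_model)                  # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# x = token_embeddings + sinusoidal_positions(seq_len, d_model)  # broadcast over the batch
```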
The Encoder Block
The encoder consists of $N$ identical layers (typically $N=6$). Each layer has two sub-layers:
1. Multi-Head Self-Attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$$

with queries, keys, and values all derived from the same input $X$: $Q = XW^Q$, $K = XW^K$, $V = XW^V$. Multiple heads allow attending to different representation subspaces:

$$\text{MultiHead}(X) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O, \qquad \text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$

2. Position-wise Feed-Forward Network:

$$\text{FFN}(x) = \max(0,\; xW_1 + b_1)\,W_2 + b_2$$

Applied independently to each position. Typically, the hidden dimension is $4 \times d_{\text{model}}$.
Residual Connections and Layer Normalization:
Each sub-layer uses a residual connection followed by layer normalization:

$$\text{output} = \text{LayerNorm}(x + \text{Sublayer}(x))$$
This helps gradient flow and stabilizes training.
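Putting the two sub-layers together, here is a minimal PyTorch sketch of one encoder layer. It uses the library's `nn.MultiheadAttention` and post-layer-norm as in the original paper; the dimensions are illustrative defaults, not prescribed values:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer: self-attention + FFN, each wrapped
    in a residual connection followed by layer normalization."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, d_ff: int = 2048, dropout: float = 0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1, self.norm2 = nn.LayerNorm(d_model), nn.LayerNorm(d_model)
        self.drop = nn.Dropout(dropout)

    def forward(self, x: torch.Tensor, key_padding_mask=None) -> torch.Tensor:
        # Sub-layer 1: multi-head self-attention (queries, keys, values all come from x)
        attn_out, _ = self.attn(x, x, x, key_padding_mask=key_padding_mask)
        x = self.norm1(x + self.drop(attn_out))
        # Sub-layer 2: position-wise feed-forward network
        return self.norm2(x + self.drop(self.ffn(x)))

# y = EncoderLayer()(torch.randn(2, 16, 512))   # (batch, seq_len, d_model) in, same shape out
```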
The Decoder Block
The decoder also has $N$ identical layers, but with three sub-layers:
1. Masked Multi-Head Self-Attention: Prevents positions from attending to future positions. Essential for autoregressive generation.
2. Cross-Attention: Attends to encoder outputs. Queries come from the decoder, keys and values from the encoder:

$$\text{Attention}(Q_{\text{dec}}, K_{\text{enc}}, V_{\text{enc}}) = \text{softmax}\!\left(\frac{Q_{\text{dec}} K_{\text{enc}}^\top}{\sqrt{d_k}}\right) V_{\text{enc}}$$
This allows the decoder to access relevant parts of the input.
3. Position-wise Feed-Forward: Same as in encoder.
Output Layer
A final linear layer projects to the vocabulary size, followed by a softmax:

$$P(y_t \mid y_{<t}, x) = \text{softmax}(h_t W_{\text{out}} + b)$$

where $h_t$ is the decoder output at position $t$.
Training
Trained with teacher forcing: use ground-truth previous tokens as decoder input, predict next token.
Loss is cross-entropy over the vocabulary:

$$\mathcal{L} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, x)$$
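A short sketch of how the shifted targets and loss are typically wired up (the `pad_id` and tensor shapes here are illustrative assumptions):

```python
import torch
import torch.nn.functional as F

def next_token_loss(logits: torch.Tensor, tokens: torch.Tensor, pad_id: int = 0) -> torch.Tensor:
    """Teacher forcing: the decoder is fed tokens[:, :-1] and trained to
    predict tokens[:, 1:], i.e. the same sequence shifted left by one.

    logits: (batch, seq_len - 1, vocab), produced from tokens[:, :-1]
    tokens: (batch, seq_len) ground-truth token ids
    """
    targets = tokens[:, 1:]
    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),   # (batch * (seq_len - 1), vocab)
        targets.reshape(-1),                   # (batch * (seq_len - 1),)
        ignore_index=pad_id,                   # don't penalize padding positions
    )
```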
Why Transformers Work
The Transformer's success comes from several factors:
Parallelization: All positions processed simultaneously, unlike RNNs. This enables scaling to large datasets and models.
Long-range dependencies: Direct connections between any two positions via attention. Information doesn't need to flow through many recurrent steps.
Flexible receptive field: Attention dynamically determines what to focus on, unlike fixed convolutional kernels.
Compositionality: Multiple layers allow hierarchical representations. Early layers capture local patterns, deeper layers capture global structure.
Variants: BERT and GPT
BERT: Encoder-Only Architecture
Devlin et al. (2019) introduced BERT (Bidirectional Encoder Representations from Transformers), using only the encoder stack.
Key innovation: Masked language modeling (MLM). Randomly mask 15% of input tokens and train the model to predict them:

$$\mathcal{L}_{\text{MLM}} = -\sum_{i \in \mathcal{M}} \log P(x_i \mid x_{\setminus \mathcal{M}})$$
where $\mathcal{M}$ is the set of masked positions.
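In practice, BERT does not replace every selected token with [MASK]: of the selected 15%, 80% become [MASK], 10% become a random token, and 10% are left unchanged. A rough PyTorch sketch (the `mask_id` argument and the `-100` ignore-label convention are assumptions of this example):

```python
import torch

def mask_tokens(input_ids: torch.Tensor, mask_id: int, vocab_size: int, p: float = 0.15):
    """Return (corrupted_inputs, labels); labels are -100 except at selected positions."""
    labels = input_ids.clone()
    corrupted = input_ids.clone()
    selected = torch.rand(input_ids.shape) < p
    labels[~selected] = -100                                   # loss only on selected positions

    use_mask = selected & (torch.rand(input_ids.shape) < 0.8)  # 80%: replace with [MASK]
    corrupted[use_mask] = mask_id

    use_rand = selected & ~use_mask & (torch.rand(input_ids.shape) < 0.5)  # 10%: random token
    corrupted[use_rand] = torch.randint(vocab_size, (int(use_rand.sum()),))
    return corrupted, labels                                   # remaining 10%: left unchanged
```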
Bidirectional context: Each token attends to all other tokens (no causal mask). This creates rich contextual representations.
BERT revolutionized NLP by showing that pre-training on unlabeled text followed by fine-tuning on downstream tasks achieves state-of-the-art across many tasks.
GPT: Decoder-Only Architecture
OpenAI's GPT series uses only the decoder stack, trained autoregressively to predict the next token:

$$\mathcal{L} = -\sum_{t} \log P(x_t \mid x_{<t})$$
Causal masking: Position $i$ only attends to positions $\leq i$. This enables left-to-right generation.
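A minimal sketch of how such a mask is typically constructed and applied to the attention scores (shape conventions are illustrative):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    """True marks positions that must NOT be attended to (future positions j > i)."""
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

# scores: (batch, heads, seq_len, seq_len), computed as Q @ K^T / sqrt(d_k)
# scores = scores.masked_fill(causal_mask(seq_len), float("-inf"))
# weights = scores.softmax(dim=-1)   # future positions receive zero weight
```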
Scaling laws: GPT-3 (175B parameters) showed that simply scaling model size, data, and compute leads to emergent capabilities like few-shot learning.
The GPT architecture became the foundation for large language models including GPT-4, Claude, and LLaMA.
Encoder-Decoder vs Decoder-Only
Different architectures suit different tasks:
Encoder-only (BERT): Understanding tasks (classification, NER, question answering).
Decoder-only (GPT): Generation tasks (text completion, creative writing, code generation).
Encoder-decoder (T5): Sequence-to-sequence tasks (translation, summarization). Treats all tasks as text-to-text.
Vision Transformers (ViT)
Adapting Transformers to Images
Dosovitskiy et al. (2021) showed that transformers can work directly on images, without convolutions. This was surprising - CNNs had dominated vision since the early 2010s thanks to strong inductive biases (locality, translation equivariance).
Image Patches as Tokens
The key idea: split an image into fixed-size patches and treat them as tokens.
For an image $x \in \mathbb{R}^{H \times W \times C}$ and patch size $P$, divide into $N = HW/P^2$ patches:

$$x \in \mathbb{R}^{H \times W \times C} \;\longrightarrow\; x_p \in \mathbb{R}^{N \times (P^2 \cdot C)}$$
Each patch is a vector of length $P^2 \cdot C$ (flattened pixel values).
Patch Embedding
Linearly project each patch to the model dimension $D$:

$$z_0 = [x_p^1 E;\; x_p^2 E;\; \ldots;\; x_p^N E]$$
where $E \in \mathbb{R}^{(P^2 \cdot C) \times D}$ is a learned projection matrix.
Class Token
ViT prepends a learnable class token to the sequence:

$$z_0 = [x_{\text{class}};\; x_p^1 E;\; x_p^2 E;\; \ldots;\; x_p^N E]$$

The class token aggregates information from all patches through self-attention. Its final representation is used for classification:

$$\hat{y} = \text{MLP}(\text{LN}(z_L^0))$$
where $z_L^0$ is the class token output from the final layer.
Position Embeddings
Add 1D learnable position embeddings to the patch embeddings:

$$z_0 \leftarrow z_0 + E_{\text{pos}}, \qquad E_{\text{pos}} \in \mathbb{R}^{(N+1) \times D}$$
Interestingly, 2D position embeddings don't significantly outperform 1D. The model learns 2D spatial relationships through attention.
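The whole input pipeline - patchify, project, prepend the class token, add position embeddings - fits in a few lines. A minimal PyTorch sketch (the strided convolution is a common implementation trick equivalent to flattening each $P \times P$ patch and applying $E$; it is not the only way to do it):

```python
import torch
import torch.nn as nn

class ViTEmbed(nn.Module):
    """Patch embedding + class token + learned 1D position embeddings."""

    def __init__(self, img_size: int = 224, patch_size: int = 16, in_chans: int = 3, dim: int = 768):
        super().__init__()
        num_patches = (img_size // patch_size) ** 2
        self.proj = nn.Conv2d(in_chans, dim, kernel_size=patch_size, stride=patch_size)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W) -> patch tokens (B, N, D)
        z = self.proj(x).flatten(2).transpose(1, 2)
        cls = self.cls_token.expand(x.shape[0], -1, -1)      # one class token per image
        z = torch.cat([cls, z], dim=1)                       # (B, N + 1, D)
        return z + self.pos_embed                            # add position embeddings
```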
Architecture
After embedding, ViT applies standard Transformer encoder layers (self-attention + FFN with residual connections and layer normalization).
Typical configurations:
- ViT-Base: 12 layers, 768 dim, 12 heads - 86M parameters
- ViT-Large: 24 layers, 1024 dim, 16 heads - 307M parameters
- ViT-Huge: 32 layers, 1280 dim, 16 heads - 632M parameters
Training
ViT requires large-scale pre-training to match CNNs. When trained on ImageNet-1K (1.3M images), ViT underperforms ResNets. But on larger datasets like ImageNet-21K (14M images) or JFT-300M, ViT excels.
The standard recipe is to pre-train on a large dataset with supervised image classification, then fine-tune on the target task. The lack of built-in inductive biases means ViT needs more data, but it also makes the architecture more flexible.
What ViT Learns
Analysis reveals:
Early layers: Attention tends to be local, similar to CNN receptive fields.
Middle layers: Attention becomes more global, integrating information across the image.
Late layers: Attention focuses on semantically relevant regions for the task.
The model learns CNN-like behaviors (locality, hierarchy) from data rather than having them built-in.
Advantages Over CNNs
Global receptive field: Every layer can attend to the entire image, unlike CNNs which build receptive fields gradually.
Scalability: Transformers scale better with model size and data than CNNs.
Interpretability: Attention maps show what the model focuses on.
Transfer learning: Pre-trained ViTs transfer extremely well to diverse downstream tasks.
Disadvantages
Data hungry: Need large datasets to match CNNs.
Computational cost: Quadratic complexity in number of patches. For high-resolution images, this becomes expensive.
Less sample efficient: CNNs work better with small datasets due to inductive biases.
Hybrid and Hierarchical Vision Transformers
Hybrid ViT
Combine CNN feature extraction with transformer processing. Use a CNN to extract features, then apply ViT to the feature maps instead of raw patches.
This provides CNN inductive biases while retaining transformer benefits.
Swin Transformer
Liu et al. (2021) introduced hierarchical vision transformers with shifted windows.
Window attention: Divide the image into non-overlapping windows and apply self-attention within each window. This reduces complexity from $O(N^2)$ to $O(N \cdot M)$, where $M$ is the number of patches per window - linear in $N$ when the window size is fixed.
Shifted windows: Alternate between regular and shifted window partitioning across layers. This allows cross-window connections while maintaining efficiency.
Hierarchical structure: Gradually merge patches to create multi-scale representations, similar to CNN feature pyramids.
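A sketch of the window partition underlying both steps (it assumes $H$ and $W$ are divisible by the window size; Swin pads otherwise):

```python
import torch

def window_partition(x: torch.Tensor, w: int) -> torch.Tensor:
    """Split a (B, H, W, C) feature map into non-overlapping w x w windows,
    giving (B * num_windows, w*w, C) token groups for per-window attention."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

# Shifted windows: roll the feature map by half a window before partitioning,
# so tokens near window borders get grouped with their former neighbors.
# x_shifted = torch.roll(x, shifts=(-w // 2, -w // 2), dims=(1, 2))
```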
Swin became the backbone for many vision tasks, achieving state-of-the-art on object detection, segmentation, and dense prediction.
Other Variants
DeiT: Data-efficient image transformers with improved training recipes and distillation.
BEiT: BERT-style pre-training for images using masked patch prediction.
MAE: Masked autoencoders that mask large portions (75%) of patches and reconstruct them.
Diffusion Transformers (DiT)
From U-Nets to Transformers
Diffusion models initially used U-Net architectures borrowed from image segmentation. U-Nets have:
- Encoder-decoder structure with skip connections
- Convolutional layers for spatial processing
- Self-attention at lower resolutions
Peebles & Xie (2023) asked: can we replace U-Nets entirely with transformers?
The DiT Architecture
DiT adapts ViT for diffusion models. Key differences from standard ViT:
Input: Noisy images (VAE latents, in the original DiT) at timestep $t$, conditioned on a class label $c$.
Patchify: Same as ViT - divide noisy image into patches and linearly embed.
Conditioning: Inject timestep and class information into the transformer.
Conditioning Mechanisms
DiT explores several ways to inject $t$ and $c$:
In-context conditioning: Append $t$ and $c$ as tokens to the sequence.
Cross-attention: Use $t$ and $c$ as keys/values in cross-attention layers.
Adaptive layer norm (AdaLN): Modulate the layer norm parameters based on $t$ and $c$:

$$\text{AdaLN}(x) = c_s \odot \text{LayerNorm}(x) + c_b$$
where $c_s$ (scale) and $c_b$ (bias) are computed from conditioning via an MLP.
AdaLN-Zero: Initialize the final layer of the conditioning MLP to zero, so that each residual block starts out as the identity function. This improves training stability.
AdaLN-Zero performs best and became the standard choice.
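A simplified PyTorch sketch of the idea, applied to a single feed-forward branch (real DiT blocks modulate both the attention and MLP sub-layers, and the `(1 + scale)` parameterization is common in DiT-style implementations; treat the details here as illustrative):

```python
import torch
import torch.nn as nn

class AdaLNZeroBlock(nn.Module):
    """Conditioning vector -> per-channel scale, shift, and a residual gate.
    The modulation MLP is zero-initialized, so the block starts as identity."""

    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(cond_dim, 3 * dim)
        nn.init.zeros_(self.to_mod.weight)   # adaLN-Zero initialization
        nn.init.zeros_(self.to_mod.bias)
        self.body = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # x: (B, N, dim) tokens; cond: (B, cond_dim) timestep + class embedding
        c_s, c_b, gate = self.to_mod(cond).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1 + c_s) + c_b            # adaptive layer norm
        return x + gate * self.body(h)                # gate is zero at initialization
```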
Training
Train to predict the noise added to images:

$$\mathcal{L} = \mathbb{E}_{x_0, \epsilon, t}\!\left[\, \lVert \epsilon - \epsilon_\theta(x_t, t, c) \rVert^2 \,\right]$$
Same as standard diffusion models, but using a transformer backbone instead of U-Net.
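A minimal training-step sketch under standard DDPM assumptions; the backbone signature `model(x_t, t, y)` returning predicted noise is an assumption of this example, not a fixed API:

```python
import torch
import torch.nn.functional as F

def diffusion_step(model, x0, y, alphas_cumprod):
    """One noise-prediction step: sample t and noise, form x_t, regress the noise.

    x0: (B, C, H, W) clean images (or latents); alphas_cumprod: (T,) noise schedule.
    """
    B = x0.shape[0]
    t = torch.randint(0, alphas_cumprod.shape[0], (B,), device=x0.device)   # random timesteps
    noise = torch.randn_like(x0)
    a_bar = alphas_cumprod[t].view(B, 1, 1, 1)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise                    # forward (noising) process
    return F.mse_loss(model(x_t, t, y), noise)                              # epsilon-prediction loss
```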
Scaling Properties
DiT reveals that transformers scale better than U-Nets for diffusion:
- Larger DiT models (more layers/parameters) consistently improve quality
- Scaling follows power laws similar to language models
- Training is more stable than U-Nets at large scale
DiT-XL/2 (675M parameters) achieved state-of-the-art FID on class-conditional ImageNet generation.
Why Transformers for Diffusion?
Scalability: Clear path to scaling - add more layers, more parameters.
Unification: Same architecture across modalities. Can leverage pre-training and transfer learning.
Long-range dependencies: Self-attention captures global coherence better than convolutions.
Flexibility: Easy to condition on various inputs (text, images, other modalities) via cross-attention.
Extensions
Text-to-image: Add text encoder and cross-attention to condition on captions.
Latent diffusion with DiT: Combine with VAE latent spaces for efficiency (used in some versions of Stable Diffusion).
Video DiT: Extend to 3D with temporal attention layers.
Common Patterns Across Domains
The Transformer Recipe
Despite different domains, successful transformers share common patterns:
1. Tokenization: Convert input to sequence of tokens (words, patches, latent codes).
2. Embedding: Project tokens to model dimension.
3. Position encoding: Add positional information (learned, sinusoidal, or RoPE).
4. Transformer layers: Stack of self-attention + FFN with residuals and normalization.
5. Output projection: Task-specific head (classification, generation, denoising).
Conditioning Strategies
Different tasks require different conditioning:
Language: Causal masking for generation, bidirectional for understanding.
Vision: Class token for classification, no masking for full image context.
Diffusion: AdaLN for timestep/class, cross-attention for text.
Scaling Principles
Across domains, transformers benefit from:
- More parameters (more layers, wider hidden dimensions)
- More data (pre-training on large datasets)
- More compute (larger batch sizes, longer training)
Performance scales predictably with these factors, following power laws.
Computational Considerations
Complexity Analysis
For sequence length $n$ and model dimension $d$:
- Self-attention: $O(n^2 d)$ time, $O(n^2)$ memory
- Feed-forward: $O(n d^2)$ time
For a typical configuration ($d = 768$, $n = 512$), the feed-forward and projection matmuls actually account for most of the compute; the quadratic attention term only starts to dominate once $n$ grows to several times $d$ (rough numbers below). For long sequences, the $O(n^2)$ time and memory of attention become prohibitive.
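Back-of-the-envelope per-layer FLOP counts (counting each multiply-accumulate as two FLOPs and ignoring softmax and normalization):

```python
n, d = 512, 768
attn_scores = 4 * n**2 * d      # QK^T plus the attention-weighted sum of V
attn_proj   = 8 * n * d**2      # Q, K, V and output projections
ffn         = 16 * n * d**2     # two matmuls with a 4x hidden expansion
print(attn_scores / 1e9, attn_proj / 1e9, ffn / 1e9)
# ~0.81, ~2.42, ~4.83 GFLOPs: the quadratic term is the smallest contribution
# here, and only overtakes the FFN once n reaches roughly 4 * d.
```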
Optimization Techniques
Flash Attention: IO-aware attention implementation, 2-4x speedup with reduced memory.
Gradient checkpointing: Trade compute for memory by recomputing activations during backward pass.
Mixed precision: Use FP16 or BF16 for most operations, FP32 for critical parts.
Efficient attention variants: Linear attention, sparse attention, windowed attention for longer sequences.
Hardware Considerations
Transformers are well-suited to modern accelerators:
- Matrix multiplications (attention, FFN) are highly parallelizable
- No sequential dependencies, so GPUs can be kept fully utilized
- Large batch sizes improve efficiency
However, memory bandwidth can become a bottleneck for attention.
Future Directions
Long Context
Extending to longer sequences remains a key challenge. Approaches include:
- More efficient attention mechanisms
- Hierarchical and sparse attention patterns
- State space models as attention alternatives
- Compression and retrieval-augmented methods
Multi-Modal Transformers
Unified architectures for vision, language, and audio are emerging:
- Flamingo: Vision-language with interleaved inputs
- CLIP: Contrastive vision-language pre-training
- Unified-IO: Single model for diverse vision and language tasks
Efficient Architectures
Alternatives to standard transformers:
- Mixture of Experts (MoE) for conditional computation
- State space models (Mamba) with linear complexity
- Hybrid architectures combining transformers with other components
Conclusion
The Transformer architecture, introduced for machine translation in 2017, has become the foundation of modern AI. Its success stems from a simple but powerful idea: attention-based information routing combined with massive scale.
Key principles:
- Self-attention enables flexible, long-range information flow
- Parallelization allows scaling to billions of parameters
- The same architecture adapts across domains with minimal modification
- Scaling model size, data, and compute leads to consistent improvements
From language (BERT, GPT) to vision (ViT, Swin) to generation (DiT), transformers have conquered AI. The architecture's flexibility and scalability make it the default choice for new domains and tasks.
While challenges remain - especially computational cost for long sequences - transformers are likely to remain central to AI for years to come. Understanding their architecture and design principles is essential for anyone working in modern machine learning.