Conditional Generative Models

January 11, 2026

Introduction

Unconditional generation is impressive, but real applications demand control. We don't just want any face or video - we want a specific person, scene, or motion. Conditional generative models solve this by learning $p(x|c)$: the distribution of outputs $x$ given conditioning information $c$.

The conditioning signal $c$ can take many forms: class labels, text descriptions, reference images, sketches, poses, depth maps, or even other videos. The field has evolved from simple class conditioning to sophisticated multi-modal control, enabling applications from text-to-image synthesis to controllable video generation.

This article covers the complete landscape of conditional generation, from early image models to modern video synthesis systems.

Foundations of Conditional Generation

The Conditional Modeling Problem

Given conditioning information $c$ (e.g., a class label, text, or image), we want to model:

$$p_\theta(x | c)$$

The key challenge is to ensure the generated output $x$ is both high-quality and faithful to the condition $c$. Different model families handle this differently.

Conditioning Mechanisms

There are several ways to incorporate conditioning information into a generative model:

Concatenation: Simply concatenate $c$ with the input or latent code. For images, this might mean concatenating a class embedding to the noise vector $z$.

Conditional batch normalization: Modulate the batch normalization parameters based on $c$. This allows $c$ to control the statistics of feature maps throughout the network.

Cross-attention: For structured conditions like text, use cross-attention mechanisms where image features attend to text embeddings.

Adaptive layer normalization (AdaLN): Scale and shift normalization parameters based on conditioning, widely used in diffusion transformers.
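
To make the mechanism concrete, here is a minimal PyTorch sketch of AdaLN-style conditioning, where a conditioning embedding predicts a per-channel scale and shift applied after layer normalization. The module name, dimensions, and shapes are illustrative assumptions rather than a specific published implementation.

```python
import torch
import torch.nn as nn

class AdaLN(nn.Module):
    """Adaptive layer norm: scale and shift predicted from a conditioning embedding."""
    def __init__(self, dim, cond_dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, x, cond):
        # x: (batch, tokens, dim), cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(x) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

x = torch.randn(2, 16, 64)   # image tokens
c = torch.randn(2, 128)      # conditioning embedding (e.g., a class or text summary vector)
out = AdaLN(64, 128)(x, c)   # same shape as x
```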

Conditional GANs

Class-Conditional GANs

Mirza & Osindero (2014) introduced the conditional GAN (cGAN), extending the GAN framework to conditional generation. Both the generator and the discriminator receive the conditioning information:

$$G: (z, c) \to x$$
$$D: (x, c) \to [0, 1]$$

The discriminator must distinguish real (sample, condition) pairs from generated ones, while the generator tries to fool it:

$$\min_G \max_D \mathbb{E}_{x,c} [\log D(x, c)] + \mathbb{E}_{z,c} [\log(1 - D(G(z, c), c))]$$
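
A minimal sketch of this objective with toy MLP networks, where the class label is embedded and concatenated with the inputs of both networks; note that in practice the generator usually minimizes the non-saturating loss shown below rather than the minimax form.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy conditional GAN pieces: the class label is embedded and concatenated with
# the noise vector (generator) or the flattened sample (discriminator).
n_classes, z_dim, x_dim, emb_dim = 10, 64, 784, 32
embed = nn.Embedding(n_classes, emb_dim)
G = nn.Sequential(nn.Linear(z_dim + emb_dim, 256), nn.ReLU(), nn.Linear(256, x_dim))
D = nn.Sequential(nn.Linear(x_dim + emb_dim, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1))

def d_loss(x_real, c):
    e = embed(c)
    z = torch.randn(x_real.size(0), z_dim)
    x_fake = G(torch.cat([z, e], dim=1)).detach()
    real_logits = D(torch.cat([x_real, e], dim=1))
    fake_logits = D(torch.cat([x_fake, e], dim=1))
    return (F.binary_cross_entropy_with_logits(real_logits, torch.ones_like(real_logits)) +
            F.binary_cross_entropy_with_logits(fake_logits, torch.zeros_like(fake_logits)))

def g_loss(c):
    e = embed(c)
    z = torch.randn(c.size(0), z_dim)
    fake_logits = D(torch.cat([G(torch.cat([z, e], dim=1)), e], dim=1))
    return F.binary_cross_entropy_with_logits(fake_logits, torch.ones_like(fake_logits))

labels = torch.randint(0, n_classes, (8,))
loss_d, loss_g = d_loss(torch.rand(8, x_dim), labels), g_loss(labels)
```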

Auxiliary Classifier GANs (AC-GAN)

Odena et al. (2017) proposed AC-GAN, which adds a classification head to the discriminator. The discriminator must both detect fakes and correctly classify the condition:

$$\mathcal{L}_D = \mathcal{L}_{\text{src}} + \mathcal{L}_{\text{cls}}$$

where $\mathcal{L}_{\text{src}}$ is the real/fake loss and $\mathcal{L}_{\text{cls}}$ is the classification loss. This encourages the generator to produce class-discriminative features.

Projection Discriminator

Miyato & Koyama (2018) introduced a more principled conditioning approach via inner products. The discriminator computes:

$$D(x, y) = \phi(x)^\top \psi(y) + b$$

where $\phi(x)$ are image features and $\psi(y)$ are class embeddings. This projection-based conditioning proved more stable and effective than concatenation.
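
A sketch of the projection output head, assuming $\phi(x)$ has already been computed by a discriminator backbone (not shown); the unconditional term is produced by a linear layer on the features.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """D(x, y) = phi(x)^T psi(y) + b, with b produced by a linear layer on phi(x)."""
    def __init__(self, feat_dim, n_classes):
        super().__init__()
        self.psi = nn.Embedding(n_classes, feat_dim)   # class embeddings psi(y)
        self.linear = nn.Linear(feat_dim, 1)           # unconditional term

    def forward(self, phi_x, y):
        # phi_x: (batch, feat_dim) features from the discriminator backbone
        return self.linear(phi_x).squeeze(-1) + (phi_x * self.psi(y)).sum(dim=1)

head = ProjectionHead(feat_dim=128, n_classes=10)
logits = head(torch.randn(4, 128), torch.tensor([0, 1, 2, 3]))
```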

StyleGAN and Conditional Control

StyleGAN (Karras et al., 2019) introduced style-based generation, where a mapping network transforms $z \to w$, and the generator is conditioned on $w$ at each layer via adaptive instance normalization (AdaIN):

$$\text{AdaIN}(x, w) = w_s \frac{x - \mu(x)}{\sigma(x)} + w_b$$

where $w_s$ and $w_b$ are learned from $w$. This provides fine-grained control over generation at different scales, enabling techniques like style mixing and latent space editing.
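
A compact sketch of AdaIN on a convolutional feature map; the affine layers that map $w$ to $(w_s, w_b)$ are assumed to exist elsewhere and are not shown.

```python
import torch

def adain(x, w_s, w_b, eps=1e-5):
    # x: (batch, channels, height, width); w_s, w_b: (batch, channels)
    mu = x.mean(dim=(2, 3), keepdim=True)
    sigma = x.std(dim=(2, 3), keepdim=True)
    x_norm = (x - mu) / (sigma + eps)          # per-sample, per-channel instance stats
    return w_s[:, :, None, None] * x_norm + w_b[:, :, None, None]

feat = torch.randn(2, 64, 32, 32)
styled = adain(feat, torch.rand(2, 64) + 1.0, torch.zeros(2, 64))
```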

Text-to-Image Generation

Early Approaches: Text Encodings

Text conditioning requires encoding variable-length descriptions into fixed representations. Early methods used recurrent or character-level convolutional text encoders to produce a single sentence embedding; these representations were then concatenated with noise or used for conditional batch normalization.

Attention-Based Text Conditioning

AttnGAN (Xu et al., 2018) introduced cross-attention between image features and word embeddings. At each layer, image features attend to relevant words:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d}}\right) V$$

where $Q$ comes from image features and $K, V$ from text embeddings. This allows fine-grained alignment between visual regions and words.
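
A minimal cross-attention sketch in PyTorch, with queries from flattened image features and keys/values from word embeddings; the dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossAttention(nn.Module):
    def __init__(self, img_dim, txt_dim, d):
        super().__init__()
        self.to_q = nn.Linear(img_dim, d)
        self.to_k = nn.Linear(txt_dim, d)
        self.to_v = nn.Linear(txt_dim, d)
        self.scale = d ** -0.5

    def forward(self, img_feats, txt_feats):
        # img_feats: (B, N_pixels, img_dim), txt_feats: (B, N_words, txt_dim)
        q, k, v = self.to_q(img_feats), self.to_k(txt_feats), self.to_v(txt_feats)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)
        return attn @ v   # (B, N_pixels, d): each location is a mixture of word features

x = torch.randn(2, 64 * 64, 320)   # flattened spatial features
t = torch.randn(2, 77, 768)        # word embeddings
out = CrossAttention(320, 768, 320)(x, t)
```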

DALL-E and Discrete Representations

DALL-E (Ramesh et al., 2021) took a different approach: learn discrete image tokens with a discrete VAE (a VQ-VAE-style autoencoder), then use an autoregressive transformer to model the joint distribution of text and image tokens:

$$p(x | c) = \prod_{i=1}^{n} p(x_i | x_{<i}, c)$$

This allowed leveraging the power of large language models for image generation, achieving remarkable compositionality and zero-shot generalization.
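
A sketch of the corresponding training loss: text and image tokens are concatenated into one sequence, and a causal transformer (assumed here and passed in as `model`) is trained with next-token prediction.

```python
import torch
import torch.nn.functional as F

def autoregressive_loss(model, text_tokens, image_tokens, text_vocab_size=1000):
    """Next-token loss over a joint [text; image] token sequence.

    `model` is assumed to be a causal (decoder-only) transformer that maps a token
    sequence of shape (B, L) to logits of shape (B, L, vocab). Image tokens are
    offset by the text vocabulary size so the two token spaces do not collide.
    """
    seq = torch.cat([text_tokens, image_tokens + text_vocab_size], dim=1)
    logits = model(seq[:, :-1])        # teacher forcing: predict token i from tokens < i
    targets = seq[:, 1:]
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
```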

CLIP and Vision-Language Alignment

CLIP (Radford et al., 2021) learned a shared embedding space for images and text by training on 400M image-text pairs with a contrastive loss:

$$\mathcal{L} = -\log \frac{\exp(\text{sim}(I_i, T_i) / \tau)}{\sum_{j} \exp(\text{sim}(I_i, T_j) / \tau)}$$

CLIP embeddings became the standard for text conditioning in diffusion models, providing rich semantic guidance.
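
A sketch of the symmetric form of this contrastive loss, given already-computed image and text embeddings; the encoders themselves are assumed.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # img_emb, txt_emb: (batch, dim) outputs of the image and text encoders (assumed)
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature              # (batch, batch) pairwise similarities
    labels = torch.arange(img.size(0), device=img.device)
    # Symmetric cross-entropy: match image i to text i, and text i to image i.
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```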

Conditional Diffusion Models

Classifier Guidance

Dhariwal & Nichol (2021) introduced classifier guidance for conditional diffusion. Given a pre-trained classifier $p(c | x_t)$, we can guide the reverse process using gradients:

$$\nabla_{x_t} \log p(x_t | c) = \nabla_{x_t} \log p(x_t) + \nabla_{x_t} \log p(c | x_t)$$

The score estimate becomes:

$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t) - \sqrt{1 - \bar{\alpha}_t} \nabla_{x_t} \log p_\phi(c | x_t)$$

A guidance scale $s$ controls the strength: $\tilde{\epsilon} = \epsilon - s \sqrt{1 - \bar{\alpha}_t} \nabla \log p_\phi(c | x_t)$.
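
A sketch of this adjustment for one sampling step, assuming a noise-prediction network `eps_model(x_t, t)` and a classifier `classifier(x_t, t)` trained on noisy inputs; both interfaces are assumptions.

```python
import torch

def guided_eps(eps_model, classifier, x_t, t, y, alpha_bar_t, scale=1.0):
    """Classifier-guided noise estimate: eps - s * sqrt(1 - alpha_bar_t) * grad log p(y | x_t)."""
    eps = eps_model(x_t, t)
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(x_in.size(0)), y].sum()
        grad = torch.autograd.grad(selected, x_in)[0]   # d log p(y | x_t) / d x_t
    return eps - scale * (1.0 - alpha_bar_t) ** 0.5 * grad
```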

Classifier-Free Guidance

Ho & Salimans (2021) eliminated the need for a separate classifier: a single model is trained to act as both the conditional and the unconditional model by randomly dropping the condition $c$ during training.

At inference, extrapolate away from the unconditional prediction:

$$\tilde{\epsilon}_\theta(x_t, t, c) = \epsilon_\theta(x_t, t, \emptyset) + w \cdot (\epsilon_\theta(x_t, t, c) - \epsilon_\theta(x_t, t, \emptyset))$$

where $w$ is the guidance weight. This became the standard approach, used in Stable Diffusion, DALL-E 2, and Imagen.
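
A sketch of the guidance computation at sampling time, assuming a noise-prediction network that accepts a (possibly null) conditioning embedding; batching the two branches together is a common efficiency trick.

```python
import torch

def cfg_eps(eps_model, x_t, t, cond_emb, null_emb, w=7.5):
    """Classifier-free guidance: extrapolate away from the unconditional prediction."""
    x_in = torch.cat([x_t, x_t], dim=0)          # run both branches in one forward pass
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([cond_emb, null_emb], dim=0)
    eps_cond, eps_uncond = eps_model(x_in, t_in, c_in).chunk(2, dim=0)
    return eps_uncond + w * (eps_cond - eps_uncond)
```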

Cross-Attention in Diffusion U-Nets

Text-to-image diffusion models typically use U-Net architectures with cross-attention layers. At each resolution level:

  1. Self-attention over spatial locations
  2. Cross-attention to text embeddings
  3. Feed-forward layers

The cross-attention allows image features to selectively attend to relevant text tokens, enabling fine-grained text-image alignment.
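
A sketch of how such a block composes the three sub-layers, using standard PyTorch attention modules; the pre-norm residual arrangement shown here is a common convention and an assumption, not a statement about any particular model's exact layout.

```python
import torch
import torch.nn as nn

class UNetTransformerBlock(nn.Module):
    """Self-attention over spatial tokens, cross-attention to text tokens, then an MLP."""
    def __init__(self, dim, txt_dim, heads=8):
        super().__init__()
        self.norm1, self.norm2, self.norm3 = (nn.LayerNorm(dim) for _ in range(3))
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, kdim=txt_dim, vdim=txt_dim,
                                                batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, txt):
        # x: (B, H*W, dim) spatial tokens; txt: (B, L, txt_dim) text embeddings
        h = self.norm1(x)
        x = x + self.self_attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x)
        x = x + self.cross_attn(h, txt, txt, need_weights=False)[0]
        return x + self.mlp(self.norm3(x))

block = UNetTransformerBlock(dim=320, txt_dim=768)
out = block(torch.randn(2, 64 * 64, 320), torch.randn(2, 77, 768))
```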

Latent Diffusion Models (Stable Diffusion)

Rombach et al. (2022) proposed operating diffusion in a learned latent space rather than pixel space. This reduces computational cost while maintaining quality.

The pipeline: Encode image $x$ to latent $z$ via VAE encoder, run diffusion in latent space, decode back to pixels. Text conditioning uses CLIP text encoder with cross-attention.
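
At a high level, sampling from a latent diffusion model with classifier-free guidance looks like the sketch below; `vae`, `text_encoder`, `unet`, and `scheduler` are placeholders for the respective components rather than any specific library's API.

```python
import torch

@torch.no_grad()
def latent_diffusion_sample(vae, text_encoder, unet, scheduler, prompt_tokens,
                            null_tokens, latent_shape=(1, 4, 64, 64), guidance_scale=7.5):
    cond = text_encoder(prompt_tokens)     # text embeddings for cross-attention
    uncond = text_encoder(null_tokens)     # embeddings of the empty prompt
    z = torch.randn(latent_shape)          # start from Gaussian noise in latent space
    for t in scheduler.timesteps:          # iterative denoising in latent space
        eps_c = unet(z, t, cond)
        eps_u = unet(z, t, uncond)
        eps = eps_u + guidance_scale * (eps_c - eps_u)
        z = scheduler.step(eps, t, z)      # one reverse-diffusion update
    return vae.decode(z)                   # map the clean latent back to pixels
```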

Spatial Control in Image Generation

ControlNet: Adding Spatial Conditions

Zhang et al. (2023) introduced ControlNet for adding spatial conditioning signals (edges, depth, pose, segmentation) to pre-trained diffusion models without fine-tuning the original weights.

The architecture creates a trainable copy of the encoder blocks while the original weights remain frozen. The conditioning signal $c$ is processed by the trainable copy, and its output is merged back into the frozen network via zero convolutions:

$$y = F(x; \Theta) + \mathcal{Z}(F(x + \mathcal{Z}(c; \Theta_{z1}); \Theta_c); \Theta_{z2})$$

This preserves the original model while learning to incorporate new conditions.
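
A sketch of the zero-convolution idea: 1x1 convolutions initialized to zero, so the control branch contributes nothing at the start of training and the pre-trained behavior is preserved. The block structure here is simplified relative to the full architecture.

```python
import torch
import torch.nn as nn

def zero_conv(channels):
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """Frozen block F plus a trainable copy F_c that receives the control signal."""
    def __init__(self, frozen_block, trainable_copy, channels):
        super().__init__()
        self.frozen = frozen_block.requires_grad_(False)   # original weights stay locked
        self.copy = trainable_copy
        self.z_in, self.z_out = zero_conv(channels), zero_conv(channels)

    def forward(self, x, control):
        # y = F(x) + Z(F_c(x + Z(c))): at initialization the second term is exactly zero.
        return self.frozen(x) + self.z_out(self.copy(x + self.z_in(control)))
```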

T2I-Adapter and Lightweight Control

Mou et al. (2023) proposed a more lightweight approach: small adapter networks that inject spatial features into the diffusion U-Net at multiple scales, requiring fewer parameters than ControlNet.

IP-Adapter: Image Prompting

Ye et al. (2023) enabled image conditioning via decoupled cross-attention. Image features from CLIP image encoder are used in separate cross-attention layers alongside text features, allowing image prompting while maintaining text control.

Video Generation: Temporal Conditioning

The Video Generation Challenge

Video generation adds temporal consistency requirements. We need to model not just $p(x_i | c)$ for each frame, but $p(x_1, \ldots, x_T | c)$ with temporal coherence.

Key challenges include maintaining temporal consistency across frames, modeling realistic motion and dynamics, and managing the computational cost of the added time dimension.

Frame-by-Frame with Temporal Conditioning

Early approaches generated video frame-by-frame, conditioning each frame on previous frames:

$$p(x_1, \ldots, x_T | c) = \prod_{t=1}^T p(x_t | x_{<t}, c)$$

This ensures consistency but can suffer from error accumulation and lacks global temporal planning.

3D Convolutions and Temporal Attention

Extending 2D architectures to 3D (treating time as an extra dimension) allows joint spatial-temporal processing: 3D convolutions capture local motion, while temporal attention models longer-range dependencies across frames.

Video Diffusion Models

Ho et al. (2022) extended image diffusion to video by adding temporal layers. The U-Net architecture becomes:

  1. Spatial layers (2D convolution + self-attention over pixels)
  2. Temporal layers (1D convolution + self-attention over time)
  3. Factored space-time attention: attend spatially, then temporally

This allows leveraging pre-trained image models by only adding temporal layers.
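
A sketch of the factored space-time pattern: the video tensor is reshaped so that spatial attention runs over locations within each frame and temporal attention runs over frames at each location. `spatial_attn` and `temporal_attn` are assumed to be ordinary sequence-attention modules.

```python
import torch

def factored_attention(x, spatial_attn, temporal_attn):
    # x: (batch, time, height, width, channels)
    b, t, h, w, c = x.shape
    # Spatial attention: fold time into the batch, attend over the h*w tokens per frame.
    xs = spatial_attn(x.reshape(b * t, h * w, c)).reshape(b, t, h, w, c)
    # Temporal attention: fold space into the batch, attend over the t tokens per location.
    xt = xs.permute(0, 2, 3, 1, 4).reshape(b * h * w, t, c)
    xt = temporal_attn(xt).reshape(b, h, w, t, c)
    return xt.permute(0, 3, 1, 2, 4)    # back to (batch, time, height, width, channels)
```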

Text-to-Video with Cascaded Models

Imagen Video (Ho et al., 2022) uses a cascade of models:

  1. Base model: text to low-resolution video (16 frames, 24x24)
  2. Temporal super-resolution: increase frame rate
  3. Spatial super-resolution: increase resolution

Each stage is a video diffusion model conditioned on text and lower-resolution outputs.

Latent Video Diffusion

Blattmann et al. (2023) extended Latent Diffusion to video, operating diffusion in a compressed latent space. This significantly reduces computational cost while maintaining quality.

Architecture: 3D VAE for video compression, diffusion in latent space with temporal attention, text conditioning via CLIP.

Stable Video Diffusion (SVD)

Blattmann et al. (2023) introduced SVD, fine-tuning Stable Diffusion for video generation. Key insights include the importance of systematic data curation (filtering a large video corpus for motion and visual quality) and a staged training recipe: image pretraining, video pretraining on the large curated dataset, and finetuning on a smaller high-quality subset.

SVD enables both text-to-video and image-to-video generation.

Sora and Large-Scale Video Generation

OpenAI's Sora represents the state of the art in text-to-video synthesis. While architectural details are limited, key ideas include a diffusion transformer operating on spacetime patches of a compressed latent representation, and training on videos of variable duration, resolution, and aspect ratio.

Motion and Temporal Control

Motion Conditioning Signals

Beyond text, we can condition video generation on motion information:

Optical flow: Dense motion vectors between frames guide the generation process.

Trajectory control: Specify paths for objects or camera motion.

Pose sequences: For human videos, condition on skeletal pose over time.

Depth sequences: Control 3D structure and camera motion.

AnimateDiff: Motion Modules

Guo et al. (2023) introduced motion modules that can be inserted into pre-trained image diffusion models. The motion module learns temporal consistency patterns that transfer across different visual content.

Key idea: decouple appearance (from image model) and motion (from motion module). This allows combining different appearance models with learned motion priors.

DragNUWA and Controllable Video Editing

Recent work enables fine-grained control via trajectory-based interfaces. Users specify motion by dragging points, and the model generates video following those trajectories.

Multi-Modal Conditioning

Combining Multiple Conditions

Modern systems often combine multiple conditioning signals: text plus spatial maps (as in ControlNet), text plus reference images (as in IP-Adapter), or text plus pose and motion sequences for video.

The challenge is balancing different signals and avoiding conflicts.

Compositional Generation

Composing multiple concepts remains challenging. Recent approaches:

Attention manipulation: Modify cross-attention maps to control object placement and relationships.

Compositional classifiers: Combine multiple classifier gradients for classifier guidance.

Structured diffusion: Explicitly model scene graphs or object layouts.

Personalization and Few-Shot Adaptation

Textual Inversion (Gal et al., 2022) learns new text tokens representing specific concepts from a few images. DreamBooth (Ruiz et al., 2023) fine-tunes the entire model on a subject with a rare token identifier.

These enable personalized generation: combining learned concepts with pre-trained model capabilities.

Training Strategies

Conditional Dropout

For classifier-free guidance, randomly drop conditioning during training (typically 10-20% of the time). This trains the model to work both conditionally and unconditionally:

$$c \leftarrow \begin{cases} \emptyset & \text{with probability } p_{\text{uncond}} \\ c & \text{otherwise} \end{cases}$$
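
A sketch of this dropout applied to conditioning embeddings, where a learned (or zero) null embedding stands in for $\emptyset$; the shapes and the null-embedding convention are assumptions.

```python
import torch

def drop_condition(cond_emb, null_emb, p_uncond=0.1):
    """Randomly replace a fraction of conditioning embeddings with a null embedding."""
    # cond_emb: (batch, ..., dim); null_emb must broadcast to the same shape.
    keep = (torch.rand(cond_emb.size(0), device=cond_emb.device) > p_uncond).float()
    keep = keep.view(-1, *([1] * (cond_emb.dim() - 1)))    # broadcast over remaining dims
    return keep * cond_emb + (1.0 - keep) * null_emb
```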

Progressive Growing and Cascaded Refinement

High-resolution generation often uses cascaded models:

  1. Generate low-resolution output (fast, captures global structure)
  2. Super-resolve to higher resolution (adds detail, conditioned on low-res)
  3. Optionally repeat for multiple resolution stages

This is more efficient than directly generating high-resolution outputs.
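
A high-level sketch of cascaded sampling, where `base_model` and the entries of `upsamplers` are placeholder conditional samplers exposing a hypothetical `sample` method:

```python
def cascaded_sample(base_model, upsamplers, cond):
    # Stage 1: a low-resolution sample that captures global structure.
    x = base_model.sample(cond)
    # Later stages: each super-resolution model conditions on both the original
    # condition and the previous stage's output.
    for sr_model in upsamplers:
        x = sr_model.sample(cond, low_res=x)
    return x
```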

Data Curation and Filtering

Recent work emphasizes the importance of high-quality training data: filtering for aesthetic and caption quality, deduplication, and, for video, motion-aware filtering all have a measurable effect on downstream generation quality.

Evaluation and Metrics

Image Quality Metrics

FID (Fréchet Inception Distance): Measures distribution distance between real and generated images in Inception feature space.

IS (Inception Score): Evaluates quality and diversity based on classifier predictions.

CLIP Score: Measures alignment between generated images and text conditions using CLIP embeddings.
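
CLIP score is typically computed as the cosine similarity between the CLIP embedding of a generated image and that of its prompt, averaged over samples (implementations often clamp negatives to zero and rescale by a constant). A sketch with pre-computed embeddings:

```python
import torch
import torch.nn.functional as F

def clip_score(img_emb, txt_emb):
    # img_emb, txt_emb: (n_samples, dim) CLIP embeddings of generated images and prompts
    sims = F.cosine_similarity(img_emb, txt_emb, dim=-1)
    return sims.clamp(min=0).mean()     # one common convention: clamp negatives to zero

score = clip_score(torch.randn(16, 512), torch.randn(16, 512))
```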

Video Metrics

FVD (Fréchet Video Distance): Extends FID to video using I3D features.

Temporal consistency: Warping error between consecutive frames using optical flow.

Motion quality: Evaluating realistic dynamics and physics.

Human Evaluation

Automated metrics don't capture all aspects of quality. Human evaluation remains crucial for assessing prompt fidelity, aesthetic quality, and, for video, temporal coherence and motion realism.

Applications and Impact

Creative Tools

Text-to-image and text-to-video models enable new creative workflows: concept exploration, storyboarding and previsualization, image editing and inpainting, and rapid iteration on visual styles.

Accessibility

These tools democratize content creation, allowing non-experts to generate professional-quality visuals through natural language descriptions.

Scientific Visualization

Conditional models help visualize complex concepts, generate training data for other ML systems, and assist in scientific communication.

Challenges and Future Directions

Current Limitations

Fine-grained control: Precise spatial and temporal control remains difficult. Text is ambiguous; structured interfaces help but add complexity.

3D consistency: Generated videos often lack physical plausibility and 3D coherence across views.

Long videos: Maintaining consistency over minutes or hours is unsolved. Current models handle seconds to tens of seconds.

Computational cost: High-resolution, long-duration video generation requires massive compute.

Emerging Directions

World models: Learning physical dynamics and 3D structure from video data to enable controllable simulation.

Interactive generation: Real-time conditional generation for interactive experiences and games.

Multi-modal understanding: Better integration of vision, language, audio, and other modalities.

Efficiency: Faster generation through distillation, better architectures, and optimized sampling.

Conclusion

Conditional generative models have evolved from simple class-conditioned GANs to sophisticated multi-modal systems that can generate high-quality images and videos from text, spatial controls, and motion specifications. The field continues to advance rapidly.

Key insights: cross-attention and classifier-free guidance are the workhorses of modern conditioning; operating in a learned latent space makes high-resolution image and video diffusion tractable; adapter-style modules (ControlNet, T2I-Adapter, IP-Adapter) add new conditions without retraining the base model; and temporal layers extend pre-trained image models to video.

The frontier is moving toward more controllable, efficient, and physically plausible generation. As models continue scaling and techniques improve, we approach the goal of generating any visual content from high-level specifications.