Conditional Generative Models
January 11, 2026
Introduction
Unconditional generation is impressive, but real applications demand control. We don't just want any face or video; we want a specific person, scene, or motion. Conditional generative models solve this by learning $p(x|c)$: the distribution of outputs $x$ given conditioning information $c$.
The conditioning signal $c$ can take many forms: class labels, text descriptions, reference images, sketches, poses, depth maps, or even other videos. The field has evolved from simple class conditioning to sophisticated multi-modal control, enabling applications from text-to-image synthesis to controllable video generation.
This article covers the complete landscape of conditional generation, from early image models to modern video synthesis systems.
Foundations of Conditional Generation
The Conditional Modeling Problem
Given conditioning information $c$ (e.g., a class label, text, or image), we want to model the conditional distribution

$$p(x | c).$$
The key challenge is to ensure the generated output $x$ is both high-quality and faithful to the condition $c$. Different model families handle this differently.
Conditioning Mechanisms
There are several ways to incorporate conditioning information into a generative model:
Concatenation: Simply concatenate $c$ with the input or latent code. For images, this might mean concatenating a class embedding to the noise vector $z$.
Conditional batch normalization: Modulate the batch normalization parameters based on $c$. This allows $c$ to control the statistics of feature maps throughout the network.
Cross-attention: For structured conditions like text, use cross-attention mechanisms where image features attend to text embeddings.
Adaptive layer normalization (AdaLN): Predict per-layer scale and shift parameters from $c$ and apply them after normalization; widely used in diffusion transformers (DiT).
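As a concrete illustration of the last mechanism, here is a minimal PyTorch sketch of AdaLN-style conditioning; the class name `AdaLNBlock` and all dimensions are illustrative choices, not taken from any particular paper.

```python
import torch
import torch.nn as nn

class AdaLNBlock(nn.Module):
    """Minimal adaptive layer-norm block: the condition embedding predicts
    a per-channel scale and shift applied after normalization."""
    def __init__(self, dim: int, cond_dim: int):
        super().__init__()
        # elementwise_affine=False: the affine parameters come from the condition.
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(cond_dim, 2 * dim)

    def forward(self, h: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # h: (batch, tokens, dim), cond: (batch, cond_dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(h) * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)

# Toy usage: 8 tokens of width 64, conditioned on a 32-dim embedding.
block = AdaLNBlock(dim=64, cond_dim=32)
out = block(torch.randn(2, 8, 64), torch.randn(2, 32))
print(out.shape)  # torch.Size([2, 8, 64])
```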
Conditional GANs
Class-Conditional GANs
Mirza & Osindero (2014) introduced Conditional GAN (cGAN), extending the GAN framework to conditional generation. Both generator and discriminator receive the conditioning information:
The discriminator must distinguish real samples from the correct class versus fake samples, while the generator tries to fool it:
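A compact sketch of the corresponding training losses in PyTorch, using the non-saturating generator objective; the call signatures `D(x, c)` and `G(z, c)` are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def cgan_losses(D, G, x_real, c, z):
    """Conditional GAN losses (non-saturating generator variant).
    Assumed signatures: D(x, c) -> real/fake logit, G(z, c) -> sample."""
    x_fake = G(z, c)

    real_logit = D(x_real, c)
    fake_logit = D(x_fake.detach(), c)  # detach: the D step does not update G
    d_loss = (
        F.binary_cross_entropy_with_logits(real_logit, torch.ones_like(real_logit))
        + F.binary_cross_entropy_with_logits(fake_logit, torch.zeros_like(fake_logit))
    )

    # Generator step: push D toward labeling fake (x, c) pairs as real.
    gen_logit = D(x_fake, c)
    g_loss = F.binary_cross_entropy_with_logits(gen_logit, torch.ones_like(gen_logit))
    return d_loss, g_loss
```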
Auxiliary Classifier GANs (AC-GAN)
Odena et al. (2017) proposed AC-GAN, which adds a classification head to the discriminator. The discriminator must both detect fakes and correctly classify the condition:

$$\max_D \; \mathcal{L}_{\text{src}} + \mathcal{L}_{\text{cls}}, \qquad \max_G \; \mathcal{L}_{\text{cls}} - \mathcal{L}_{\text{src}},$$

where $\mathcal{L}_{\text{src}}$ is the log-likelihood of the correct source (real vs. fake) and $\mathcal{L}_{\text{cls}}$ is the log-likelihood of the correct class. This encourages the generator to produce class-discriminative features.
Projection Discriminator
Miyato & Koyama (2018) introduced a more principled conditioning approach via inner products. The discriminator computes:
where $\phi(x)$ are image features and $\psi(y)$ are class embeddings. This projection-based conditioning proved more stable and effective than concatenation.
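A minimal sketch of a projection-style discriminator head in PyTorch, assuming the image feature extractor $\phi$ is defined elsewhere; names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ProjectionHead(nn.Module):
    """Projection-style conditional head: an unconditional scalar score plus
    an inner product between image features and a learned class embedding."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.uncond = nn.Linear(feat_dim, 1)               # b(phi(x))
        self.embed = nn.Embedding(num_classes, feat_dim)   # psi(y)

    def forward(self, phi_x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # phi_x: (batch, feat_dim), y: (batch,) integer class labels
        proj = (self.embed(y) * phi_x).sum(dim=1, keepdim=True)
        return self.uncond(phi_x) + proj

head = ProjectionHead(feat_dim=128, num_classes=10)
score = head(torch.randn(4, 128), torch.randint(0, 10, (4,)))
print(score.shape)  # torch.Size([4, 1])
```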
StyleGAN and Conditional Control
StyleGAN (Karras et al., 2019) introduced style-based generation, where a mapping network transforms $z \to w$, and the generator is conditioned on $w$ at each layer via adaptive instance normalization (AdaIN):

$$\text{AdaIN}(h, w) = w_s \, \frac{h - \mu(h)}{\sigma(h)} + w_b,$$

where $\mu(h)$, $\sigma(h)$ are per-channel feature statistics and $w_s$, $w_b$ are learned from $w$. This provides fine-grained control over generation at different scales, enabling techniques like style mixing and latent space editing.
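For completeness, a minimal AdaIN layer in PyTorch; the module name and dimensions are illustrative, and real StyleGAN implementations add further details such as noise injection.

```python
import torch
import torch.nn as nn

class AdaIN(nn.Module):
    """Adaptive instance normalization: normalize each channel with its own
    spatial statistics, then modulate with a scale/shift predicted from w."""
    def __init__(self, channels: int, w_dim: int):
        super().__init__()
        self.norm = nn.InstanceNorm2d(channels, affine=False)
        self.to_style = nn.Linear(w_dim, 2 * channels)

    def forward(self, h: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
        # h: (batch, channels, height, width), w: (batch, w_dim)
        w_s, w_b = self.to_style(w).chunk(2, dim=-1)
        return (1 + w_s[:, :, None, None]) * self.norm(h) + w_b[:, :, None, None]

layer = AdaIN(channels=64, w_dim=512)
out = layer(torch.randn(2, 64, 16, 16), torch.randn(2, 512))
```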
Text-to-Image Generation
Early Approaches: Text Encodings
Text conditioning requires encoding variable-length descriptions into fixed representations. Early methods used:
- Character-level CNNs: Reed et al. (2016) encoded text character-by-character
- Recurrent encoders: LSTMs or GRUs to process text sequences
- Pre-trained embeddings: Word2Vec or GloVe for word representations
These representations were concatenated with noise or used for conditional batch normalization.
Attention-Based Text Conditioning
AttnGAN (Xu et al., 2018) introduced cross-attention between image features and word embeddings. At each layer, image features attend to relevant words via scaled dot-product attention:

$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V,$$

where $Q$ comes from image features and $K$, $V$ come from text embeddings. This allows fine-grained alignment between visual regions and words.
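A minimal single-head cross-attention layer in PyTorch, with image tokens as queries and word embeddings as keys and values; all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class TextCrossAttention(nn.Module):
    """Single-head cross-attention: image tokens (queries) attend to text
    tokens (keys/values), with a residual connection."""
    def __init__(self, img_dim: int, txt_dim: int, attn_dim: int):
        super().__init__()
        self.q = nn.Linear(img_dim, attn_dim)
        self.k = nn.Linear(txt_dim, attn_dim)
        self.v = nn.Linear(txt_dim, attn_dim)
        self.out = nn.Linear(attn_dim, img_dim)

    def forward(self, img_tokens: torch.Tensor, txt_tokens: torch.Tensor) -> torch.Tensor:
        # img_tokens: (batch, h*w, img_dim), txt_tokens: (batch, words, txt_dim)
        q, k, v = self.q(img_tokens), self.k(txt_tokens), self.v(txt_tokens)
        attn = torch.softmax(q @ k.transpose(1, 2) / k.shape[-1] ** 0.5, dim=-1)
        return img_tokens + self.out(attn @ v)

layer = TextCrossAttention(img_dim=256, txt_dim=768, attn_dim=256)
out = layer(torch.randn(2, 64, 256), torch.randn(2, 12, 768))
```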
DALL-E and Discrete Representations
DALL-E (Ramesh et al., 2021) took a different approach: learn discrete image tokens via a discrete VAE, then use an autoregressive transformer to model the joint distribution of text and image tokens:

$$p(t_1, \ldots, t_N) = \prod_{i=1}^{N} p(t_i | t_{<i}),$$

where the sequence $t_1, \ldots, t_N$ is the concatenation of the text tokens followed by the image tokens.
This allowed leveraging the power of large language models for image generation, achieving remarkable compositionality and zero-shot generalization.
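A toy sketch of this token-stream formulation in PyTorch: text and image tokens are concatenated into one sequence and trained with a next-token objective under a causal mask. Vocabulary sizes, sequence lengths, and the tiny transformer are all illustrative.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 256
embed = nn.Embedding(vocab_size, d_model)
decoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=2
)
to_logits = nn.Linear(d_model, vocab_size)

# One stream: text tokens first, then image tokens (toy sizes).
text_tokens = torch.randint(0, vocab_size, (2, 16))
image_tokens = torch.randint(0, vocab_size, (2, 64))
seq = torch.cat([text_tokens, image_tokens], dim=1)

# Causal mask so each position only attends to earlier tokens.
L = seq.shape[1]
mask = torch.triu(torch.ones(L, L, dtype=torch.bool), diagonal=1)
hidden = decoder(embed(seq), mask=mask)
logits = to_logits(hidden)

# Next-token loss; in practice it is often restricted to the image positions.
loss = nn.functional.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size), seq[:, 1:].reshape(-1)
)
```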
CLIP and Vision-Language Alignment
CLIP (Radford et al., 2021) learned a shared embedding space for images and text by training on 400M image-text pairs with a contrastive loss. For a batch of $N$ matched pairs with normalized image embeddings $I_i$ and text embeddings $T_i$, the image-to-text term is

$$\mathcal{L}_{I \to T} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{\exp(I_i \cdot T_i / \tau)}{\sum_{j=1}^{N} \exp(I_i \cdot T_j / \tau)},$$

averaged with the symmetric text-to-image term, where $\tau$ is a learned temperature.
CLIP embeddings became the standard for text conditioning in diffusion models, providing rich semantic guidance.
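A minimal PyTorch sketch of the symmetric contrastive (InfoNCE) loss over a batch of matched image-text embedding pairs; the fixed temperature is an illustrative simplification of CLIP's learned temperature.

```python
import torch
import torch.nn.functional as F

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss; row i of each tensor is a matching pair."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature     # (N, N) similarity matrix
    targets = torch.arange(len(img_emb))             # diagonal entries are positives
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return (loss_i2t + loss_t2i) / 2

loss = clip_contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))
```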
Conditional Diffusion Models
Classifier Guidance
Dhariwal & Nichol (2021) introduced classifier guidance for conditional diffusion. Given a pre-trained classifier $p(c | x_t)$, we can guide the reverse process using gradients:
The score estimate becomes:
A guidance scale $s$ controls the strength: $\tilde{\epsilon} = \epsilon - s \sqrt{1 - \bar{\alpha}_t} \nabla \log p_\phi(c | x_t)$.
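A sketch of classifier-guided noise prediction in PyTorch; `eps_model(x_t, t)` and `classifier(x_t, t)` are assumed callables, and `alpha_bar_t` is assumed to be a tensor broadcastable against `x_t`.

```python
import torch

def classifier_guided_eps(eps_model, classifier, x_t, t, c, alpha_bar_t, scale):
    """Classifier-guided noise prediction (sketch). classifier(x_t, t) is
    assumed to return class logits for the noisy input x_t."""
    eps = eps_model(x_t, t)

    # Gradient of log p(c | x_t) with respect to x_t.
    with torch.enable_grad():
        x_in = x_t.detach().requires_grad_(True)
        log_probs = torch.log_softmax(classifier(x_in, t), dim=-1)
        selected = log_probs[torch.arange(len(c)), c].sum()
        grad = torch.autograd.grad(selected, x_in)[0]

    # Shift the noise estimate against the classifier gradient.
    return eps - scale * (1 - alpha_bar_t).sqrt() * grad
```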
Classifier-Free Guidance
Ho & Salimans (2021) eliminated the need for a separate classifier. Train a single conditional model on both conditioned and unconditioned data (randomly dropping $c$ during training).
At inference, extrapolate away from the unconditional prediction:
where $w$ is the guidance weight. This became the standard approach, used in Stable Diffusion, DALL-E 2, and Imagen.
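A minimal sketch of the classifier-free guidance combination at sampling time; the signature `eps_model(x, t, cond)` and the batched conditional/unconditional trick are illustrative but common in practice.

```python
import torch

def cfg_eps(eps_model, x_t, t, cond_emb, null_emb, guidance_weight):
    """Classifier-free guidance: run the model with and without the
    condition and extrapolate away from the unconditional prediction."""
    # Batch the conditional and unconditional passes together for efficiency.
    x_in = torch.cat([x_t, x_t], dim=0)
    t_in = torch.cat([t, t], dim=0)
    c_in = torch.cat([cond_emb, null_emb], dim=0)
    eps_cond, eps_uncond = eps_model(x_in, t_in, c_in).chunk(2, dim=0)
    return eps_uncond + guidance_weight * (eps_cond - eps_uncond)
```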
Cross-Attention in Diffusion U-Nets
Text-to-image diffusion models typically use U-Net architectures with cross-attention layers. At each resolution level:
- Self-attention over spatial locations
- Cross-attention to text embeddings
- Feed-forward layers
The cross-attention allows image features to selectively attend to relevant text tokens, enabling fine-grained text-image alignment.
Latent Diffusion Models (Stable Diffusion)
Rombach et al. (2022) proposed operating diffusion in a learned latent space rather than pixel space. This reduces computational cost while maintaining quality.
The pipeline: a VAE encoder maps images $x$ to latents $z$, diffusion runs entirely in latent space, and the VAE decoder maps sampled latents back to pixels. Text conditioning uses a CLIP text encoder with cross-attention.
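For orientation, a minimal usage sketch with the Hugging Face diffusers library; the checkpoint name and sampler settings are illustrative, and any Stable Diffusion checkpoint compatible with this pipeline class should behave similarly.

```python
# pip install diffusers transformers accelerate torch
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # illustrative checkpoint
    torch_dtype=torch.float16,
).to("cuda")

image = pipe(
    "a watercolor painting of a lighthouse at dusk",
    num_inference_steps=30,   # denoising steps, run in latent space
    guidance_scale=7.5,       # classifier-free guidance weight
).images[0]
image.save("lighthouse.png")
```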
Spatial Control in Image Generation
ControlNet: Adding Spatial Conditions
Zhang et al. (2023) introduced ControlNet for adding spatial conditioning signals (edges, depth, pose, segmentation) to pre-trained diffusion models without fine-tuning the original weights.
The architecture creates a trainable copy of the encoder:
- Original frozen weights: $F(\cdot; \Theta)$
- Trainable copy: $F(\cdot; \Theta_c)$ initialized from $\Theta$
- Zero convolution layers to connect them: $\mathcal{Z}(\cdot; \Theta_z)$ initialized to zero
The conditioning signal $c$ is processed by the trainable copy, and its output is added back to the frozen branch via zero convolutions:

$$y = F(x; \Theta) + \mathcal{Z}\big( F(x + \mathcal{Z}(c; \Theta_{z1}); \Theta_c); \Theta_{z2} \big).$$

Because the zero convolutions start at zero, training begins from the unmodified pre-trained model; the original weights are preserved while the copy learns to incorporate new conditions.
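A structural sketch of one ControlNet-style block in PyTorch: a frozen block, a trainable copy, and two zero-initialized 1x1 convolutions. Class names and channel counts are illustrative, and a real implementation also feeds in timestep and text conditioning.

```python
import copy
import torch
import torch.nn as nn

def zero_conv(channels: int) -> nn.Conv2d:
    """1x1 convolution initialized to zero, so the added branch starts as a
    no-op and the frozen model's behavior is preserved at initialization."""
    conv = nn.Conv2d(channels, channels, kernel_size=1)
    nn.init.zeros_(conv.weight)
    nn.init.zeros_(conv.bias)
    return conv

class ControlledBlock(nn.Module):
    """One block: frozen F plus a trainable copy that sees the conditioning
    signal through zero convolutions."""
    def __init__(self, frozen_block: nn.Module, trainable_copy: nn.Module, channels: int):
        super().__init__()
        self.frozen = frozen_block.requires_grad_(False)
        self.copy = trainable_copy
        self.zin = zero_conv(channels)
        self.zout = zero_conv(channels)

    def forward(self, x: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        return self.frozen(x) + self.zout(self.copy(x + self.zin(cond)))

block = nn.Conv2d(8, 8, kernel_size=3, padding=1)
ctrl = ControlledBlock(block, copy.deepcopy(block), channels=8)
out = ctrl(torch.randn(1, 8, 32, 32), torch.randn(1, 8, 32, 32))
```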
T2I-Adapter and Lightweight Control
Mou et al. (2023) proposed a more lightweight approach: small adapter networks that inject spatial features into the diffusion U-Net at multiple scales, requiring fewer parameters than ControlNet.
IP-Adapter: Image Prompting
Ye et al. (2023) enabled image conditioning via decoupled cross-attention. Image features from CLIP image encoder are used in separate cross-attention layers alongside text features, allowing image prompting while maintaining text control.
Video Generation: Temporal Conditioning
The Video Generation Challenge
Video generation adds temporal consistency requirements. We need to model not just $p(x_i | c)$ for each frame, but $p(x_1, \ldots, x_T | c)$ with temporal coherence.
Key challenges:
- Temporal consistency across frames
- Motion modeling and dynamics
- Computational cost (many frames to process)
- Long-range dependencies in time
Frame-by-Frame with Temporal Conditioning
Early approaches generated video frame-by-frame, conditioning each frame on previous frames:

$$p(x_1, \ldots, x_T | c) = \prod_{t=1}^{T} p(x_t | x_{<t}, c).$$
This ensures consistency but can suffer from error accumulation and lacks global temporal planning.
3D Convolutions and Temporal Attention
Extending 2D architectures to 3D (treating time as an extra dimension) allows joint spatial-temporal processing:
- 3D CNNs: Conv3D operations over (H, W, T)
- Factored 3D: Separate spatial (2D) and temporal (1D) convolutions
- Temporal attention: Self-attention across time dimension
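A minimal sketch of the factored (spatial-then-temporal) convolution in PyTorch, operating on tensors shaped (batch, channels, time, height, width); all sizes are illustrative.

```python
import torch
import torch.nn as nn

class FactoredConv3D(nn.Module):
    """Factored spatio-temporal convolution: a 2D spatial conv applied per
    frame followed by a 1D temporal conv per pixel, a common substitute for
    a full 3D convolution."""
    def __init__(self, channels: int):
        super().__init__()
        self.spatial = nn.Conv3d(channels, channels, kernel_size=(1, 3, 3), padding=(0, 1, 1))
        self.temporal = nn.Conv3d(channels, channels, kernel_size=(3, 1, 1), padding=(1, 0, 0))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time, height, width)
        return self.temporal(self.spatial(x))

layer = FactoredConv3D(channels=16)
out = layer(torch.randn(2, 16, 8, 32, 32))  # 8 frames of 32x32
```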
Video Diffusion Models
Ho et al. (2022) extended image diffusion to video by adding temporal layers. The U-Net architecture becomes:
- Spatial layers (2D convolution + self-attention over pixels)
- Temporal layers (1D convolution + self-attention over time)
- Factored space-time attention: attend spatially, then temporally
This allows leveraging pre-trained image models by only adding temporal layers.
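A sketch of factored space-time self-attention in PyTorch: attention over spatial tokens within each frame, followed by attention over time at each spatial location. Shapes and module names are illustrative.

```python
import torch
import torch.nn as nn

class FactoredSpaceTimeAttention(nn.Module):
    """Factored attention for video features of shape (B, T, H*W, C)."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.spatial = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, t, s, c = x.shape
        # Spatial attention: fold time into the batch dimension.
        xs = x.reshape(b * t, s, c)
        xs = xs + self.spatial(xs, xs, xs, need_weights=False)[0]
        # Temporal attention: fold space into the batch dimension.
        xt = xs.reshape(b, t, s, c).permute(0, 2, 1, 3).reshape(b * s, t, c)
        xt = xt + self.temporal(xt, xt, xt, need_weights=False)[0]
        return xt.reshape(b, s, t, c).permute(0, 2, 1, 3)

attn = FactoredSpaceTimeAttention(dim=64)
out = attn(torch.randn(2, 8, 16 * 16, 64))  # 8 frames, 16x16 spatial tokens
```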
Text-to-Video with Cascaded Models
Imagen Video (Ho et al., 2022) uses a cascade of models:
- Base model: text to a short, low-resolution video (16 frames)
- Temporal super-resolution: increase frame rate
- Spatial super-resolution: increase resolution
Each stage is a video diffusion model conditioned on text and lower-resolution outputs.
Latent Video Diffusion
Blattmann et al. (2023) extended Latent Diffusion to video, operating diffusion in a compressed latent space. This significantly reduces computational cost while maintaining quality.
Architecture: 3D VAE for video compression, diffusion in latent space with temporal attention, text conditioning via CLIP.
Stable Video Diffusion (SVD)
Blattmann et al. (2023) introduced SVD, fine-tuning Stable Diffusion for video generation. Key insights:
- Curated pretraining data crucial for motion quality
- Frame-by-frame noise augmentation helps consistency
- Temporal layers can be efficiently added to image models
SVD enables both text-to-video and image-to-video generation.
Sora and Large-Scale Video Generation
OpenAI's Sora represents the state-of-the-art in text-to-video synthesis. While architectural details are limited, key ideas include:
- Diffusion transformers operating on spacetime patches
- Variable resolution and duration training
- Massive scale training on diverse video data
- Understanding physics and 3D consistency from data
Motion and Temporal Control
Motion Conditioning Signals
Beyond text, we can condition video generation on motion information:
Optical flow: Dense motion vectors between frames guide the generation process.
Trajectory control: Specify paths for objects or camera motion.
Pose sequences: For human videos, condition on skeletal pose over time.
Depth sequences: Control 3D structure and camera motion.
AnimateDiff: Motion Modules
Guo et al. (2023) introduced motion modules that can be inserted into pre-trained image diffusion models. The motion module learns temporal consistency patterns that transfer across different visual content.
Key idea: decouple appearance (from image model) and motion (from motion module). This allows combining different appearance models with learned motion priors.
DragNUWA and Controllable Video Editing
Recent work enables fine-grained control via trajectory-based interfaces. Users specify motion by dragging points, and the model generates video following those trajectories.
Multi-Modal Conditioning
Combining Multiple Conditions
Modern systems often combine multiple conditioning signals:
- Text + reference image (IP-Adapter)
- Text + spatial structure (ControlNet)
- Text + motion + style (video models)
- Multi-view consistency for 3D generation
The challenge is balancing different signals and avoiding conflicts.
Compositional Generation
Composing multiple concepts remains challenging. Recent approaches:
Attention manipulation: Modify cross-attention maps to control object placement and relationships.
Compositional classifiers: Combine multiple classifier gradients for classifier guidance.
Structured diffusion: Explicitly model scene graphs or object layouts.
Personalization and Few-Shot Adaptation
Textual Inversion (Gal et al., 2022) learns new text tokens representing specific concepts from a few images. DreamBooth (Ruiz et al., 2023) fine-tunes the entire model on a subject with a rare token identifier.
These enable personalized generation: combining learned concepts with pre-trained model capabilities.
Training Strategies
Conditional Dropout
For classifier-free guidance, randomly drop the conditioning during training (typically 10-20% of the time), replacing $c$ with a null token. This trains the model to produce both the conditional prediction $\epsilon_\theta(x_t, c)$ and the unconditional prediction $\epsilon_\theta(x_t, \varnothing)$ that guidance combines at sampling time.
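A minimal sketch of conditional dropout on a batch of condition embeddings; shapes and the null embedding are illustrative.

```python
import torch

def drop_condition(cond_emb: torch.Tensor, null_emb: torch.Tensor, drop_prob: float = 0.1):
    """Randomly replace condition embeddings with a null embedding during
    training, so the same network learns both the conditional and the
    unconditional prediction. cond_emb: (batch, dim), null_emb: (dim,)."""
    keep = (torch.rand(cond_emb.shape[0], 1, device=cond_emb.device) > drop_prob).float()
    return keep * cond_emb + (1 - keep) * null_emb

batch = drop_condition(torch.randn(16, 512), torch.zeros(512), drop_prob=0.15)
```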
Progressive Growing and Cascaded Refinement
High-resolution generation often uses cascaded models:
- Generate low-resolution output (fast, captures global structure)
- Super-resolve to higher resolution (adds detail, conditioned on low-res)
- Optionally repeat for multiple resolution stages
This is more efficient than directly generating high-resolution outputs.
Data Curation and Filtering
Recent work emphasizes the importance of high-quality training data:
- Filter low-quality or misaligned image-text pairs
- Synthetic captions from vision-language models (BLIP, CoCa)
- Aesthetic scoring to prioritize visually appealing data
- Deduplication and diversity analysis
Evaluation and Metrics
Image Quality Metrics
FID (Fréchet Inception Distance): Measures distribution distance between real and generated images in Inception feature space.
IS (Inception Score): Evaluates quality and diversity based on classifier predictions.
CLIP Score: Measures alignment between generated images and text conditions using CLIP embeddings.
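A minimal CLIP score sketch using the Hugging Face transformers CLIP implementation; the checkpoint name is illustrative, and reported CLIP scores are often this cosine similarity scaled by 100.

```python
# pip install transformers torch pillow
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    """Cosine similarity between CLIP image and text embeddings; higher
    values indicate better text-image alignment."""
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        txt_emb = model.get_text_features(
            input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
        )
    return F.cosine_similarity(img_emb, txt_emb).item()

score = clip_score(Image.new("RGB", (256, 256)), "a red square on a white wall")
```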
Video Metrics
FVD (Fréchet Video Distance): Extends FID to video using I3D features.
Temporal consistency: Warping error between consecutive frames using optical flow.
Motion quality: Evaluating realistic dynamics and physics.
Human Evaluation
Automated metrics don't capture all aspects of quality. Human evaluation remains crucial for assessing:
- Semantic alignment with text
- Image aesthetics and artistic quality
- Temporal coherence in videos
- Absence of artifacts
Applications and Impact
Creative Tools
Text-to-image and text-to-video models enable new creative workflows:
- Rapid concept visualization and prototyping
- Style transfer and artistic effects
- Content generation for games and films
- Personalized content creation
Accessibility
These tools democratize content creation, allowing non-experts to generate professional-quality visuals through natural language descriptions.
Scientific Visualization
Conditional models help visualize complex concepts, generate training data for other ML systems, and assist in scientific communication.
Challenges and Future Directions
Current Limitations
Fine-grained control: Precise spatial and temporal control remains difficult. Text is ambiguous; structured interfaces help but add complexity.
3D consistency: Generated videos often lack physical plausibility and 3D coherence across views.
Long videos: Maintaining consistency over minutes or hours is unsolved. Current models handle seconds to tens of seconds.
Computational cost: High-resolution, long-duration video generation requires massive compute.
Emerging Directions
World models: Learning physical dynamics and 3D structure from video data to enable controllable simulation.
Interactive generation: Real-time conditional generation for interactive experiences and games.
Multi-modal understanding: Better integration of vision, language, audio, and other modalities.
Efficiency: Faster generation through distillation, better architectures, and optimized sampling.
Conclusion
Conditional generative models have evolved from simple class-conditioned GANs to sophisticated multi-modal systems that can generate high-quality images and videos from text, spatial controls, and motion specifications. The field continues to advance rapidly.
Key insights:
- Effective conditioning requires careful architectural choices: cross-attention for text, spatial controls via ControlNet, temporal layers for video
- Classifier-free guidance enables strong conditioning without auxiliary models
- Latent diffusion dramatically reduces computational cost while maintaining quality
- Video generation requires explicit temporal modeling and consistency mechanisms
- Data quality and curation are as important as model architecture
The frontier is moving toward more controllable, efficient, and physically plausible generation. As models continue scaling and techniques improve, we approach the goal of generating any visual content from high-level specifications.