Positional Embeddings: Why and How

January 25, 2026

Introduction

Self-attention is permutation-invariant: shuffle the input sequence and you get the same set of outputs, just reordered. This is a problem: position matters. "The cat chased the mouse" means something different from "The mouse chased the cat".

Positional embeddings solve this by injecting position information into the model. But how to encode position is non-trivial, and the choice significantly impacts model performance, especially on long sequences.

This article covers the evolution of positional embeddings: why we need them, how they work, and the trade-offs between different approaches.

Why Position Matters

The Permutation Invariance Problem

Recall the attention formula:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right) V$$

If we permute the rows of $Q$, $K$, and $V$ identically, the output rows are permuted in exactly the same way and each token's result is unchanged: nothing in the computation depends on where a token sits in the sequence. The operation treats the input as a set, not a sequence.

For language, this is clearly wrong. Word order carries syntactic and semantic meaning. For images, spatial arrangement matters. For time series, temporal order is critical.
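
To make this concrete, here is a small PyTorch check (toy dimensions, a single unmasked head, illustrative names): permuting the input rows permutes the output rows in exactly the same way, so nothing in the computation reflects position.

```python
import torch

torch.manual_seed(0)
n, d = 5, 8
X = torch.randn(n, d)                       # token representations
Wq, Wk, Wv = (torch.randn(d, d) for _ in range(3))

def attention(X):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    weights = torch.softmax(Q @ K.T / d ** 0.5, dim=-1)
    return weights @ V

perm = torch.randperm(n)                    # shuffle the sequence
out, out_perm = attention(X), attention(X[perm])

# The output is permuted in exactly the same way as the input:
print(torch.allclose(out[perm], out_perm, atol=1e-5))  # True
```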

What Information Do We Need?

Position information can take different forms:

Absolute position: "this token is at position 5." Useful when specific positions carry meaning, such as the start of a sequence.

Relative position: "these two tokens are 3 positions apart." Often what syntax and local structure actually depend on.

Structured position: more than a single index, such as 2D grid coordinates for image patches or continuous 3D coordinates for point clouds.

Different methods encode different subsets of this information.

Sinusoidal Positional Encodings

The Original Approach

Vaswani et al. (2017) introduced sinusoidal positional encodings in the original Transformer. For position $pos$ and dimension $i$:

$$PE_{(pos, 2i)} = \sin\left(\frac{pos}{10000^{2i/d}}\right)$$
$$PE_{(pos, 2i+1)} = \cos\left(\frac{pos}{10000^{2i/d}}\right)$$

where $d$ is the model dimension. Even dimensions use sine, odd dimensions use cosine.

Why This Formula?

The design has several clever properties:

Different frequencies: Each dimension has a different wavelength, from $2\pi$ to $10000 \cdot 2\pi$. This provides a unique encoding for each position.

Continuous values: Unlike discrete indices, sinusoidal encodings are continuous. Positions that are close have similar encodings.

Relative position: For any fixed offset $k$, $PE_{pos+k}$ can be expressed as a linear function of $PE_{pos}$:

$$PE_{pos+k} = T_k \cdot PE_{pos}$$

where $T_k$ is a fixed transformation matrix. This means the model can learn to attend to relative positions.

Extrapolation: The model can theoretically handle longer sequences than seen during training, since the function is defined for all positions.

How It's Used

The positional encoding is simply added to the input embedding:

$$X = E + PE$$

where $E$ are the token embeddings and $PE$ are the positional encodings. This additive approach is simple but mixes content and position information.
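
As a concrete reference, here is a minimal PyTorch sketch of the sinusoidal table and the additive combination (assuming an even model dimension; names are illustrative):

```python
import torch

def sinusoidal_encoding(n_positions: int, d_model: int) -> torch.Tensor:
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(...). Assumes even d_model."""
    pos = torch.arange(n_positions, dtype=torch.float32).unsqueeze(1)   # (n, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)                # (d/2,)
    angles = pos / (10000 ** (i / d_model))                             # (n, d/2)
    pe = torch.zeros(n_positions, d_model)
    pe[:, 0::2] = torch.sin(angles)
    pe[:, 1::2] = torch.cos(angles)
    return pe

# X = E + PE: add the encoding to toy token embeddings (seq_len=16, d=512)
E = torch.randn(16, 512)
X = E + sinusoidal_encoding(16, 512)
```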

Limitations

While elegant, sinusoidal encodings have drawbacks:

Fixed form: the encoding cannot adapt to patterns in the data, and in practice learned embeddings often match or beat it on standard benchmarks.

Limited extrapolation in practice: although the function is defined for all positions, quality degrades noticeably beyond the training length.

Entangled representations: adding the encoding to token embeddings mixes content and position in the same vector space.

Learned Positional Embeddings

The Simplest Learned Approach

Instead of fixed sinusoids, simply learn a lookup table of position embeddings. For maximum sequence length $N$, learn a matrix $P \in \mathbb{R}^{N \times d}$ where row $i$ is the embedding for position $i$.

This is used in BERT, GPT, and many other models:

$$X = E + P[0:n]$$

where $P[0:n]$ indexes the first $n$ rows of $P$ for a sequence of length $n$.
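
A minimal PyTorch sketch of this lookup-table approach (the class and argument names are mine, not from any specific model):

```python
import torch
import torch.nn as nn

class LearnedPositionalEmbedding(nn.Module):
    def __init__(self, max_len: int, d_model: int):
        super().__init__()
        self.pos_emb = nn.Embedding(max_len, d_model)    # the table P (N x d)

    def forward(self, token_emb: torch.Tensor) -> torch.Tensor:
        # token_emb: (batch, seq_len, d_model); seq_len must not exceed max_len
        positions = torch.arange(token_emb.size(1), device=token_emb.device)
        return token_emb + self.pos_emb(positions)       # X = E + P[0:n]

emb = LearnedPositionalEmbedding(max_len=512, d_model=768)
x = emb(torch.randn(2, 128, 768))    # works; a 1024-token input would fail
```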

Advantages

Learned embeddings often work better than sinusoidal in practice. They can adapt to the specific patterns in the data. If certain positions have special significance (e.g., sentence boundaries often at position 0), learned embeddings can capture this.

Disadvantages

The main limitation is fixed maximum length. The model can only handle sequences up to length $N$ seen during training. Longer sequences require extending the embedding table, which typically requires fine-tuning.

Also, learned embeddings don't have the mathematical structure of sinusoidal encodings that enables relative position reasoning.

Relative Positional Encodings

The Core Idea

Shaw et al. (2018) argued that relative positions are more important than absolute positions. Instead of encoding "position 5", encode "3 positions apart".

They modified the attention mechanism to incorporate relative positions:

$$e_{ij} = \frac{x_i W^Q (x_j W^K + r_{i-j}^K)^\top}{\sqrt{d}}$$

where $r_{i-j}^K$ is a learned relative position embedding for offset $i-j$.

Similarly for values:

$$z_i = \sum_j \alpha_{ij} (x_j W^V + r_{i-j}^V)$$

Clipping Relative Positions

To limit the number of parameters, relative positions are often clipped to a maximum distance $k$:

$$\text{clip}(i-j) = \max(-k, \min(k, i-j))$$

This means we only learn separate embeddings for offsets in $[-k, k]$. Larger offsets are treated as equivalent.
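
A sketch of how the clipped relative-position term enters the attention logits (single head, key-side term only, illustrative names):

```python
import torch
import torch.nn as nn

def clipped_relative_positions(seq_len: int, k: int) -> torch.Tensor:
    """Matrix of offsets i - j, clipped to [-k, k] and shifted to [0, 2k] for lookup."""
    i = torch.arange(seq_len)[:, None]
    j = torch.arange(seq_len)[None, :]
    return torch.clamp(i - j, -k, k) + k                # (seq_len, seq_len)

seq_len, d, k = 10, 64, 8
rel_emb_K = nn.Embedding(2 * k + 1, d)                  # r^K for offsets in [-k, k]

q = torch.randn(seq_len, d)                             # stand-in for x_i W^Q
key = torch.randn(seq_len, d)                           # stand-in for x_j W^K
r = rel_emb_K(clipped_relative_positions(seq_len, k))   # (seq_len, seq_len, d)

# e_ij = x_i W^Q . (x_j W^K + r_{i-j}) / sqrt(d)
logits = (torch.einsum('id,jd->ij', q, key)
          + torch.einsum('id,ijd->ij', q, r)) / d ** 0.5
attn = torch.softmax(logits, dim=-1)
```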

Benefits

Relative position encodings are more flexible than absolute positions. The meaning of "3 positions apart" is the same whether you're at position 10 or position 100. This improves generalization to longer sequences.

The model learns to focus on relative relationships, which is often what matters for tasks like syntax understanding.

T5 Relative Position Bias

Simplified Relative Positions

Raffel et al. (2020) introduced a simpler form of relative positions in T5. Instead of modifying keys and values, add a learned bias to attention logits:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}} + b\right) V$$

where $b_{ij}$ is a learned scalar bias that depends only on $i - j$.

Bucketed Relative Positions

T5 uses bucketed relative positions: nearby offsets get unique biases, distant offsets share biases.

For example, offsets 0 through 15 might each get their own bucket, offsets 16 through 127 are grouped into logarithmically sized buckets, and anything at or beyond the maximum distance falls into a single shared bucket.

This balances precision for nearby positions with parameter efficiency for distant positions.
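
A simplified sketch of the idea (this is not T5's exact bucketing function, and it treats positive and negative offsets symmetrically): small offsets get their own buckets, larger ones share logarithmically spaced buckets, and a learned per-bucket, per-head scalar is added to the attention logits.

```python
import math
import torch
import torch.nn as nn

def simple_bucket(rel_pos: torch.Tensor, n_exact: int = 16,
                  n_buckets: int = 32, max_distance: int = 128) -> torch.Tensor:
    """Map relative offsets to bucket ids: exact buckets for small |offsets|,
    shared logarithmic buckets up to max_distance, one bucket beyond that."""
    dist = rel_pos.abs()
    exact = dist.clamp(max=n_exact - 1)
    ratio = (dist.float() / n_exact).clamp(min=1).log() / math.log(max_distance / n_exact)
    log_bucket = n_exact + (ratio * (n_buckets - 1 - n_exact)).long()
    return torch.where(dist < n_exact, exact, log_bucket.clamp(max=n_buckets - 1))

seq_len, n_heads, n_buckets = 16, 4, 32
bias_table = nn.Embedding(n_buckets, n_heads)        # one scalar bias per (bucket, head)

rel = torch.arange(seq_len)[None, :] - torch.arange(seq_len)[:, None]   # offsets j - i
bias = bias_table(simple_bucket(rel)).permute(2, 0, 1)                  # (heads, seq, seq)

logits = torch.randn(n_heads, seq_len, seq_len)      # stand-in for QK^T / sqrt(d_k)
attn = torch.softmax(logits + bias, dim=-1)
```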

Benefits

T5's approach is very simple to implement and adds minimal parameters (typically 32-128 buckets per head). It works well in practice and generalizes to longer sequences reasonably well.

Rotary Position Embeddings (RoPE)

A Geometric Approach

Su et al. (2021) introduced Rotary Position Embeddings (RoPE), which encode position via rotation in the complex plane. This became the standard for many LLMs including LLaMA, GPT-NeoX, and PaLM.

The key insight: represent queries and keys as complex numbers, then rotate them by an angle proportional to their position.

The Mathematics

For a 2D subspace with basis $(q_i, q_{i+1})$, rotation by angle $\theta$ is:

$$\begin{pmatrix} q_i' \\ q_{i+1}' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} q_i \\ q_{i+1} \end{pmatrix}$$

For position $m$, RoPE rotates the $i$-th pair of dimensions by angle $m\,\theta_i$, where $\theta_i = 10000^{-2i/d}$ (the same frequency schedule as the sinusoidal encodings).

In higher dimensions, apply different rotations to each pair of dimensions. The query at position $m$ becomes:

$$\mathbf{q}_m = R_m \mathbf{q}$$

where $R_m$ is a block-diagonal rotation matrix.

Why Rotation?

The beauty of RoPE is in the attention computation. The dot product between rotated queries and keys at positions $m$ and $n$ is:

$$q_m^\top k_n = (R_m q)^\top (R_n k) = q^\top R_m^\top R_n k = q^\top R_{n-m} k$$

The inner product depends only on the relative position $n - m$, not absolute positions! Rotation naturally encodes relative position information.
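
A minimal sketch of applying the rotation to queries and keys (adjacent dimensions are paired here; many real implementations pair dimension $i$ with $i + d/2$ instead, which is equivalent up to a permutation of dimensions):

```python
import torch

def apply_rope(x: torch.Tensor, positions: torch.Tensor, base: float = 10000.0):
    """Rotate pairs of dimensions of x (seq_len, d) by position-dependent angles."""
    seq_len, d = x.shape
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)  # theta_i
    angles = positions.unsqueeze(1).float() * inv_freq                    # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]                                       # pair up dims
    rotated = torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1)
    return rotated.flatten(start_dim=1)                                   # (seq, d)

q, k = torch.randn(8, 64), torch.randn(8, 64)
pos = torch.arange(8)
q_rot, k_rot = apply_rope(q, pos), apply_rope(k, pos)
# q_rot @ k_rot.T depends on positions only through the relative offset n - m
```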

Benefits

Relative position awareness: Attention naturally depends on relative positions without explicit bias terms.

No learned parameters: RoPE is a fixed transformation, adding zero parameters.

Length extrapolation: Works better than other methods for sequences longer than training length.

Efficient: Can be implemented efficiently with element-wise operations.

Why RoPE Became Standard

RoPE combines the best aspects of previous approaches: it's parameter-free like sinusoidal encodings, provides relative position information like Shaw et al., and has excellent length extrapolation. It's now the default choice for many large language models.

ALiBi: Attention with Linear Biases

Simplicity Taken Further

Press et al. (2022) proposed ALiBi (Attention with Linear Biases), an even simpler approach: add a static, non-learned penalty to each attention logit, proportional to the distance between query and key:

$$\text{softmax}\left(q_i K^\top + m \cdot [-(i-1), \ldots, -2, -1, 0]\right)$$

where $m$ is a head-specific slope. Closer positions get smaller penalties, distant positions get larger penalties.

No Positional Embeddings!

ALiBi doesn't add anything to the input embeddings. All position information comes from the attention bias. This frees up the input embedding space to focus purely on content.

Head-Specific Slopes

Different attention heads use different slopes $m$. Typically, slopes are powers of 2: $m \in \{2^{-1}, 2^{-2}, 2^{-3}, \ldots\}$.

This allows different heads to have different sensitivity to distance. Some heads focus on nearby tokens, others consider wider context.
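
A sketch of the bias construction for causal attention, using the 8-head slopes quoted above (other head counts use a different geometric sequence in the paper):

```python
import torch

def alibi_bias(n_heads: int, seq_len: int) -> torch.Tensor:
    """Return (n_heads, seq_len, seq_len) bias: -slope * distance, for causal attention."""
    slopes = torch.tensor([2.0 ** -(h + 1) for h in range(n_heads)])  # 2^-1, 2^-2, ... (8 heads)
    i = torch.arange(seq_len)[:, None]           # query positions
    j = torch.arange(seq_len)[None, :]           # key positions
    distance = (i - j).clamp(min=0)              # 0 for the current token, grows with the gap
    return -slopes.view(-1, 1, 1) * distance     # (heads, seq, seq)

n_heads, seq_len, d_k = 8, 16, 64
logits = torch.randn(n_heads, seq_len, seq_len) / d_k ** 0.5   # stand-in for QK^T / sqrt(d_k)
causal = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
logits = logits.masked_fill(~causal, float('-inf')) + alibi_bias(n_heads, seq_len)
attn = torch.softmax(logits, dim=-1)
```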

Exceptional Length Extrapolation

ALiBi's key advantage is length extrapolation. Models trained on sequences of length 512 can run inference on sequences of length 2048 and longer with minimal increase in perplexity.

Why? The linear bias is well-defined for any distance. The model learns a consistent "distance penalty" that applies to unseen distances.

Trade-offs

ALiBi works excellently for language modeling but may not be optimal for all tasks. Tasks requiring precise absolute position information might benefit from other methods.

Also, the linear decay is a strong inductive bias. For some applications (e.g., very long-range dependencies), this bias might be limiting.

Extended Context Methods

Position Interpolation (PI)

Chen et al. (2023) proposed Position Interpolation for extending RoPE to longer contexts. Instead of extrapolating to unseen positions, interpolate within the seen range.

For training length $L$ and target length $L'$, scale positions:

$$\theta'_i = \frac{m \cdot L}{L'} / 10000^{2i/d}$$

This compresses the position space, allowing longer sequences to use the same rotation angles seen during training. Requires minimal fine-tuning (hundreds of steps) to adapt.
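
In code, PI is a one-line change to the RoPE angle computation: positions are scaled by $L / L'$ before the rotation (the `scale` argument is my naming; the structure mirrors the RoPE sketch above):

```python
import torch

def rope_angles(positions: torch.Tensor, d: int, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Rotation angles theta'_i = (m * scale) / base^(2i/d) for each position m."""
    inv_freq = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    return positions.unsqueeze(1).float() * scale * inv_freq

L, L_new, d = 2048, 8192, 128
pos = torch.arange(L_new)
angles = rope_angles(pos, d, scale=L / L_new)   # positions compressed back into [0, L)
```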

YaRN: Yet another RoPE extensioN

Peng et al. (2023) improved position interpolation with temperature scaling and frequency adjustments. Different frequency components are scaled differently: high-frequency components, which encode fine-grained local offsets, are interpolated only slightly or not at all, while low-frequency components, which encode coarse long-range structure, are interpolated aggressively, with a smooth ramp between the two regimes.

This preserves both fine-grained local information and coarse-grained long-range information.

NTK-Aware Interpolation

Based on Neural Tangent Kernel theory, this method adjusts the base frequency ($10000$ in standard RoPE) when interpolating:

$$\text{base}' = \text{base} \times \left(\frac{L'}{L}\right)^{d/(d-2)}$$

This preserves high-frequency information better than naive interpolation.
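
The corresponding sketch scales the base instead of the positions (variable names are mine):

```python
import torch

def ntk_scaled_base(base: float, L: int, L_new: int, d: int) -> float:
    """base' = base * (L'/L)^(d/(d-2))."""
    return base * (L_new / L) ** (d / (d - 2))

L, L_new, d = 2048, 8192, 128
new_base = ntk_scaled_base(10000.0, L, L_new, d)
inv_freq = new_base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
# high-frequency (small i) components barely change; low-frequency ones are stretched
```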

2D and Multidimensional Position Embeddings

Images: 2D Position Encodings

For images represented as sequences of patches, we need 2D positions. Common approaches:

Separate x and y encodings: Concatenate 1D encodings for horizontal and vertical positions.

2D sinusoidal: Extend sinusoidal formula to 2D:

$$PE(x, y, i) = \sin\left(\frac{x}{10000^{i/d_x}}\right) + \sin\left(\frac{y}{10000^{i/d_y}}\right)$$

Learned 2D embeddings: Learn a table $P[i, j]$ for each (row, column) position.
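
As an illustration of the first option, here is a sketch that concatenates separate row and column sinusoidal encodings for a grid of patches (the sin and cos halves are block-concatenated rather than interleaved, which carries the same information):

```python
import torch

def sin_cos_1d(n: int, d: int) -> torch.Tensor:
    """1D sinusoidal encoding with sin and cos blocks concatenated (d assumed even)."""
    pos = torch.arange(n, dtype=torch.float32)[:, None]
    freqs = 10000.0 ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)
    angles = pos * freqs                                      # (n, d/2)
    return torch.cat([angles.sin(), angles.cos()], dim=-1)    # (n, d)

H, W, d = 14, 14, 768                  # e.g. a 14x14 grid of image patches
pe_y = sin_cos_1d(H, d // 2)           # row encoding, half the dimensions
pe_x = sin_cos_1d(W, d // 2)           # column encoding, the other half

# concatenate the row and column encodings for every (y, x) patch
pe_2d = torch.cat(
    [pe_y[:, None, :].expand(H, W, d // 2),
     pe_x[None, :, :].expand(H, W, d // 2)],
    dim=-1,
).reshape(H * W, d)                    # one position vector per patch, in raster order
```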

Video: 3D Spatiotemporal Encodings

Video adds a temporal dimension. Options include factorized encodings (separate spatial and temporal position embeddings, added or concatenated), a full learned table over (frame, row, column) triples, or extending the sinusoidal formulation across all three axes.

Point Clouds: Continuous 3D Positions

For 3D point clouds, positions are continuous coordinates $(x, y, z)$. Common approach: apply a coordinate encoding network:

$$PE(x, y, z) = \text{MLP}([x, y, z])$$

Or use Fourier features to encode continuous coordinates.
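
A sketch of the Fourier-feature option (the random projection matrix B and its scale are assumptions; in practice B is sampled once and stored, or replaced by learned frequencies or an MLP):

```python
import math
import torch

def fourier_features(xyz: torch.Tensor, n_freqs: int = 64, sigma: float = 1.0) -> torch.Tensor:
    """Encode continuous coordinates (N, 3) as [sin(2*pi*xyz B), cos(2*pi*xyz B)]."""
    B = torch.randn(3, n_freqs) * sigma        # random frequencies (sample once in practice)
    proj = 2 * math.pi * xyz @ B               # (N, n_freqs)
    return torch.cat([proj.sin(), proj.cos()], dim=-1)   # (N, 2 * n_freqs)

points = torch.rand(1024, 3)                   # a toy point cloud in the unit cube
pe = fourier_features(points)                  # (1024, 128) positional features
```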

Conditional and Adaptive Position Encodings

Conditional Position Encodings (CPE)

Chu et al. (2021) proposed generating position encodings conditioned on input content. A small network takes local input features and outputs position encodings.

This allows position representation to adapt based on what's at that position, potentially capturing richer structural information.

Implicit Position Encoding

Some architectures (like Perceiver) use cross-attention from learnable queries to inputs. Position information is implicitly learned in the query structure rather than explicitly added to inputs.

Ablations and Analysis

How Much Does Position Encoding Matter?

Without any position encoding, transformers fail completely on tasks requiring position awareness (accuracy drops to random chance on many NLP tasks).

With position encoding, the choice of method matters but is often less critical than other hyperparameters (model size, training data, etc.) for standard-length sequences.

For long sequences and extrapolation, the choice becomes critical. RoPE and ALiBi significantly outperform learned embeddings.

What Do Models Learn?

Analysis of attention patterns shows that lower layers tend to attend locally, with some heads locking onto fixed offsets such as the previous token, while higher layers attend more broadly and are driven more by content than by position.

Different position encoding methods can influence these patterns, but the general hierarchy tends to emerge regardless.

Generalization to Longer Sequences

Learned embeddings generalize poorly beyond training length. Sinusoidal encodings are better but still degrade significantly.

RoPE and especially ALiBi show much better extrapolation. For extending context length, these are the clear winners.

Practical Guidelines

Which Method to Choose?

Standard NLP tasks (fixed length): Learned embeddings work well and are simple. T5 relative bias is also excellent.

Long sequence modeling: RoPE is the current standard. ALiBi is great if you need extreme length extrapolation.

Vision tasks: Learned 2D embeddings or 2D sinusoidal encodings. Many vision transformers don't use explicit position encodings, relying on architectural inductive biases instead.

Multimodal: Often combine different encodings for different modalities (RoPE for text, learned for image patches).

Implementation Considerations

Learned embeddings: Simplest to implement, just create a parameter matrix.

Sinusoidal: Fixed computation, no parameters, easy to implement.

RoPE: More complex implementation but widely available in libraries. Worth the effort for LLMs.

ALiBi: Very simple to implement, just add bias to attention logits.

Future Directions

Infinite Length Models

Current methods still struggle with very long sequences (100k+ tokens); handling effectively unbounded context without retraining remains an open problem.

Better Extrapolation

While RoPE and ALiBi extrapolate well, there is room for improvement in how far, and how gracefully, models generalize beyond their training length.

Task-Specific Encodings

Different tasks might benefit from different position representations; designing encodings matched to the structure of a specific task or modality remains largely unexplored.

Conclusion

Positional embeddings solve a fundamental problem: giving order-agnostic attention mechanisms awareness of position. The field has progressed from simple learned tables to sophisticated rotation-based encodings.

Key insights:

Relative position information generally matters more than absolute position, and the most successful methods (Shaw et al., T5 bias, RoPE, ALiBi) encode it directly.

Parameter-free schemes built into the attention computation (RoPE, ALiBi) extrapolate to longer sequences far better than learned lookup tables.

Simple mechanisms, such as a linear distance penalty or a rotation of queries and keys, can carry all the position information a model needs.

The choice of position encoding significantly impacts a model's ability to handle long sequences and generalize to unseen lengths. As models continue to scale and tackle longer contexts, position encodings remain an active area of research.

For practitioners building transformers today: use RoPE for language models, learned embeddings or no explicit encoding for vision, and ALiBi if extreme length extrapolation is needed. But stay tuned - this field continues to evolve rapidly.