Artificial Intelligence

Mixture of Depths

A transformer architecture in which different tokens pass through different numbers of layers: a learned router decides, at each block, which tokens are processed and which skip ahead, so the model spends more computation on complex tokens and less on simple ones.

Why It Matters

MoD makes transformers more efficient by routing easy computations through fewer layers, reducing average inference cost while maintaining quality.
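This routing idea can be sketched in a few lines. The toy code below is a simplified illustration, not the published implementation: a router scores each token, only the top-k tokens within a fixed compute budget pass through the block's transformation, and the remaining tokens skip it via the residual stream. The weights and the `tanh` "block" are stand-ins for a real attention/MLP block.

```python
import numpy as np

def mod_block(x, router_w, block_w, capacity=0.5):
    """x: (seq, dim) token activations. Route the top-k scoring tokens
    through the block; all other tokens pass through unchanged."""
    seq, dim = x.shape
    k = max(1, int(seq * capacity))        # compute budget for this block
    scores = x @ router_w                  # (seq,) router logits, one per token
    top = np.argsort(scores)[-k:]          # indices of the routed tokens
    out = x.copy()                         # skipped tokens: identity (residual only)
    out[top] = x[top] + np.tanh(x[top] @ block_w)  # routed tokens: block + residual
    return out, top

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4))
out, routed = mod_block(x, rng.standard_normal(4), rng.standard_normal((4, 4)))
print(len(routed))  # prints 4: half of the 8 tokens were processed by this block
```

With `capacity=0.5`, each block does the heavy computation for only half the tokens, which is where the average inference savings come from.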

Example

The word 'the' might skip most transformer layers (it is simple), while the word 'paradoxically' passes through all of them (it requires more processing).

Think of it like...

Like an express lane at the airport — passengers with simple cases go through quickly while complex cases get more thorough processing.

Related Terms