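Mixture-of-Experts Architecture in LLMs
Mixture-of-Experts (MoE) is an architecture where parts of a model are divided into smaller "expert" sub-networks instead of one monolithic neural network. In transformer LLMs the experts typically replace the feed-forward layers. For each token, a learned gating mechanism (the router) sends it to only the few experts that score highest for it.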
Key properties:
Sparse activation: Only a fraction of the total parameters are used per forward pass. A 1.8T-parameter MoE model might activate only 200B parameters per token, about 11% of the total.
Efficiency: Training and inference need fewer FLOPs per token than a dense model with the same total parameter count, because most of the network is dormant for any given input.
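Specialization: Experts can come to handle different types of inputs (for example code vs prose vs math) without explicit assignment. The division is learned by the router, so it does not have to follow clean human categories.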
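The routing step described above can be sketched in a few lines. This is a minimal illustration, not any particular model's implementation: the names `top_k_route`, `moe_forward`, and `make_expert` are hypothetical, the experts are toy scaling functions, and the choice to renormalize with a softmax over only the selected logits is an assumption (some designs weight by the softmax over all experts instead).

```python
import math

def softmax(xs):
    m = max(xs)                      # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_logits, k):
    """Return [(expert_index, weight)] for the k highest-scoring experts.

    Assumption: weights are a softmax over the selected logits only,
    so they sum to 1.
    """
    ranked = sorted(range(len(gate_logits)),
                    key=lambda i: gate_logits[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_logits[i] for i in chosen])
    return list(zip(chosen, weights))

def moe_forward(token, gate_weights, experts, k=2):
    """Mix the outputs of the k experts selected for one token vector."""
    # One gate logit per expert: dot product of its gate row with the token.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    routes = top_k_route(logits, k)
    out = [0.0] * len(token)
    for idx, weight in routes:
        expert_out = experts[idx](token)   # only the selected experts run
        out = [o + weight * e for o, e in zip(out, expert_out)]
    return out, routes

def make_expert(scale):
    # Toy stand-in for an expert feed-forward network (hypothetical).
    return lambda vec: [scale * v for v in vec]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, -1.0]]
token = [0.5, 2.0]

output, routes = moe_forward(token, gate_weights, experts, k=2)
print(routes)   # experts 2 and 1 are selected; experts 0 and 3 never run
print(output)
```

With k=1 the same function reduces to the single-expert routing that Switch Transformer uses: one expert runs per token and its weight is 1.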
Google pioneered this in deep learning through a lineage of papers:
Sparsely-Gated MoE (Shazeer et al., 2017): introduced trainable gating
GShard (Lepikhin et al., 2020): scaled MoE to 600B parameters
Switch Transformer (Fedus et al., 2021): simplified to top-1 expert routing
Gemini 1.5 (2024): production MoE; Google reports 1.5 Pro performing comparably to 1.0 Ultra with less compute
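The tradeoff: MoE models have large total parameter counts, so every expert must be held in memory even though only a few run per token, but they use fewer FLOPs per token than a dense model of the same size, which makes inference cheaper. This is part of how Gemini 1.5 Pro can reach quality comparable to 1.0 Ultra while being cheaper to run, according to Google.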
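To make the tradeoff concrete, here is a rough back-of-the-envelope sketch using the illustrative 1.8T total / 200B active figures from the sparse-activation example. The function `moe_cost` is hypothetical, and two simplifications are assumed: weights are stored at 2 bytes per parameter (16-bit), and a forward pass costs about 2 FLOPs per active parameter per token. Real deployments differ in both.

```python
def moe_cost(total_params, active_params, bytes_per_param=2, flops_per_param=2):
    """Rough memory and compute estimate for one forward pass.

    Assumptions (not from any specific model): 16-bit weights and
    about 2 FLOPs per active parameter per token.
    """
    return {
        "weight_memory_gb": total_params * bytes_per_param / 1e9,
        "flops_per_token": active_params * flops_per_param,
        "active_fraction": active_params / total_params,
    }

moe = moe_cost(total_params=1.8e12, active_params=200e9)
dense = moe_cost(total_params=1.8e12, active_params=1.8e12)

# Memory is set by the total parameter count, so it is the same for both.
print(moe["weight_memory_gb"], dense["weight_memory_gb"])
# Compute is set by the active parameter count, so the MoE is far cheaper.
print(dense["flops_per_token"] / moe["flops_per_token"])
```

Under these assumptions both models need the same memory to hold their weights, while the MoE does roughly one ninth of the per-token compute.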