March 2026
posted on 03.30.2026

Mixture-of-Experts Architecture in LLMs

Mixture-of-Experts (MoE) is an architecture where parts of a model (in Transformers, typically the feed-forward layers) are split into smaller "expert" sub-networks instead of one monolithic block. For each token, a gating mechanism scores the experts and routes the token to only the top-scoring few.

Input → Gating Network → selects top-k experts → Expert outputs → Combined output
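The pipeline above can be sketched in a few lines of plain Python. This is a toy illustration, not any particular model's implementation: the names `moe_forward`, `gate_weights`, and `make_expert` are made up for the example, the "experts" are simple scaling functions, and the gate is a single linear layer with a softmax taken over only the selected experts.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(token, gate_weights, experts, k=2):
    """Route one token vector through its top-k experts and mix the results."""
    # One gating logit per expert: dot product of the token with that expert's gate row.
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    # Keep only the k highest-scoring experts; the others are never evaluated.
    top = sorted(range(len(experts)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize the gate scores over the selected experts only.
    weights = softmax([logits[i] for i in top])
    output = [0.0] * len(token)
    for w, i in zip(weights, top):
        expert_out = experts[i](token)
        output = [o + w * e for o, e in zip(output, expert_out)]
    return output, top

def make_expert(scale):
    """Hypothetical stand-in for an expert network: scales its input."""
    return lambda v: [scale * x for x in v]

experts = [make_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]
rng = random.Random(0)  # fixed seed so the toy gate is reproducible
gate_weights = [[rng.uniform(-1, 1) for _ in range(3)] for _ in experts]

token = [0.5, -1.0, 2.0]
output, chosen = moe_forward(token, gate_weights, experts, k=2)
print("experts used:", chosen)
print("combined output:", output)
```

Only `k` of the four experts are called for this token, which is the whole point: the cost per token depends on `k`, not on how many experts exist.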

Key properties:

  • Sparse activation: Only a fraction of the total parameters is used for each token. A 1.8T-parameter MoE model might activate only 200B parameters per token.
  • Efficiency: Training and inference need fewer FLOPs per token than a dense model of the same total size, because most of the experts are dormant for any given token.
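  • Specialization: Experts can come to handle different kinds of inputs without explicit assignment. In practice the learned split is often at the token or syntax level rather than clean domains like code vs prose vs math.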
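To make the sparse-activation arithmetic concrete, here is a small sketch that counts total versus active parameters. The helper `moe_param_counts` and the configuration (20B shared parameters, 16 experts of 10B each, top-2 routing) are hypothetical numbers picked for easy arithmetic, not the layout of any real model.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k):
    """Return (total, active-per-token) parameter counts for a simplified MoE.

    Assumes every token uses all shared parameters (attention, embeddings)
    plus exactly top_k experts.
    """
    total = shared_params + num_experts * params_per_expert
    active = shared_params + top_k * params_per_expert
    return total, active

# Hypothetical configuration, chosen only for illustration.
total, active = moe_param_counts(
    shared_params=20_000_000_000,
    params_per_expert=10_000_000_000,
    num_experts=16,
    top_k=2,
)
print(f"total:  {total / 1e9:.0f}B")    # 180B
print(f"active: {active / 1e9:.0f}B")   # 40B
print(f"active fraction: {active / total:.1%}")

# The example figures from the list above: 200B active out of 1.8T total.
print(f"1.8T / 200B example: {200e9 / 1.8e12:.1%} of parameters active")
```

Adding more experts grows `total` but leaves `active` unchanged, which is how MoE models scale capacity without scaling per-token compute.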

Google pioneered this in deep learning through a lineage of papers:

  1. Sparsely-Gated MoE (Shazeer et al., 2017): introduced sparse, noisy top-k gating so only a few experts run per example
  2. GShard (Lepikhin et al., 2020): scaled MoE to 600B parameters
  3. Switch Transformer (Fedus et al., 2021): simplified to top-1 expert routing
  4. Gemini 1.5 (2024): production MoE; Google reports 1.5 Pro performing comparably to 1.0 Ultra with less compute

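The tradeoff: MoE models have large total parameter counts, and every expert has to be held in memory even though only a few run per token. In exchange they use far fewer FLOPs per token than a dense model of the same total size. This is part of why Gemini 1.5 Pro is reported to reach 1.0 Ultra-level quality while being cheaper to run.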
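A back-of-the-envelope sketch of that tradeoff, using the 1.8T total / 200B active figures from earlier. Two assumptions are baked in and both are rough: weights are stored at 2 bytes per parameter (16-bit), and a forward pass costs about 2 FLOPs per active parameter per token, which ignores attention over the context and routing overhead. The function names are made up for the example.

```python
def weight_memory_gb(total_params, bytes_per_param=2):
    """Memory needed just to hold the weights, in GB (assumes 16-bit weights by default)."""
    return total_params * bytes_per_param / 1e9

def forward_flops_per_token(active_params):
    """Rough rule of thumb: ~2 FLOPs per active parameter per token (an approximation)."""
    return 2 * active_params

TOTAL_PARAMS = 1_800_000_000_000   # 1.8T, the example figure from the text
ACTIVE_PARAMS = 200_000_000_000    # 200B active per token

memory_gb = weight_memory_gb(TOTAL_PARAMS)
moe_flops = forward_flops_per_token(ACTIVE_PARAMS)
dense_flops = forward_flops_per_token(TOTAL_PARAMS)  # dense model of the same total size

print(f"weights in memory: {memory_gb:.0f} GB")
print(f"MoE FLOPs/token:   {moe_flops:.1e}")
print(f"dense FLOPs/token: {dense_flops:.1e}")
print(f"compute ratio (dense / MoE): {dense_flops / moe_flops:.1f}x")
```

Memory scales with the total parameter count while compute scales with the active count, so under these assumptions the model pays the full memory bill but roughly one ninth of the per-token compute.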
