Today I Learned

A collection of short notes from my cross-disciplinary studies, shared as I learn in public.

January 1, 2024 - April 20, 2026

March 2026

posted on 03.30.2026

Mixture-of-Experts Architecture in LLMs

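Mixture-of-Experts (MoE) is an architecture in which parts of a model (in Transformer LLMs, typically the feed-forward layers) are split into several smaller "expert" sub-networks instead of one monolithic block. For each token, a gating network (also called a router) picks only the few experts that score highest for that token, and only those experts run.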

Input → Gating Network → selects top-k experts → Expert outputs → Combined output
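To make the pipeline above concrete for myself, here is a minimal sketch of one MoE layer handling a single token, in plain Python. It rests on several assumptions that are mine, not taken from any specific model: each expert is a single linear map (real experts are usually small feed-forward networks), the gate is one weight matrix, the mixing weights are a softmax over the selected experts' logits, and the names `moe_forward`, `gate_weights`, and `experts` are made up for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def moe_forward(token, gate_weights, experts, k=2):
    """Route one token to its top-k experts and mix their outputs."""
    # Gating network: one logit per expert.
    logits = matvec(gate_weights, token)
    # Keep only the k experts with the highest logits.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Renormalize over the selected experts so the mixing weights sum to 1.
    weights = softmax([logits[i] for i in top])
    # Combined output: weighted sum of the selected experts' outputs.
    output = [0.0] * len(token)
    for w, i in zip(weights, top):
        expert_out = matvec(experts[i], token)  # simplified expert: one linear map
        output = [o + w * e for o, e in zip(output, expert_out)]
    return output, top, weights


random.seed(0)
DIM, NUM_EXPERTS = 4, 8
gate_weights = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_EXPERTS)]
experts = [
    [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(DIM)]
    for _ in range(NUM_EXPERTS)
]
token = [random.gauss(0, 1) for _ in range(DIM)]

output, chosen, weights = moe_forward(token, gate_weights, experts, k=2)
print("chosen experts:", chosen)
print("mixing weights:", [round(w, 3) for w in weights])
```

Only 2 of the 8 experts do any work for this token; the other 6 are never evaluated, which is where the compute savings come from.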

Key properties:

Sparse activation: Only a fraction of the total parameters are used per forward pass. A 1.8T parameter MoE model might only activate 200B parameters per token.
Efficiency: Training and inference need fewer FLOPs per token than a dense model of the same total size, because most experts are dormant for any given input.
Specialization: Experts can come to handle different kinds of input (for example code, prose, or math) without being explicitly assigned to them; the routing is learned during training.
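The gap between total and active parameters follows from simple counting, which I find easier to see in code. This is a rough sketch under my own assumptions: every MoE layer has the same number of equally sized experts, the router's own parameters are small enough to ignore, and the configuration values are hypothetical, chosen only to keep the arithmetic readable.

```python
def moe_param_counts(shared_params, params_per_expert, num_experts, top_k, num_layers):
    """Return (total, active-per-token) parameter counts for a stack of MoE layers.

    shared_params covers everything every token passes through
    (attention, embeddings, and so on). Router parameters are ignored.
    """
    total = shared_params + num_layers * num_experts * params_per_expert
    active = shared_params + num_layers * top_k * params_per_expert
    return total, active


# Hypothetical configuration, not taken from any real model.
total, active = moe_param_counts(
    shared_params=2_000_000_000,
    params_per_expert=100_000_000,
    num_experts=16,
    top_k=2,
    num_layers=32,
)
print(f"total:  {total / 1e9:.1f}B")
print(f"active: {active / 1e9:.1f}B ({active / total:.1%} of total)")
```

Adding more experts grows `total` but leaves `active` unchanged, as long as `top_k` stays fixed.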

Google researchers drove much of this work in deep learning through a series of papers and models:

Sparsely-Gated MoE (Shazeer et al., 2017): introduced a sparsely-gated MoE layer with noisy top-k gating
GShard (Lepikhin et al., 2020): scaled MoE Transformers to 600B parameters
Switch Transformer (Fedus et al., 2021): simplified routing to a single (top-1) expert per token
Gemini 1.5 (2024): production MoE; Google reports that 1.5 Pro reaches quality comparable to 1.0 Ultra with less compute

The tradeoff: MoE models have large total parameter counts, so they need more memory to load, but they use fewer FLOPs per token than a dense model of the same total size, which makes inference cheaper. This is consistent with Google's report that Gemini 1.5 Pro reaches quality comparable to 1.0 Ultra while using less compute.
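A back-of-the-envelope estimate shows both sides of this tradeoff, using the 1.8T total / 200B active figures from the example above. The assumptions are mine: weights stored in 16 bits (2 bytes per parameter), and the common rule of thumb of roughly 2 FLOPs per active parameter per token for a forward pass. It ignores activations, the KV cache, and routing overhead, so treat the numbers as orders of magnitude rather than measurements.

```python
BYTES_PER_PARAM = 2          # assumption: 16-bit weights
FLOPS_PER_ACTIVE_PARAM = 2   # rule of thumb: one multiply and one add per weight


def inference_cost(total_params, active_params):
    """Return (weight memory in GB, forward-pass FLOPs per token)."""
    weight_memory_gb = total_params * BYTES_PER_PARAM / 1e9
    flops_per_token = active_params * FLOPS_PER_ACTIVE_PARAM
    return weight_memory_gb, flops_per_token


TOTAL = 1_800_000_000_000   # 1.8T parameters (illustrative figure from the note)
ACTIVE = 200_000_000_000    # 200B active per token (illustrative figure from the note)

moe_mem, moe_flops = inference_cost(TOTAL, ACTIVE)
dense_mem, dense_flops = inference_cost(TOTAL, TOTAL)  # dense model of the same size

print(f"weight memory (both): {moe_mem:,.0f} GB")
print(f"MoE FLOPs/token:      {moe_flops:.1e}")
print(f"dense FLOPs/token:    {dense_flops:.1e}")
print(f"compute ratio:        {dense_flops / moe_flops:.1f}x")
```

Both models need the same memory to hold their weights, but under these assumptions the MoE model does about 9x less arithmetic per token.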
