
Today I Learned

A collection of short notes from my cross-disciplinary studies, shared as I learn in public.

status: In Progress

Status Indicator

The status indicator reflects the current state of the work:

  • Abandoned: Work that has been discontinued
  • Notes: Initial collections of thoughts and references
  • Draft: Early structured version with a central thesis
  • In Progress: Well-developed work actively being refined
  • Finished: Completed work with no planned major changes

This helps readers understand the maturity and completeness of the content.

·
certainty: certain

Confidence Rating

The confidence tag expresses how well-supported the content is, or how likely its overall ideas are to be right. It uses a scale from "certain" to "impossible", based on the Kesselman List of Estimative Words:

  1. "certain"
  2. "highly likely"
  3. "likely"
  4. "possible"
  5. "unlikely"
  6. "highly unlikely"
  7. "remote"
  8. "impossible"

Even ideas that seem unlikely may be worth exploring if their potential impact is significant enough.

·
importance: 7/10

Importance Rating

The importance rating distinguishes between trivial topics and those which might change your life. Using a scale from 0-10, content is ranked based on its potential impact on:

  • the reader
  • the intended audience
  • the world at large

For example, topics about fundamental research or transformative technologies would rank 9-10, while personal reflections or minor experiments might rank 0-1.

March 2026
posted on 03.30.2026

Mixture-of-Experts Architecture in LLMs

Mixture-of-Experts (MoE) is an architecture where a model is divided into smaller "expert" sub-networks instead of one monolithic neural network. For any given input, a gating mechanism routes tokens to only the most relevant experts.

Input → Gating Network → selects top-k experts → Expert outputs → Combined output

Key properties:

  • Sparse activation: Only a fraction of the total parameters are used per forward pass. A 1.8T parameter MoE model might only activate 200B parameters per token.
  • Efficiency: Training and inference use fewer FLOPs per token because most of the network stays dormant for any given input.
  • Specialization: Experts learn to handle different types of inputs (code vs prose vs math) without explicit assignment.
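The routing loop above can be sketched in a few lines of plain Python. This is a toy illustration, not a real MoE layer: the "experts" are simple functions, the gating network is a fixed dot product rather than a learned projection, and all names (`top_k_route`, `gate_weights`) are made up for this sketch.

```python
import math
import random

def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(token, experts, gate_weights, k=2):
    """Send a token through the k experts with the highest gate score
    and return the gate-weighted combination of their outputs."""
    # Gating network: one logit per expert (a dot product here;
    # real models learn this projection jointly with the experts).
    logits = [sum(w * x for w, x in zip(row, token)) for row in gate_weights]
    probs = softmax(logits)
    # Keep only the top-k experts; the rest stay dormant this pass.
    top = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)  # renormalize the kept gate weights
    out = [0.0] * len(token)
    for i in top:
        y = experts[i](token)
        out = [o + (probs[i] / norm) * yi for o, yi in zip(out, y)]
    return out, top

# Four toy "experts": each just scales the token differently.
experts = [lambda t, s=s: [s * x for x in t] for s in (1.0, 2.0, 0.5, -1.0)]
random.seed(0)
gate_weights = [[random.gauss(0, 1) for _ in range(3)] for _ in experts]
out, chosen = top_k_route([0.1, -0.4, 0.7], experts, gate_weights, k=2)
```

Note that only the two chosen expert functions ever run; the other two contribute zero compute, which is the whole point of sparse activation.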

Google pioneered this in deep learning through a lineage of papers:

  1. Sparsely-Gated MoE (Shazeer et al., 2017) — introduced trainable gating
  2. GShard (Lepikhin et al., 2020) — scaled MoE to 600B parameters
  3. Switch Transformer (Fedus et al., 2021) — simplified to top-1 expert routing
  4. Gemini 1.5 (2024) — production MoE, 1.5 Pro matches Ultra performance at lower compute

The tradeoff: MoE models have large total parameter counts (need more memory to load), but use fewer FLOPs per token (faster inference). This is why Gemini 1.5 Pro can match 1.0 Ultra quality while being cheaper to run.
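The arithmetic behind the tradeoff is easy to check, using the illustrative 1.8T / 200B figures from above (not published specs for any particular model):

```python
# Back-of-envelope: what fraction of a sparse model works per token?
total_params = 1.8e12    # parameters that must sit in memory
active_params = 200e9    # parameters actually exercised per token
fraction = active_params / total_params
print(f"{fraction:.1%} of parameters active per token")  # 11.1%
```

So the model pays a ~9x memory cost to run roughly 9x less compute per token than a dense model of the same total size.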

