Research, June 8, 2026

Mixture-of-Experts (MoE) — Expert Model Collaboration

Mixture-of-Experts (MoE) is a model architecture that lets large language models keep a high total parameter count while keeping the compute spent on each token low. It does this by activating only a few "expert" sub-networks for each input.

What is Mixture-of-Experts (MoE)?

In an MoE model, a dense layer (typically the feed-forward block of a Transformer) is replaced by several parallel "expert" sub-networks plus a small routing network. For each input token, the router picks the most relevant experts, and only those experts run; the rest stay idle for that token.

How Does It Work?

In the MoE architecture, a gating network (also called the router) scores every expert for each input and decides which ones to use. The process involves:

  1. Input Analysis: The gating network computes one score (logit) per expert from the incoming token representation
  2. Expert Selection: Top-K selection keeps the K highest-scoring experts (usually K=1 or K=2)
  3. Weighted Combination: The selected experts' outputs are summed, weighted by their normalized gate scores
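
The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the names top_k_gate and moe_forward, the toy experts, and the fixed gate logits are all assumptions made for the example. It normalizes with a softmax over only the selected logits; some models instead apply the softmax over all experts before selecting.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_gate(logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    weights = softmax([logits[i] for i in top])  # renormalize over the chosen experts only
    return list(zip(top, weights))

def moe_forward(x, gate_logits, experts, k=2):
    """Run only the selected experts and sum their outputs, weighted by the gate."""
    out = [0.0] * len(x)
    for idx, w in top_k_gate(gate_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (hypothetical): each one scales the input by a different constant
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]

# In a real model these logits come from a learned linear layer applied to x
gate_logits = [0.1, 2.0, 0.3, 2.0]

y = moe_forward([1.0, -1.0], gate_logits, experts, k=2)
print(y)
```

Here experts 1 and 3 tie for the highest score, so each receives weight 0.5 and the other two experts are never evaluated. That skipped work is where the compute savings come from.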

Advantages

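  • Computational Efficiency: Only a fraction of the parameters run for each token. Mixtral 8x7B, for example, has 46.7B total parameters but uses only about 12.9B per token
  • Scalability: Total parameter count can grow by adding experts without a proportional increase in per-token compute
  • Specialization: Each expert can specialize in different kinds of inputs
  • Low Latency: Sparse activation needs fewer FLOPs per token than a dense model of the same total size, which can speed up inference, although all experts must still be held in memory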
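
The gap between total and active parameters is simple arithmetic. The sketch below assumes a Mixtral-8x7B-like layout (model width 4096, expert hidden width 14336, 32 layers, three weight matrices per expert) and lumps everything that is not an expert (attention, embeddings, routers) into a rough 1.6B figure. Treat these numbers as approximations for illustration; moe_param_counts is a hypothetical helper.

```python
def moe_param_counts(non_expert, per_expert, n_experts, k):
    """Return (total, active-per-token) parameter counts for a top-k MoE."""
    total = non_expert + n_experts * per_expert
    active = non_expert + k * per_expert
    return total, active

# Assumed Mixtral-8x7B-like dimensions (approximate)
d_model, d_ff, n_layers = 4096, 14336, 32
per_expert = 3 * d_model * d_ff * n_layers  # one expert's weights summed over all layers
non_expert = 1.6e9                          # rough estimate for attention, embeddings, routers

total, active = moe_param_counts(non_expert, per_expert, n_experts=8, k=2)
print(f"total ~ {total / 1e9:.1f}B, active ~ {active / 1e9:.1f}B")
```

Under these assumptions the totals come out near 46.7B and 12.9B. This also shows why "8x7B" does not mean 56B: the experts share the attention layers and embeddings.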

Popular MoE Models

  • Mixtral 8x7B (Mistral AI): 46.7B params, 8 experts (2 active)
  • DeepSeek-MoE (DeepSeek): 16B params, 64 routed experts (6 active) plus 2 always-active shared experts
  • Switch Transformer (Google): 1.6T params, 2048 experts (1 active)
  • GLaM (Google): 1.2T params, 64 experts (2 active)
  • Mixtral 8x22B (Mistral AI): 141B params, 8 experts (2 active)

Technical Details

Gating Architectures

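  • Top-K Gating: Each token is sent to the K experts with the highest router scores
  • Noisy Top-K Gating: Adds input-dependent Gaussian noise to the router scores during training to encourage exploration and spread load
  • Expert Choice Routing: Reverses the selection so that each expert picks its top-scoring tokens up to a fixed capacity, instead of each token picking experts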
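
Two of these variants are easy to sketch. The code below is illustrative only: noisy_logits follows the common formulation in which standard Gaussian noise is scaled by a softplus of a second, learned projection, and expert_choice lets each expert keep its highest-affinity tokens. The function names, the hand-written scores, and the fixed capacity are assumptions for the example.

```python
import math
import random

def softplus(x):
    # Stable softplus: log(1 + exp(x))
    return math.log1p(math.exp(-abs(x))) + max(x, 0.0)

def noisy_logits(clean, noise_raw, rng):
    """Noisy top-k: add Gaussian noise whose scale is softplus(noise_raw[i])."""
    return [c + rng.gauss(0.0, 1.0) * softplus(n) for c, n in zip(clean, noise_raw)]

def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.

    Each expert keeps its `capacity` highest-scoring tokens, so every expert
    processes the same number of tokens by construction.
    """
    n_tokens, n_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens), key=lambda t: scores[t][e], reverse=True)
        assignment[e] = sorted(ranked[:capacity])
    return assignment

rng = random.Random(0)
# A very negative noise projection gives a near-zero noise scale
perturbed = noisy_logits([1.0, 0.5, -0.2], [-50.0, -50.0, -50.0], rng)

scores = [[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]]
chosen = expert_choice(scores, capacity=2)
print(chosen)
```

With expert choice, a token may be picked by several experts or by none, which is the trade-off for perfectly even expert load.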

Load Balancing

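  • Auxiliary Loss: An extra training penalty that pushes the router to spread tokens evenly across experts
  • Router Z-Loss: Penalizes large router logits to keep the router's softmax numerically stable
  • Token Dropping: When an expert exceeds its capacity, the overflow tokens skip that expert and are passed on unchanged through the residual connection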
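
The following sketch shows one common form of each mechanism, under stated assumptions: a Switch-style auxiliary loss for top-1 routing (number of experts times the sum over experts of dispatch fraction times mean router probability), a z-loss defined as the mean squared log-sum-exp of the router logits, and capacity-based dropping with a hypothetical capacity_factor parameter. In training, each loss would be scaled by a small coefficient and added to the main loss; those coefficients are omitted here.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def load_balance_loss(router_logits, n_experts):
    """Switch-style auxiliary loss for top-1 routing; equals 1.0 when perfectly balanced."""
    n_tokens = len(router_logits)
    probs = [softmax(row) for row in router_logits]
    counts = [0] * n_experts
    for p in probs:
        counts[p.index(max(p))] += 1  # top-1 dispatch
    f = [c / n_tokens for c in counts]                                 # fraction of tokens per expert
    P = [sum(p[i] for p in probs) / n_tokens for i in range(n_experts)]  # mean router probability
    return n_experts * sum(fi * pi for fi, pi in zip(f, P))

def router_z_loss(router_logits):
    """Mean squared log-sum-exp of the logits; grows when logits become large."""
    total = 0.0
    for row in router_logits:
        m = max(row)
        lse = m + math.log(sum(math.exp(x - m) for x in row))
        total += lse ** 2
    return total / len(router_logits)

def dispatch_with_capacity(expert_ids, n_experts, capacity_factor=1.0):
    """Keep tokens in arrival order until an expert is full; the rest are dropped."""
    capacity = math.ceil(len(expert_ids) / n_experts * capacity_factor)
    load = [0] * n_experts
    kept, dropped = [], []
    for t, e in enumerate(expert_ids):
        if load[e] < capacity:
            load[e] += 1
            kept.append(t)
        else:
            dropped.append(t)
    return kept, dropped

balanced = load_balance_loss([[5.0, 0.0], [0.0, 5.0]], n_experts=2)
collapsed = load_balance_loss([[5.0, 0.0], [5.0, 0.0]], n_experts=2)
kept, dropped = dispatch_with_capacity([0, 0, 0, 1], n_experts=2)
print(balanced, collapsed, kept, dropped)
```

The auxiliary loss is lowest when tokens are spread evenly and rises as routing collapses onto one expert. In the dispatch example, three tokens ask for expert 0 but its capacity is two, so the third is dropped.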

Use Cases

  • Large Language Models (LLMs)
  • Multilingual translation systems
  • Computer vision
  • Recommendation systems
  • Scientific computing
