Research, June 8, 2026
Mixture-of-Experts (MoE) — Expert Model Collaboration
Mixture-of-Experts (MoE) is a model architecture that lets large language models keep a high total parameter count while keeping the compute cost per token low. It does this by activating only a few "expert" sub-networks for each input.
What is Mixture-of-Experts (MoE)?
Mixture-of-Experts (MoE) is a model architecture used in artificial intelligence and machine learning. The core idea is to have multiple "expert" sub-networks working together, where only the most relevant experts are selected and activated for each input.
How Does It Work?
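In the MoE architecture, a gating network (also called a router) scores each input and determines which experts process it. In transformer language models, the experts typically replace the feed-forward block, and routing happens per token. The process involves: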
- Input Analysis: The gating network analyzes the incoming data
- Expert Selection: Top-K selection identifies the most suitable experts (usually K=1 or K=2)
- Weighted Combination: The selected experts' outputs are combined as a weighted sum, using the gate scores as weights
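The three steps above can be sketched in a few lines of plain Python. This is a minimal illustration, not any particular model's implementation: the function names (`top_k_route`, `moe_forward`) and the toy experts are hypothetical, and the router logits are passed in directly, whereas in a real model a learned layer computes them from the input. Renormalizing with a softmax over only the selected logits is one common choice for the weights.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(router_logits, k=2):
    """Return [(expert_index, weight), ...] for the k highest-scoring experts."""
    # Expert selection: indices of the k largest router logits
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    # Weights: softmax over the selected logits only, so they sum to 1
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, router_logits, experts, k=2):
    """Weighted sum of the selected experts' outputs; other experts never run."""
    out = [0.0] * len(x)
    for idx, w in top_k_route(router_logits, k):
        y = experts[idx](x)
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts (assumption): each one just scales the input by a fixed factor
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
logits = [0.1, 2.0, -1.0, 2.0]   # stand-in for the gating network's output

routes = top_k_route(logits, k=2)
y = moe_forward([1.0, -1.0], logits, experts, k=2)
print(routes)
print(y)
```

With K=2, only two of the four toy experts are evaluated for this input, which is where the compute saving comes from.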
Advantages
- Computational Efficiency: Only a fraction of the parameters run for each token. Mixtral 8x7B, for example, has 46.7B total parameters but uses only about 12.9B per token
- Scalability: Increases parameters without proportionally increasing compute cost
- Specialization: Each expert can specialize in different data types
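- Low Latency: Sparse activation reduces per-token compute, so inference is faster than in a dense model of the same total size (all experts must still fit in memory)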
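To make the efficiency claim concrete, the sketch below works backwards from Mixtral 8x7B's published figures (46.7B total parameters, about 12.9B used per token, 8 experts with 2 active). It assumes a deliberately simplified model in which the total is a shared part plus N equal experts, and the active count is the shared part plus K experts. The helper name `split_params` is hypothetical, and the resulting split is a rough estimate rather than an official breakdown.

```python
def split_params(total_b, active_b, n_experts, k_active):
    """Solve  total  = shared + n_experts * expert
              active = shared + k_active  * expert
    for (shared, expert), all in billions of parameters."""
    expert = (total_b - active_b) / (n_experts - k_active)
    shared = total_b - n_experts * expert
    return shared, expert

# Published Mixtral 8x7B figures; the two-equation model is an assumption
shared_b, expert_b = split_params(total_b=46.7, active_b=12.9,
                                  n_experts=8, k_active=2)
active_fraction = 12.9 / 46.7

print(f"shared ~ {shared_b:.2f}B, per expert ~ {expert_b:.2f}B")
print(f"active fraction ~ {active_fraction:.1%}")
```

Under these assumptions, roughly a quarter of the parameters take part in any single token's forward pass, even though all of them must be stored.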
Popular MoE Models
- Mixtral 8x7B (Mistral AI): 46.7B params, 8 experts (2 active)
- DeepSeek-MoE (DeepSeek): 16.4B params, 64 routed experts (6 active) plus 2 shared experts that are always active
- Switch Transformer (Google): 1.6T params, 2048 experts (1 active) in its largest configuration, Switch-C
- GLaM (Google): 1.2T params, 64 experts (2 active)
- Mixtral 8x22B (Mistral AI): 141B params, 8 experts (2 active)
Technical Details
Gating Architectures
- Top-K Gating: Selects K experts with highest scores
- Noisy Top-K Gating: Adds tunable Gaussian noise to the router scores during training, which encourages exploration and more even expert usage
- Expert Choice Routing: Each expert selects its top-scoring tokens, instead of each token selecting its experts
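Expert choice routing reverses the direction of selection, which is easiest to see in code. The sketch below is a toy illustration under stated assumptions: `expert_choice` is a hypothetical helper, the affinity scores are made up, and `capacity` (how many tokens each expert takes) is fixed by hand rather than derived from a capacity factor.

```python
def expert_choice(scores, capacity):
    """scores[t][e] is the affinity of token t for expert e.
    Each expert picks its `capacity` highest-scoring tokens."""
    n_tokens, n_experts = len(scores), len(scores[0])
    assignment = {}
    for e in range(n_experts):
        ranked = sorted(range(n_tokens),
                        key=lambda t: scores[t][e], reverse=True)
        assignment[e] = ranked[:capacity]
    return assignment

# Toy token-by-expert affinity scores (assumption): 4 tokens, 2 experts
scores = [
    [0.9, 0.6],  # token 0
    [0.8, 0.2],  # token 1
    [0.1, 0.3],  # token 2
    [0.2, 0.7],  # token 3
]
chosen = expert_choice(scores, capacity=2)
print(chosen)
```

Every expert receives exactly `capacity` tokens, so the load is balanced by construction. The trade-off shows up in the toy data: token 0 is picked by both experts, while token 2 is picked by none.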
Load Balancing
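- Auxiliary Loss: An extra loss term that penalizes uneven token distribution across experts
- Router Z-Loss: Penalizes large router logits to keep training numerically stable
- Token Dropping: When an expert exceeds its capacity, the overflow tokens skip that expert and typically pass through the residual connection unchanged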
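The three mechanisms above can be sketched as follows. The auxiliary loss follows the form used in the Switch Transformer paper (alpha times the number of experts times the sum over experts of f_i * P_i, where f_i is the fraction of tokens sent to expert i and P_i is the mean router probability for it), and the z-loss is the mean squared log-sum-exp of the router logits. The function names, the `alpha` value, the toy logits, and the first-come-first-served dropping order are assumptions for illustration.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def load_balance_loss(logits_per_token, alpha=0.01):
    """alpha * N * sum_i f_i * P_i with top-1 routing."""
    n_tokens, n_experts = len(logits_per_token), len(logits_per_token[0])
    probs = [softmax(l) for l in logits_per_token]
    counts = [0] * n_experts
    for p in probs:
        counts[p.index(max(p))] += 1          # top-1 expert for this token
    f = [c / n_tokens for c in counts]        # fraction of tokens per expert
    P = [sum(p[e] for p in probs) / n_tokens for e in range(n_experts)]
    return alpha * n_experts * sum(fi * pi for fi, pi in zip(f, P))

def router_z_loss(logits_per_token):
    """Mean squared log-sum-exp of the router logits; grows with logit size."""
    return sum(logsumexp(l) ** 2 for l in logits_per_token) / len(logits_per_token)

def dispatch_with_capacity(expert_ids, n_experts, capacity_factor=1.0):
    """Keep tokens until an expert is full; later tokens for it are dropped."""
    capacity = int(len(expert_ids) / n_experts * capacity_factor)
    load = [0] * n_experts
    kept, dropped = [], []
    for t, e in enumerate(expert_ids):
        if load[e] < capacity:
            load[e] += 1
            kept.append(t)
        else:
            dropped.append(t)
    return kept, dropped

balanced = load_balance_loss([[1, 0], [0, 1], [1, 0], [0, 1]])
skewed = load_balance_loss([[1, 0]] * 4)
z_small = router_z_loss([[0.0, 0.0]])
z_large = router_z_loss([[10.0, 10.0]])
kept, dropped = dispatch_with_capacity([0, 0, 0, 1], n_experts=2)
print(balanced, skewed, z_small, z_large, kept, dropped)
```

The auxiliary loss is lowest when tokens are spread evenly and rises when one expert takes everything, while the z-loss rises as logits grow even if the routing decision is unchanged. Raising `capacity_factor` above 1.0 leaves headroom so that fewer tokens are dropped, at the cost of extra compute and memory.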
Use Cases
- Large Language Models (LLMs)
- Multilingual translation systems
- Computer vision
- Recommendation systems
- Scientific computing
Resources
- Mixtral: https://mistral.ai/news/mixtral-of-experts/
- Switch Transformer: https://arxiv.org/abs/2101.03961
- GLaM: https://arxiv.org/abs/2112.06905