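[Reading Notes on "DeepSeek 核心技术揭秘" (DeepSeek Core Technology Revealed)] Learning and Reflections on the Mixture-of-Experts (MoE) Model, Part 2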

2025-8-23 17:00:33 1598 DeepSeek

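Each expert network is required to produce the entire output on its own. This makes the experts largely independent: an expert's parameter updates no longer have to account for what the other experts are doing. More importantly, when training with this loss function, an expert whose error is below the weighted average error of all experts has its gating weight increased, and an expert whose error is above that weighted average has its gating weight decreased. In a model trained with this loss, the experts therefore compete rather than cooperate. It is this "compete for the job" pattern that produces the dynamic-loading effect.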
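The competition claim can be checked numerically. As a minimal sketch, assume the per-sample loss is the gate-weighted sum of per-expert errors, L = sum_i p_i * e_i with p = softmax(z). Differentiating gives dL/dz_j = p_j * (e_j - sum_i p_i * e_i), so gradient descent pushes a gate logit up exactly when that expert's error is below the weighted average. The function names and the error values below are illustrative assumptions, not taken from the book.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def weighted_loss(logits, expert_errors):
    """L = sum_i p_i * e_i, with p = softmax(logits)."""
    probs = softmax(logits)
    return sum(p * e for p, e in zip(probs, expert_errors))

def gate_logit_gradients(logits, expert_errors):
    """Analytic gradient dL/dz_j = p_j * (e_j - sum_i p_i * e_i)."""
    probs = softmax(logits)
    avg = sum(p * e for p, e in zip(probs, expert_errors))
    return [p * (e - avg) for p, e in zip(probs, expert_errors)]

# Hypothetical per-expert squared errors for one sample.
expert_errors = [0.2, 1.0, 3.0]
logits = [0.0, 0.0, 0.0]  # the gate starts out uniform

probs = softmax(logits)
avg_error = sum(p * e for p, e in zip(probs, expert_errors))
grads = gate_logit_gradients(logits, expert_errors)

# One gradient-descent step on the gate logits.
lr = 0.5
new_probs = softmax([z - lr * g for z, g in zip(logits, grads)])

for i, (err, p0, p1) in enumerate(zip(expert_errors, probs, new_probs)):
    relation = "below" if err < avg_error else "above"
    print(f"expert {i}: error {err} is {relation} the average {avg_error:.2f}, "
          f"gate weight {p0:.3f} -> {p1:.3f}")
```

In this example the two experts with below-average error gain gate weight after one step and the worst expert loses it, which is the "compete for the job" behavior described above.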
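Each expert computes its loss independently, which encourages each data sample to be handled by a single expert as far as possible. This structure not only makes the model more efficient, it also lets the model activate only a subset of the experts at inference time, which greatly reduces the compute required. It is like the pilgrim team in Journey to the West: Tang Seng has the reputation and the standing, so he does the talking on social occasions; Sun Wukong excels at subduing demons, so he is sent out whenever a monster appears; Sha Wujing works hard without complaint, so the dirty and heavy work goes to him; and Zhu Bajie, gluttonous and lazy, keeps the team's spirits up. This is the effect that module decoupling aims for.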
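Activating only some experts at inference can be sketched as top-k routing: score all experts with the cheap gate, run only the k highest-scoring ones, and renormalize their weights. This is an illustrative sketch with toy scalar experts, written in plain Python so it runs without any framework. The names `top_k_route`, `sparse_forward`, and `make_expert` are assumptions made here, and this is not necessarily the routing scheme DeepSeek uses.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_route(gate_scores, k):
    """Return (indices, renormalized weights) of the k highest-scoring experts."""
    ranked = sorted(range(len(gate_scores)), key=lambda i: gate_scores[i], reverse=True)
    chosen = ranked[:k]
    weights = softmax([gate_scores[i] for i in chosen])
    return chosen, weights

def sparse_forward(x, experts, gate_scores, k):
    """Evaluate only the selected experts and blend their outputs."""
    chosen, weights = top_k_route(gate_scores, k)
    output = sum(w * experts[i](x) for i, w in zip(chosen, weights))
    return output, chosen

# Toy scalar "experts"; `calls` records which ones actually ran.
calls = []

def make_expert(index, scale):
    def expert(x):
        calls.append(index)
        return scale * x
    return expert

experts = [make_expert(i, s) for i, s in enumerate([1.0, 2.0, 3.0, 4.0])]
gate_scores = [0.1, 2.0, -1.0, 1.5]  # hypothetical gate logits for one input

output, chosen = sparse_forward(x=10.0, experts=experts, gate_scores=gate_scores, k=2)
print("experts run:", sorted(calls), "output:", round(output, 3))
```

With four experts and k=2, only two expert functions are ever called, so the expert compute per input is roughly halved while the gate still sees every expert's score.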

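There are two formulas here; let's start with how to read the first. It measures the error on training case c as the squared distance between the target d^c and the gate-weighted blend of the expert outputs:

$$E^c = \left\| d^c - \sum_i p_i^c\, o_i^c \right\|^2$$

An open-source project has code that corresponds to it: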

import torch
import torch.nn as nn
import torch.nn.functional as F

class NaiveMoELayer(nn.Module):
    def __init__(self, input_size, output_size, num_experts):
        super().__init__()
        self.num_experts = num_experts
        # Define several expert networks (plain linear layers stand in for them here)
        self.experts = nn.ModuleList([
            nn.Linear(input_size, output_size) for _ in range(num_experts)
        ])
        # Define the gating network, which outputs one weight per expert
        self.gate = nn.Linear(input_size, num_experts)

    def forward(self, x):
        # 1. The gating network computes the weights: [batch_size, num_experts]
        gating_weights = F.softmax(self.gate(x), dim=-1)  # this is p_i^c in the formula

        # 2. Compute every expert's output
        expert_outputs = []
        for expert in self.experts:
            expert_outputs.append(expert(x))  # each expert outputs [batch_size, output_size]
        # Stack the list into one tensor: [batch_size, num_experts, output_size]
        expert_outputs = torch.stack(expert_outputs, dim=1)

        # 3. Compute the final output: the weighted sum
        # einsum performs the weighted sum: 'b n, b n o -> b o'
        # This line corresponds directly to ∑_i (p_i^c * o_i^c) in the formula
        final_output = torch.einsum('bn, bno -> bo', gating_weights, expert_outputs)

        return final_output, gating_weights

# Usage example and loss computation
model = NaiveMoELayer(input_size=100, output_size=10, num_experts=4)
input_data = torch.randn(32, 100)  # batch_size=32
target = torch.randn(32, 10)  # d^c in the formula

output, gates = model(input_data)
# This is E^c = || d^c - ∑_i p_i^c o_i^c ||^2 from the formula;
# MSELoss averages it over the batch and the output elements
loss = nn.MSELoss()(output, target)
loss.backward()

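Now for the second formula, in which each expert's own squared error is weighted by its gate probability:

$$E^c = \sum_i p_i^c \left\| d^c - o_i^c \right\|^2$$

PyTorch code that corresponds to it:

Computing the weighted MSE loss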

import torch
import torch.nn as nn

def weighted_mse_loss(gating_weights, expert_outputs, targets):
    """
    Args:
        gating_weights (torch.Tensor): [batch_size, num_experts] expert weights
        expert_outputs (torch.Tensor): [batch_size, num_experts, output_dim] per-expert predictions
        targets (torch.Tensor): [batch_size, output_dim] ground-truth values
    Returns:
        torch.Tensor: the weighted MSE loss
    """
    # Compute the MSE of each expert
    mse_per_expert = (expert_outputs - targets.unsqueeze(1)) ** 2  # [batch, num_experts, output_dim]
    mse_per_expert = mse_per_expert.mean(dim=-1)  # mean over output_dim -> [batch, num_experts]

    # Weighted sum over the experts
    weighted_mse = (gating_weights * mse_per_expert).sum(dim=-1)  # [batch]
    return weighted_mse.mean()  # average over the batch

Using it in an MoE model

class MoE(nn.Module):
    def __init__(self, input_dim, output_dim, num_experts):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(input_dim, output_dim) for _ in range(num_experts)])
        self.gate = nn.Linear(input_dim, num_experts)

    def forward(self, x, targets=None):
        # The gating network computes the weights
        gating_weights = torch.softmax(self.gate(x), dim=-1)  # [batch, num_experts]

        # Per-expert predictions
        expert_outputs = torch.stack([expert(x) for expert in self.experts], dim=1)  # [batch, num_experts, output_dim]

        # Final prediction (gate-weighted average)
        output = (gating_weights.unsqueeze(-1) * expert_outputs).sum(dim=1)  # [batch, output_dim]

        # If targets are provided, compute the weighted MSE
        loss = None
        if targets is not None:
            loss = weighted_mse_loss(gating_weights, expert_outputs, targets)

        return output, loss
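
To see why the two losses behave differently, here is a small framework-free sketch with scalar outputs. It uses plain Python rather than PyTorch so it runs anywhere, and the names `blended_loss` and `competitive_loss` are chosen here for illustration. Two experts that are individually wrong but cancel each other out get zero loss under the first formula, while the second formula still penalizes each of them.

```python
def blended_loss(gates, expert_outputs, target):
    """First formula: squared error of the gate-weighted blend of outputs."""
    blend = sum(p * o for p, o in zip(gates, expert_outputs))
    return (target - blend) ** 2

def competitive_loss(gates, expert_outputs, target):
    """Second formula: gate-weighted sum of each expert's own squared error."""
    return sum(p * (target - o) ** 2 for p, o in zip(gates, expert_outputs))

target = 1.0
gates = [0.5, 0.5]

# Case 1: both experts are wrong, but their errors cancel in the blend.
cancelling = [0.0, 2.0]
# Case 2: one expert is exactly right, the other is wrong.
specialist = [1.0, 2.0]

for name, outputs in [("cancelling", cancelling), ("specialist", specialist)]:
    print(name,
          "blended:", blended_loss(gates, outputs, target),
          "competitive:", competitive_loss(gates, outputs, target))
```

Because squaring is convex, the second loss is never smaller than the first for the same gate weights. It reaches zero only when every expert that carries nonzero gate weight matches the target by itself, which is why it rewards specialization rather than mutual compensation.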