计算向量场的散度：PyTorch 实现与技巧

在许多机器学习任务中，例如像 score matching 这样的生成模型训练方法中，我们经常需要计算模型输出（例如 score function $s_\theta(\boldsymbol x)$ ）的散度 $\nabla_\boldsymbol x \cdot s_\theta(\boldsymbol x)$ 。这一项通常作为损失函数的一部分，例如在 Denoising Score Matching (DSM) 或 Sliced Score Matching (SSM) 中，它帮助模型学习数据的内在结构。

直接精确计算高维向量场的散度，特别是当 $s_\theta(\boldsymbol x)$ 是一个复杂的神经网络时，可能非常耗费计算资源。因为它涉及到计算雅可比矩阵（Jacobian matrix）并对其对角线元素求和。因此，我们需要一些有效的方法来计算或估计这个散度项。

本文将探讨在 PyTorch 中处理这一问题的几种策略，从直接的数值方法到更高级的随机估计技术。

Preliminary

回忆一下，对于一个 $n$ 维向量场 $\mathbf{F}(\boldsymbol x) = (F_1(\boldsymbol x), F_2(\boldsymbol x), \dots, F_n(\boldsymbol x))$ ，其中 $\boldsymbol x = (x_1, x_2, \dots, x_n)$ ，其散度定义为：

\text{div}\mathbf{F} =\nabla \cdot \mathbf{F} = \sum_{i=1}^{n} \frac{\partial F_i}{\partial x_i}

散度衡量了向量场在某一点的“源”（正散度）或“汇”（负散度）的强度。在物理学中，它描述了场的通量密度；在机器学习中，它可以作为正则项或目标函数的一部分。

PyTorch 中的实现方法

根据向量场的形式（是离散数据网格还是函数输出），我们可以选择不同的计算方法。

方法1：使用 `torch.gradient` 计算离散场散度

若向量场是定义在一个规则网格上的离散数据（例如，来自物理模拟的输出或图像数据），PyTorch 1.8 版本以上提供的 torch.gradient 函数是一个方便的工具。它使用二阶中心差分（边界处为一阶）来数值计算梯度。

示例：计算三维空间中向量场的散度

假设我们有一个三维向量场 $\mathbf{F} = (F_x, F_y, F_z)$ ，其数据存储在一个形状为 (batch_size, 3, D_x, D_y, D_z) 的张量 field 中。这里 3 代表向量的三个分量 $F_x, F_y, F_z$ ，而 D_x, D_y, D_z 是三个空间维度的大小。

import torch

def compute_divergence_from_grid(field_tensor, spacing=1.0):
    """
    计算离散向量场的散度。
    field_tensor: 形状为 (B, N_dim, D1, D2, ..., DN_dim) 的张量。
                  例如，对于3D向量场，形状为 (B, 3, Dx, Dy, Dz)。
                  N_dim 是向量场的维度 (例如，3D向量场是3)。
                  D1, ..., DN_dim 是空间网格的维度。
    spacing: 网格点之间的间距，可以是单个标量或每个维度的间距元组。
    返回散度张量，形状为 (B, D1, D2, ..., DN_dim)。
    """
    if not isinstance(spacing, tuple):
        spacing = (spacing,) * (field_tensor.dim() - 2) # Assumes spatial dims start from index 2

    # 例如，对于 (B, 3, Dx, Dy, Dz) 的3D向量场:
    # F_x 是 field_tensor[:, 0], shape (B, Dx, Dy, Dz)
    # F_y 是 field_tensor[:, 1], shape (B, Dx, Dy, Dz)
    # F_z 是 field_tensor[:, 2], shape (B, Dx, Dy, Dz)

    # 注意：torch.gradient 的 dim 参数指的是输入到该函数的张量的维度。
    # 例如，fx = field_tensor[:, 0] (shape B, Dx, Dy, Dz)
    # dfx_dx (对第一个空间维度 Dx 求导) 使用 dim=1 (0-indexed for Dx)
    # dfy_dy (对第二个空间维度 Dy 求导) 使用 dim=2 (0-indexed for Dy)
    # dfz_dz (对第三个空间维度 Dz 求导) 使用 dim=3 (0-indexed for Dz)

    # 假设 field_tensor 的维度顺序是 [batch, component, spatial_dim_1, spatial_dim_2, ...]
    # 我们需要对每个分量 F_i 求关于其对应空间维度 x_i 的偏导数。
    
    divergence = torch.zeros_like(field_tensor[:, 0]) # Shape (B, D1, D2, ...)
    
    # Generalizing for N_dim vector field on N_dim grid
    # Assumes field_tensor[:, i] is F_i, and we want dF_i / d(spatial_dim_i)
    # Spatial dimensions in field_tensor[:, i] start from its dim 1.
    for i in range(field_tensor.shape[1]): # Iterate over vector components
        # field_tensor[:, i] is the i-th component, e.g., F_x, F_y, F_z
        # We take its gradient along the (i+1)-th dimension of this sliced tensor
        # (which corresponds to the (i+2)-th dimension of the original field_tensor)
        # e.g., for F_x (i=0), take gradient along its dim 1 (Dx)
        # for F_y (i=1), take gradient along its dim 2 (Dy)
        # for F_z (i=2), take gradient along its dim 3 (Dz)
        
        # Check if current component's spatial dimension exists
        # The spatial dimension for component `i` is `i+1` in the sliced tensor `field_tensor[:, i]`
        if field_tensor.dim() > 2 + i : # 2 accounts for batch and component dims
             # The i-th component (e.g., Fx) is field_tensor[:, i]
             # Its corresponding spatial dimension is the (i+1)-th dimension of this sub-tensor
            partial_derivative = torch.gradient(field_tensor[:, i], 
                                                spacing=spacing[i] if isinstance(spacing, tuple) and len(spacing) > i else spacing, 
                                                dim=i+1)[0]
            divergence += partial_derivative
        else:
            # This case might occur if N_dim (number of components) is larger than
            # the number of spatial dimensions in the grid, which is unusual for divergence.
            # Or if spacing tuple is not correctly sized.
            print(f"Warning: Component {i} does not have a corresponding spatial dimension for gradient calculation or spacing issue.")


    return divergence

# 示例输入 (3D 向量场)
batch_size, Dx, Dy, Dz = 2, 10, 10, 10
example_field = torch.randn(batch_size, 3, Dx, Dy, Dz) # B, 3 components, Dx, Dy, Dz grid
dx, dy, dz = 0.1, 0.1, 0.1 # Spacing for each dimension

div_grid = compute_divergence_from_grid(example_field, spacing=(dx, dy, dz))
# print(f"Divergence from grid shape: {div_grid.shape}") # Expected: (B, Dx, Dy, Dz)

适用场景：当向量场是预先计算好的、存储在网格上的数据时，此方法非常直接。例如，在图像处理中计算光流场的散度，或在流体力学模拟中分析速度场

方法2：使用 `torch.autograd` 和 Hutchinson 随机估计器 (针对模型输出)

在 score matching 等场景中，向量场 $s_\theta(\boldsymbol x)$ 通常是由一个神经网络（参数为 $\theta$ ）针对输入 $\boldsymbol x$ 计算得到的。这里， $\boldsymbol x$ 通常是一个扁平化的向量 (e.g., $R^D$ )，而 $s_\theta(\boldsymbol x)$ 的输出也具有相同的维度 $R^D$ 。我们感兴趣的是 $\text{Tr}(\boldsymbol{J}_{\boldsymbol x} s_\theta(\boldsymbol x))$ ，即 $s_\theta(\boldsymbol x)$ 关于 $\boldsymbol x$ 的雅可比矩阵的迹。

直接计算完整的雅可比矩阵然后取迹，对于高维输入（例如图像）来说代价高昂。Hutchinson 随机迹估计器提供了一个可行的解决方案：

对于一个矩阵 $\boldsymbol{A}$ ，其迹 $\text{Tr}(\boldsymbol{A})$ 可以被估计为 $\mathbb{E}_{\boldsymbol{\epsilon} \sim p(\boldsymbol{\epsilon})}[\boldsymbol{\epsilon}^T \boldsymbol{A} \boldsymbol{\epsilon}]$ ，其中 $\boldsymbol{\epsilon}$ 是一个随机向量，其分量独立同分布，均值为0，方差为1（例如，来自标准正态分布或 Rademacher 分布）。
在实践中，我们通常用少量样本（甚至一个样本）的均值来近似这个期望。

对于散度 $\nabla_{\boldsymbol x} \cdot s_\theta(\boldsymbol x) = \text{Tr}(\boldsymbol{J}_{\boldsymbol x} s_\theta(\boldsymbol x))$ ，Hutchinson 估计的关键在于高效计算 $\boldsymbol{\epsilon}^T (\boldsymbol{J}_{\boldsymbol x} s_\theta(\boldsymbol x)) \boldsymbol{\epsilon}$ 。这可以通过两次向量-雅可比积 (vector-Jacobian products, VJPs) 或者一次雅可比-向量积 (Jacobian-vector product, JVP) 和一次点积来完成。一个常用的技巧是利用 $\boldsymbol{\epsilon}^T \boldsymbol{J} \boldsymbol{\epsilon} = \boldsymbol{\epsilon}^T \nabla_{\boldsymbol x} (s_\theta(\boldsymbol x)^T \boldsymbol{\epsilon}_{\text{fixed}})$ , 其中 $\boldsymbol{\epsilon}_{\text{fixed}}$ 在求导时被视为常数。

import torch

def hutchinson_divergence_estimator(model, x, num_samples=1):
    """
    使用Hutchinson随机估计器估计模型输出关于其输入的散度 Tr(J_x model(x)).
    model: 一个可调用对象 (例如 nn.Module), 输入 x, 输出与 x 形状相同的张量.
    x: 输入张量, shape (B, D_features). 需要 x.requires_grad = True.
    num_samples: 用于估计的随机向量样本数.
    返回散度估计值, shape (B,).
    """
    batch_size, feature_dim = x.shape
    total_div_estimate = torch.zeros(batch_size, device=x.device, dtype=x.dtype)

    if not x.requires_grad:
        # 为了计算梯度，x 必须允许梯度传播
        # 调用者应确保 x.requires_grad is True
        # 如果模型内部修改了 x 的 requires_grad 状态，可能会有问题
        # 这里我们临时设置，但更好的做法是在外部处理
        x.requires_grad_(True) 

    for _ in range(num_samples):
        # epsilon 的形状应与 model(x) 的输出相同
        epsilon = torch.randn_like(x) # (B, D_features)
        
        # 确保 epsilon 在计算 s_dot_v 的梯度时不参与求导
        epsilon_fixed = epsilon.detach()

        model_output = model(x) # (B, D_features)

        # 1. 计算 s_theta(x)^T epsilon_fixed
        # model_output is (B, D), epsilon_fixed is (B, D)
        s_dot_epsilon = (model_output * epsilon_fixed).sum(dim=-1) # Shape (B,)

        # 2. 计算 nabla_x (s_theta(x)^T epsilon_fixed)
        # 这是 (B,) 张量关于 (B,D) 张量的梯度.
        grad_s_dot_epsilon = torch.autograd.grad(
            outputs=s_dot_epsilon,
            inputs=x,
            grad_outputs=torch.ones_like(s_dot_epsilon), # 为批次中的每个样本提供梯度
            create_graph=True, # 对于score matching等需要二阶导数或将此项作为loss一部分的场景至关重要
            retain_graph=True  # 如果 x 或 model 在计算图中被多次使用
        )[0] # Shape (B, D_features)
        # grad_s_dot_epsilon 现在是 J^T epsilon

        # 3. 计算 epsilon^T (nabla_x (s_theta(x)^T epsilon_fixed))
        # 即 epsilon^T (J^T epsilon) --- 这不正确，应该是 epsilon^T J epsilon
        # grad_s_dot_epsilon 是 J^T epsilon.
        # 我们需要的是 epsilon^T J epsilon.
        # 实际上，Yang Song 等人的论文中使用的形式是 v^T nabla_x (s(x)^T v),
        # 其中 nabla_x (s(x)^T v) 是 (J^T v).
        # 所以 v^T (J^T v) 是我们计算的。这似乎是标准做法。
        # 另一种解释是，我们计算的是 sum_i v_i * (nabla_x (s(x)^T v))_i
        # (nabla_x (s(x)^T v))_k = d/dx_k (sum_j s_j v_j) = sum_j (ds_j/dx_k) v_j
        # 所以 v^T nabla_x (s(x)^T v) = sum_k v_k (sum_j (ds_j/dx_k) v_j)
        # 这正是 epsilon^T J epsilon (如果 J 是对称的) 或者 epsilon^T J^T epsilon (一般情况).
        # 对于迹估计 Tr(J) = E[eps^T J eps], 这个形式是正确的。

        div_sample = (grad_s_dot_epsilon * epsilon).sum(dim=-1) # Shape (B,)
        total_div_estimate += div_sample
        
    return total_div_estimate / num_samples

# 示例：定义一个简单的模型 (例如，一个线性层)
# class SimpleModel(torch.nn.Module):
#     def __init__(self, dim):
#         super().__init__()
#         self.linear = torch.nn.Linear(dim, dim)
#     def forward(self, x):
#         return self.linear(x)

# feature_dim = 5
# model = SimpleModel(feature_dim)
# input_x = torch.randn(batch_size, feature_dim, requires_grad=True)

# div_hutchinson = hutchinson_divergence_estimator(model, input_x, num_samples=10)
# print(f"Hutchinson divergence estimate (batch): {div_hutchinson}")

# 如果需要精确散度（代价高昂，仅用于低维验证）：
def exact_divergence_auto(model, x):
    """计算精确散度 sum_i d(model(x)_i)/d(x_i)"""
    if not x.requires_grad:
        x.requires_grad_(True)
    
    model_output = model(x)
    B, D = x.shape
    div = torch.zeros(B, device=x.device, dtype=x.dtype)
    for i in range(D):
        # 计算 d(model_output[:, i]) / d x
        # 然后取对角线元素 d(model_output[:, i]) / d x[:, i]
        grad_outputs = torch.zeros_like(model_output)
        grad_outputs[:, i] = 1.0
        
        grads_i_th_output_row_of_jacobian = torch.autograd.grad(
            outputs=model_output, # d(model_output)/dx
            inputs=x,
            grad_outputs=grad_outputs, # selects i-th component of model_output
            create_graph=True,
            retain_graph=True
        )[0] # Shape (B, D), this is (d model_output_i / d x_0, ..., d model_output_i / d x_{D-1})
        div += grads_i_th_output_row_of_jacobian[:, i] # Accumulate J_ii
    return div

# div_exact = exact_divergence_auto(model, input_x)
# print(f"Exact divergence (batch): {div_exact}")

适用场景：当向量场是由神经网络等可微函数定义时，此方法是计算其散度的标准且高效的途径，尤其是在高维输入（如图像、点云）的 score matching 中。

作为损失项使用

最终计算得到的散度（或其估计）通常会作为正则项或目标函数的一部分加入到总损失中：

# 假设 other_loss 是模型的主要损失项
# div_value 是计算得到的散度 (例如，来自 hutchinson_divergence_estimator)
# lambda_div 是一个超参数，用于平衡散度项的权重

# 通常我们希望惩罚散度的绝对值或平方值
divergence_penalty = torch.mean(torch.abs(div_value)) 
# 或者 divergence_penalty = torch.mean(div_value**2)

total_loss = other_loss + lambda_div * divergence_penalty
total_loss.backward() # 进行反向传播

选择 torch.abs 还是 torch.square (或直接使用 div_value 如果其符号有意义) 取决于具体的应用和希望施加的约束。

选择上的考量

方法选择：
- 如果处理的是离散网格数据，torch.gradient 是自然的选择。确保理解其关于边界处理（默认二阶中心差分，边界一阶）和 spacing 参数的正确使用。
- 如果向量场是模型输出 $s_\theta(\boldsymbol x)$ ，且需要 $\text{Tr}(\boldsymbol{J}_{\boldsymbol x} s_\theta(\boldsymbol x))$ ，Hutchinson 估计器 (torch.autograd.grad 的巧妙运用) 是首选，尤其对于高维 $\boldsymbol x$ 。
requires_grad 和 create_graph：
- 当使用 torch.autograd.grad 计算散度（如 Hutchinson 方法）时，输入 x 必须设置 x.requires_grad_(True)。
- 如果散度项本身是损失函数的一部分，并且你需要通过它反向传播来更新模型参数 $\theta$ ，那么在 torch.autograd.grad 调用中设置 create_graph=True 至关重要。这将允许计算更高阶的导数。
Hutchinson 估计器的样本数 (num_samples)：
- 增加 num_samples 可以提高散度估计的准确性，但也会增加计算成本。在实践中，对于随机梯度下降类型的优化，通常使用 num_samples=1 就能取得不错的效果，因为多步迭代的随机性有助于平均掉噪声。
坐标系和维度对应：
- 使用 torch.gradient 时，务必确保 dim 参数正确指向你希望对其求导的空间维度。如 compute_divergence_from_grid 所示，当对张量进行切片后，维度索引会发生变化。
性能：
- 对于大规模数据或复杂模型，向量化操作和利用 GPU 加速是必要的。PyTorch 的设计天然支持这些。
- 精确计算雅可比矩阵的迹（如 exact_divergence_auto）通常非常慢，仅适用于低维调试或验证。

总结

在 PyTorch 中计算散度没有一键式的内置函数，但根据具体情况，我们可以有效地实现它：

对于离散网格数据，torch.gradient 提供了数值解法。
对于神经网络等模型输出的散度（常见于 score matching），基于 torch.autograd.grad 的 Hutchinson 随机估计器是强大且常用的工具。

Flow Matching 入门 (一)：ODE 与概率密度变换上一篇

div_cal 下一篇

计算向量场的散度：PyTorch 实现与技巧

计算向量场的散度：PyTorch 实现与技巧

Preliminary

PyTorch 中的实现方法

方法1：使用 torch.gradient 计算离散场散度

示例：计算三维空间中向量场的散度

方法2：使用 torch.autograd 和 Hutchinson 随机估计器 (针对模型输出)

作为损失项使用

选择上的考量

总结

方法1：使用 `torch.gradient` 计算离散场散度

方法2：使用 `torch.autograd` 和 Hutchinson 随机估计器 (针对模型输出)