# Implicit Differentiation

## Theory

At convergence, $x^*(\theta)$ satisfies the first-order optimality condition

\[\nabla_x f(x^*; \theta) = 0\]

on the optimal face. By the implicit function theorem:

\[\frac{\partial x^*}{\partial \theta} = -[\nabla^2_{xx} f]^{-1} \nabla^2_{x\theta} f\]

For the pullback (reverse mode), given cotangent $dx$:

\[d\theta = -\left(\frac{\partial \nabla_x f}{\partial \theta}\right)^\top u, \quad \text{where } \nabla^2_{xx} f \cdot u = dx\]
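The adjoint formula can be checked on a problem with a closed-form solution. The following is an illustrative sketch (plain `LinearAlgebra`, not Marguerite code) for $f(x, \theta) = \tfrac{1}{2} x^\top A x - \theta^\top x$, whose minimizer is $x^*(\theta) = A^{-1}\theta$:

```julia
using LinearAlgebra

# f(x, θ) = ½ xᵀA x − θᵀx, so ∇ₓf = A x − θ, ∇²ₓₓf = A, ∂θ∇ₓf = −I.
A = [2.0 0.5; 0.5 1.0]      # SPD Hessian
θ = [0.8, 0.2]
xstar = A \ θ               # stationarity: A * xstar − θ = 0

dx = [1.0, -1.0]            # incoming cotangent on x*
u = A \ dx                  # adjoint solve: ∇²ₓₓf · u = dx
dθ = u                      # dθ = −(∂θ∇ₓf)ᵀ u = −(−I)ᵀ u = u

# Agrees with the analytic Jacobian ∂x*/∂θ = A⁻¹ (symmetric here):
@assert dθ ≈ inv(A)' * dx
```

The same two-step pattern (one Hessian solve, one cross-derivative contraction) is what the pullback below performs for general $f$.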

## Implementation

The CG linear solve and cross-derivative computation use `DifferentiationInterface`:

| Operation | Map | Best mode | DI function |
| --- | --- | --- | --- |
| Gradient of $f$ (auto) | $\mathbb{R}^n \to \mathbb{R}$ | Forward | `DI.gradient!` |
| HVP in CG | JVP of $\mathbb{R}^n \to \mathbb{R}^n$ | Forward | `DI.hvp` |
| Cross-derivative | gradient of $\theta \mapsto \langle \nabla_x f, u \rangle$ | Forward or Reverse | `DI.gradient` |

The Hessian system is solved by conjugate gradient (CG) with Hessian-vector products, avoiding explicit Hessian construction. Tikhonov regularization $(\nabla^2_{xx} f + \lambda I)$ ensures well-conditioned systems near singular Hessians.
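A minimal matrix-free CG solve of the regularized system $(\nabla^2_{xx} f + \lambda I)\,u = dx$ can be sketched as follows. The `hvp` argument stands in for a Hessian-vector product oracle (here simulated with an explicit matrix; in Marguerite it would come from `DI.hvp`), and `cg_hvp` is an illustrative helper, not a Marguerite function:

```julia
using LinearAlgebra

# Conjugate gradient on v ↦ H v + λ v, using only Hessian-vector products.
function cg_hvp(hvp, b; λ=1e-4, maxiter=50, tol=1e-6)
    u = zero(b)
    r = b - (hvp(u) .+ λ .* u)   # initial residual
    p = copy(r)
    rs = dot(r, r)
    for _ in 1:maxiter
        Hp = hvp(p) .+ λ .* p    # one HVP per iteration; H never formed
        α = rs / dot(p, Hp)
        u .+= α .* p
        r .-= α .* Hp
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return u
end

H = [4.0 1.0; 1.0 3.0]
dx = [1.0, 2.0]
u = cg_hvp(v -> H * v, dx; λ=0.0)
@assert u ≈ H \ dx
```

The `λ`, `maxiter`, and `tol` knobs correspond to the `diff_lambda`, `diff_cg_maxiter`, and `diff_cg_tol` keywords described under "Tuning the linear solver" below.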

## KKT Adjoint for Constrained Solutions

When the solution $x^*$ lies on the boundary of $\mathcal{C}$ (i.e., some constraints are active), the unconstrained optimality condition $\nabla_x f = 0$ no longer holds. Marguerite automatically detects active constraints via `active_set` and solves the full KKT adjoint system:

\[\begin{bmatrix} \nabla^2_{xx} f & G^\top \\ G & 0 \end{bmatrix} \begin{bmatrix} u \\ \mu \end{bmatrix} = \begin{bmatrix} dx \\ 0 \end{bmatrix}\]

where $G$ is the matrix of active constraint normals. This is solved via a reduced-space approach:

  1. Partition variables into bound (pinned to constraint boundaries) and free
  2. Project $dx_{\text{free}}$ onto the null space of equality constraint normals
  3. CG solve in the reduced space:

    \[(P H_{\text{free}} P + \lambda I)\, w = P dx_{\text{free}}\]

  4. Recover multipliers $\mu$ from the KKT residual

For interior solutions (no active constraints), this reduces to the unconstrained Hessian solve described above.
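The reduced-space steps above can be sketched numerically against a dense KKT solve. This is illustrative only (plain `LinearAlgebra`, a direct solve in place of CG, and not Marguerite's internals); the tiny Tikhonov term keeps the projected system invertible:

```julia
using LinearAlgebra

H = [3.0 1.0 0.0; 1.0 2.0 0.5; 0.0 0.5 1.5]   # free-variable Hessian
G = [1.0 1.0 1.0]                              # active-face normal (sum fixed)
dx = [1.0, -0.5, 0.25]                         # incoming cotangent

P = I - G' * ((G * G') \ G)         # orthogonal projector onto null(G)
w = (P * H * P + 1e-10 * I) \ (P * dx)
u = P * w                           # adjoint primal solution, satisfies G*u = 0
μ = (G * G') \ (G * (dx - H * u))   # multipliers from the KKT residual

# Verify against the dense KKT system [H Gᵀ; G 0][u; μ] = [dx; 0]:
K = [H G'; G zeros(1, 1)]
sol = K \ vcat(dx, 0.0)
@assert isapprox(u, sol[1:3]; atol=1e-6)
@assert isapprox(μ, sol[4:4]; atol=1e-6)
```

The projector plays the role of the null-space projection in step 2, and the last line of the sketch is step 4 (multiplier recovery).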

For custom oracles, Marguerite needs an `active_set` specialization to know whether the solution lies on a boundary face. If none is defined, differentiated calls fail by default with an actionable error. You can pass `assume_interior=true` to override this and use the interior approximation, but that is only appropriate when the solution is genuinely interior.

## Parametric Constraints

When using a ParametricOracle, the constraint set itself depends on $\theta$. The total gradient has two components:

\[d\theta = d\theta_{\text{obj}} + d\theta_{\text{constraint}}\]

For an active linear face $A(\theta)x = b(\theta)$, the reverse mode pullback is

\[d\theta = -u^\top \partial_\theta \nabla_x f - \lambda^\top (\partial_\theta A)\, u - \mu^\top \bigl((\partial_\theta A)\,x^* - \partial_\theta b\bigr),\]

where $u$ and $\mu$ come from the KKT adjoint solve above and $\lambda$ are the primal active-face multipliers recovered from stationarity.

For simple RHS-parametric constraints this reduces to $d\theta_{\text{constraint}} = \nabla_\theta(\mu^\top h(\theta))$. For constraints with $\theta$-dependent normals (for example `ParametricWeightedSimplex` with varying $\alpha(\theta)$), the $-\lambda^\top (\partial_\theta A)\,u$ term is also required.
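The RHS-parametric reduction can be checked on a tiny equality-constrained quadratic. The sketch below (illustrative, `LinearAlgebra` only) uses $\min_x \tfrac{1}{2}\|x - c\|^2$ subject to $\mathbf{1}^\top x = \theta$, whose solution $x^*(\theta) = c + (\theta - \mathbf{1}^\top c)/n \cdot \mathbf{1}$ gives the analytic pullback $d\theta = \tfrac{1}{n}\mathbf{1}^\top dx$:

```julia
using LinearAlgebra

n = 3
dx = [1.0, 2.0, -0.5]            # incoming cotangent on x*

# KKT adjoint with H = I and A = 1ᵀ: [I 1; 1ᵀ 0][u; μ] = [dx; 0]
K = [Matrix(1.0I, n, n) ones(n, 1); ones(1, n) zeros(1, 1)]
sol = K \ vcat(dx, 0.0)
μ = sol[n+1]

# RHS-parametric term: dθ = μᵀ ∂θ b, and ∂θ b = 1 here.
dθ = μ * 1.0

# Matches the analytic Jacobian ∂x*/∂θ = (1/n)·1:
@assert dθ ≈ sum(dx) / n
```

Since $\partial_\theta A = 0$ in this example, the $-\lambda^\top(\partial_\theta A)u$ term vanishes and only the multiplier term survives, exactly as the reduction states.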

## Usage

```julia
using Marguerite, LinearAlgebra

f(x, θ) = 0.5 * dot(x, x) - dot(θ, x)
∇f!(g, x, θ) = (g .= x .- θ)

θ = [0.8, 0.2]
x, result = solve(f, ProbSimplex(), [0.5, 0.5], θ; grad=∇f!)
```

The `ChainRulesCore.rrule` is defined on the 5-argument `solve` signatures (those accepting `θ`). The `rrule` pullback computes Hessian-vector products internally using `SECOND_ORDER_BACKEND` (forward-over-forward via `AutoForwardDiff`).

### AD backend selection

Marguerite uses `ForwardDiff` as the default backend. All AD goes through `DifferentiationInterface`, so you can override it with any DI-compatible backend:

```julia
import DifferentiationInterface as DI

x, result = solve(f, ProbSimplex(), [0.5, 0.5];
                  backend=DI.AutoForwardDiff())
```

### Tuning the linear solver

The implicit differentiation backward pass solves a linear system $(\nabla^2_{xx} f + \lambda I) u = dx$ on the reduced (active-face) space.

The `rrule` and `solution_jacobian` use a direct Cholesky factorization of the reduced Hessian (falling back to LU if it is not positive definite). The only tuning parameter is `diff_lambda`:

| Keyword | Default | Description |
| --- | --- | --- |
| `diff_lambda` | `1e-4` | Tikhonov regularization strength |
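The Cholesky-then-LU fallback pattern can be sketched as follows. This is illustrative (`factor_hessian` is a hypothetical helper, not a Marguerite function):

```julia
using LinearAlgebra

# Try Cholesky on the regularized Hessian; fall back to LU
# when the matrix is not positive definite.
function factor_hessian(H, λ)
    Hreg = Symmetric(H + λ * I)
    F = cholesky(Hreg; check=false)   # check=false: no throw on failure
    return issuccess(F) ? F : lu(Matrix(Hreg))
end

Hpd  = [2.0 0.5; 0.5 1.0]             # positive definite
Hind = [1.0 0.0; 0.0 -1.0]            # indefinite
@assert factor_hessian(Hpd, 1e-4) isa Cholesky
@assert factor_hessian(Hind, 1e-4) isa LU
```

Either factorization object supports `F \ dx`, so the downstream solve code is unchanged whichever path is taken.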

`bilevel_solve` and `bilevel_gradient` use iterative CG instead, and accept two additional keywords:

| Keyword | Default | Description |
| --- | --- | --- |
| `diff_cg_maxiter` | `50` | Maximum CG iterations |
| `diff_cg_tol` | `1e-6` | CG convergence tolerance on the residual norm |

All three can be passed to `solve` (the `θ`-accepting variants), `bilevel_solve`, `bilevel_gradient`, or directly to the `rrule`:

```julia
x, result = solve(f, lmo, x0, θ;
                  grad=∇f!, diff_lambda=1e-3)
```

## Bilevel optimization via `rrule`

For bilevel problems, call the `rrule` directly to get the pullback. The default backends handle everything automatically: `ForwardDiff` for gradients and `SECOND_ORDER_BACKEND` (forward-over-forward) for HVPs:

```julia
using ChainRulesCore: rrule

(x_star, result), pb = rrule(solve, f, lmo, x0, θ;
                             grad=∇f!, max_iters=5000)
```

The pullback accepts a tuple `(dx, result_tangent)`, where `dx` is the cotangent of the solution and `result_tangent` is typically `nothing`:

```julia
tangents = pb((dx, nothing))
# tangents = (NoTangent(), NoTangent(), NoTangent(), NoTangent(), dθ)
#             solve        f            lmo          x0           θ
```

Only `dθ` (the last element) is nonzero. The other entries are `NoTangent()` since `f`, `lmo`, and `x0` are not differentiated.

See Bilevel Optimization for a complete worked example with gradient descent on the outer problem.
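The shape of such an outer loop can be sketched without Marguerite by plugging the adjoint formula in directly. The inner problem below has a closed-form solution $x^*(\theta) = A^{-1}\theta$, so the whole bilevel loop is self-contained (illustrative only; a real setup would call `rrule`/`pb` as shown above):

```julia
using LinearAlgebra

# Inner: min_x ½xᵀAx − θᵀx  ⇒  x*(θ) = A⁻¹θ.
# Outer: min_θ L(θ) = ½‖x*(θ) − t‖², solved by gradient descent.
A = [2.0 0.3; 0.3 1.0]
t = [0.5, -0.2]
θ = zeros(2)
for _ in 1:200
    x = A \ θ             # inner solve
    dx = x - t            # outer cotangent ∂L/∂x*
    u = A \ dx            # adjoint solve ∇²ₓₓf · u = dx
    dθ = u                # ∂θ∇ₓf = −I  ⇒  dθ = u
    θ .-= 0.5 .* dθ       # outer gradient step
end
@assert isapprox(A \ θ, t; atol=1e-6)   # x*(θ) has reached the target
```

Each iteration is one inner solve plus one adjoint solve, which is exactly the cost profile of calling the pullback once per outer step.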

## Parametric oracle usage

When the constraint set depends on $\theta$, use a `ParametricOracle`:

```julia
using Marguerite, LinearAlgebra

f(x, θ) = 0.5 * dot(x, x) - dot(θ[1:2], x)
∇f!(g, x, θ) = (g .= x .- θ[1:2])

plmo = ParametricBox(θ -> fill(θ[3], 2), θ -> fill(θ[4], 2))
θ = [0.8, 0.2, 0.0, 1.0]
x, result = solve(f, plmo, [0.5, 0.5], θ; grad=∇f!, max_iters=5000, tol=1e-6)
```

The `rrule` for this signature computes $d\theta$ through both the objective and the constraint parameters via KKT adjoint differentiation.

## Full Jacobian

For computing the full $n \times m$ Jacobian $\partial x^* / \partial\theta$, use `solution_jacobian` or its in-place variant `solution_jacobian!`:

```julia
using Marguerite, LinearAlgebra

f(x, θ) = 0.5 * dot(x, x) - dot(θ, x)
∇f!(g, x, θ) = (g .= x .- θ)

θ = [0.8, 0.6, 0.4, 0.2, 0.1]
x0 = fill(0.2, 5)
J, result = solution_jacobian(f, ProbSimplex(), x0, θ; grad=∇f!)
```

This is much faster than computing $n$ separate pullback calls because it forms the reduced Hessian explicitly ($n_{\text{free}}$ HVPs), Cholesky-factors it once, and solves all $m$ right-hand sides in a single backsubstitution.
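The factor-once, solve-many idea can be sketched with a direct factorization (illustrative, not Marguerite internals). For $f(x, \theta) = \tfrac{1}{2}x^\top H x - \theta^\top x$ in the interior case, the cross term is $-\nabla^2_{x\theta} f = I$ and the Jacobian is $H^{-1}$:

```julia
using LinearAlgebra

H = [4.0 1.0 0.0; 1.0 3.0 0.5; 0.0 0.5 2.0]
λ = 1e-10
F = cholesky(Symmetric(H + λ * I))   # one factorization, reused below
B = Matrix(1.0I, 3, 3)               # m = 3 right-hand sides (−∇²ₓθ f)
J = F \ B                            # all m columns in a single solve

# J is (H + λI)⁻¹, i.e. the Jacobian ∂x*/∂θ up to the λ regularization:
@assert J * (H + λ * I) ≈ Matrix(1.0I, 3, 3)
```

Each extra right-hand side costs only a backsubstitution, which is why batching beats $n$ independent pullback calls.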

**Note:** `solution_jacobian` does not yet support `ParametricOracle` (constraint sensitivity is not included in the batched computation). For constraint-parametric problems, use the `rrule` pullback instead.

For repeated calls (e.g., inside an optimization loop), pre-allocate the output matrix and use `solution_jacobian!`:

```julia
J = zeros(5, 5)
J, result = solution_jacobian!(J, f, ProbSimplex(), x0, θ; grad=∇f!)
```

See `solution_jacobian` and `solution_jacobian!` in the API Reference.

## rrule

`ChainRulesCore.rrule` — Method

Implicit differentiation rule for `solve(f, lmo, x0, θ; ...)`.

Handles all `lmo` types (plain functions, `AbstractOracle`, `ParametricOracle`) and both manual and automatic gradients via the `grad=` keyword.

Uses `_kkt_adjoint_solve_cached` with a precomputed `_PullbackState` so that expensive setup (active-set identification, HVP preparation, constraint orthogonalization, buffer allocation) is performed once and amortized across multiple pullback calls (e.g. when computing a full Jacobian).

See Implicit Differentiation for the full mathematical derivation.
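The amortization pattern behind the cached pullback state can be sketched as a closure (illustrative only, not Marguerite's actual `_PullbackState`):

```julia
using LinearAlgebra

# Expensive setup (here: a factorization) runs once; the returned
# closure reuses it for every subsequent pullback call.
function make_pullback(H; λ=1e-4)
    F = cholesky(Symmetric(H + λ * I))   # done once, amortized
    return dx -> F \ dx                  # cheap per-call solve
end

H = [3.0 1.0; 1.0 2.0]
pb = make_pullback(H; λ=0.0)
@assert pb([1.0, 0.0]) ≈ H \ [1.0, 0.0]
@assert pb([0.0, 1.0]) ≈ H \ [0.0, 1.0]
```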
