# Implicit Differentiation

## Theory
At convergence, $x^*(\theta)$ satisfies the first-order optimality condition
\[\nabla_x f(x^*; \theta) = 0\]
on the optimal face. By the implicit function theorem:
\[\frac{\partial x^*}{\partial \theta} = -[\nabla^2_{xx} f]^{-1} \nabla^2_{x\theta} f\]
For the pullback (reverse mode), given cotangent $dx$:
\[d\theta = -\left(\frac{\partial \nabla_x f}{\partial \theta}\right)^\top u, \quad \text{where } \nabla^2_{xx} f \cdot u = dx\]
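These formulas can be checked numerically on a quadratic objective, where everything is available in closed form. This is an illustrative sketch independent of Marguerite; `A`, `θ`, and `dx` are made-up test data:

```julia
using LinearAlgebra

# Quadratic test problem: f(x; θ) = ½ xᵀA x − θᵀx, so ∇ₓf = A x − θ,
# ∇²ₓₓf = A, ∇²ₓθf = −I, and the minimizer is x*(θ) = A⁻¹θ.
A = [2.0 0.5; 0.5 1.0]   # symmetric positive definite
θ = [0.8, 0.2]
xstar = A \ θ            # solves ∇ₓf(x*; θ) = 0

# Implicit function theorem: ∂x*/∂θ = −[∇²ₓₓf]⁻¹ ∇²ₓθf = −A⁻¹(−I) = A⁻¹
J = -inv(A) * (-I(2))
@assert J ≈ inv(A)

# Reverse-mode pullback: solve ∇²ₓₓf · u = dx, then dθ = −(∂θ∇ₓf)ᵀu
dx = [1.0, 0.0]
u = A \ dx
dθ = -(-I(2))' * u       # equals u here, since ∂θ∇ₓf = −I
@assert dθ ≈ J' * dx     # the pullback agrees with Jᵀ dx
```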
## Implementation
The CG linear solve and cross-derivative computation use DifferentiationInterface:
| Operation | Map | Best mode | DI function |
|---|---|---|---|
| Gradient of $f$ (auto) | $\mathbb{R}^n \to \mathbb{R}$ | Forward | `DI.gradient!` |
| HVP in CG | JVP of $\mathbb{R}^n \to \mathbb{R}^n$ | Forward | `DI.hvp` |
| Cross-derivative | gradient of $\theta \mapsto \langle \nabla_x f, u \rangle$ | Forward or Reverse | `DI.gradient` |
The Hessian system is solved by conjugate gradient (CG) using only Hessian-vector products, so the Hessian is never formed explicitly. Tikhonov regularization, $(\nabla^2_{xx} f + \lambda I)$, keeps the system well conditioned even when the Hessian is near-singular.
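The matrix-free solve can be sketched in a few lines of plain Julia. Here `hvp` is a hypothetical stand-in for the DifferentiationInterface-powered Hessian-vector product, and the CG loop is a textbook implementation, not Marguerite's internal one:

```julia
using LinearAlgebra

# CG on the regularized system (H + λI)u = dx, touching H only through
# the matvec `hvp(v)`.
function cg_hvp(hvp, dx; λ=1e-4, maxiter=50, tol=1e-6)
    u = zero(dx)
    r = copy(dx)                 # residual at u = 0
    p = copy(r)
    rs = dot(r, r)
    for _ in 1:maxiter
        Ap = hvp(p) .+ λ .* p    # Tikhonov-regularized matvec
        α = rs / dot(p, Ap)
        u .+= α .* p
        r .-= α .* Ap
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return u
end

# Check against a dense solve on a small SPD matrix (made-up test data).
H = [4.0 1.0; 1.0 3.0]
u = cg_hvp(v -> H * v, [1.0, 2.0]; λ=0.0)
@assert H * u ≈ [1.0, 2.0]
```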
## KKT Adjoint for Constrained Solutions
When the solution $x^*$ lies on the boundary of $\mathcal{C}$ (i.e., some constraints are active), the unconstrained optimality condition $\nabla_x f = 0$ no longer holds. Marguerite automatically detects active constraints via `active_set` and solves the full KKT adjoint system:
\[\begin{bmatrix} \nabla^2_{xx} f & G^\top \\ G & 0 \end{bmatrix} \begin{bmatrix} u \\ \mu \end{bmatrix} = \begin{bmatrix} dx \\ 0 \end{bmatrix}\]
where $G$ is the matrix of active constraint normals. This is solved via a reduced-space approach:
- Partition variables into bound (pinned to constraint boundaries) and free
- Project $dx_{\text{free}}$ onto the null space of equality constraint normals
- CG solve in the reduced space:
\[(P H_{\text{free}} P + \lambda I)\, w = P dx_{\text{free}}\]
- Recover multipliers $\mu$ from the KKT residual
For interior solutions (no active constraints), this reduces to the unconstrained Hessian solve described above.
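The reduced-space recipe can be illustrated on a toy problem with a single active face $a^\top x = b$. This is an illustrative sketch; the names and shapes are assumptions, not Marguerite internals:

```julia
using LinearAlgebra

H    = [4.0 1.0; 1.0 3.0]        # Hessian on the free variables (test data)
a    = [1.0, 1.0]                # active-face normal (e.g. simplex: ∑xᵢ = 1)
dx   = [1.0, -2.0]               # incoming cotangent
λreg = 1e-8

# Orthogonal projector onto the null space of the active normal aᵀ.
P = I - a * a' / dot(a, a)

# Reduced-space solve: (P H P + λI) w = P dx, then u = P w is tangential.
w = (P * H * P + λreg * I) \ (P * dx)
u = P * w
@assert abs(dot(a, u)) < 1e-8    # u stays in the face: aᵀu = 0

# Recover the multiplier from the KKT residual: μ a ≈ dx − H u.
μ = dot(a, dx - H * u) / dot(a, a)
@assert isapprox(H * u + μ * a, dx; atol=1e-6)   # KKT system holds
```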
For custom oracles, Marguerite needs an `active_set` specialization to know whether the solution lies on a boundary face. If none is defined, differentiated calls fail by default with an actionable error. You can pass `assume_interior=true` to override this and use the interior approximation, but that is appropriate only when the solution is genuinely interior.
## Parametric Constraints
When using a `ParametricOracle`, the constraint set itself depends on $\theta$. The total gradient has two components:
\[d\theta = d\theta_{\text{obj}} + d\theta_{\text{constraint}}\]
For an active linear face $A(\theta)x = b(\theta)$, the reverse-mode pullback is
\[d\theta = -u^\top \partial_\theta \nabla_x f - \lambda^\top (\partial_\theta A)\, u - \mu^\top \bigl((\partial_\theta A)\,x^* - \partial_\theta b\bigr),\]
where $u$ and $\mu$ come from the KKT adjoint solve above and $\lambda$ are the primal active-face multipliers recovered from stationarity.
For simple RHS-parametric constraints this reduces to $d\theta_{\text{constraint}} = \nabla_\theta(\mu^\top h(\theta))$. For constraints with $\theta$-dependent normals (for example `ParametricWeightedSimplex` with varying $\alpha(\theta)$), the $-\lambda^\top (\partial_\theta A)\,u$ term is also required.
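As a sanity check of the RHS-parametric special case, set $\partial_\theta A = 0$ in the general pullback above: the $-\lambda^\top(\partial_\theta A)\,u$ and $-\mu^\top(\partial_\theta A)\,x^*$ terms vanish, leaving
\[d\theta = -u^\top \partial_\theta \nabla_x f + \mu^\top \partial_\theta b,\]
whose constraint part is exactly $\nabla_\theta(\mu^\top b(\theta))$ with $\mu$ treated as a constant (here $h \equiv b$).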
## Usage

```julia
using Marguerite, LinearAlgebra

f(x, θ) = 0.5 * dot(x, x) - dot(θ, x)
∇f!(g, x, θ) = (g .= x .- θ)
θ = [0.8, 0.2]
x, result = solve(f, ProbSimplex(), [0.5, 0.5], θ; grad=∇f!)
```

The `ChainRulesCore.rrule` is defined on the 5-argument `solve` signatures (those accepting `θ`). The `rrule` pullback computes Hessian-vector products internally using `SECOND_ORDER_BACKEND` (forward-over-forward via `AutoForwardDiff`).
### AD backend selection

Marguerite uses ForwardDiff as the default backend. All AD goes through DifferentiationInterface, so you can override it with any DI-compatible backend:

```julia
import DifferentiationInterface as DI

x, result = solve(f, ProbSimplex(), [0.5, 0.5];
                  backend=DI.AutoForwardDiff())
```

### Tuning the linear solver
The implicit differentiation backward pass solves a linear system $(\nabla^2_{xx} f + \lambda I) u = dx$ on the reduced (active-face) space.

The `rrule` and `solution_jacobian` use a direct Cholesky factorization of the reduced Hessian (falling back to LU if it is not positive definite). The only tuning parameter is `diff_lambda`:
| Keyword | Default | Description |
|---|---|---|
| `diff_lambda` | `1e-4` | Tikhonov regularization strength |
`bilevel_solve` and `bilevel_gradient` use iterative CG instead, and accept two additional keywords:
| Keyword | Default | Description |
|---|---|---|
| `diff_cg_maxiter` | `50` | Maximum CG iterations |
| `diff_cg_tol` | `1e-6` | CG convergence tolerance on the residual norm |

All three can be passed to `solve` (θ-accepting variants), `bilevel_solve`, `bilevel_gradient`, or directly to `rrule`:

```julia
x, result = solve(f, lmo, x0, θ;
                  grad=∇f!, diff_lambda=1e-3)
```

### Bilevel optimization via rrule
For bilevel problems, call the `rrule` directly to get the pullback. The default backends handle everything automatically: ForwardDiff for gradients and `SECOND_ORDER_BACKEND` (forward-over-forward) for HVPs:

```julia
using ChainRulesCore: rrule

(x_star, result), pb = rrule(solve, f, lmo, x0, θ;
                             grad=∇f!, max_iters=5000)
```

The pullback accepts a tuple `(dx, result_tangent)`, where `dx` is the cotangent of the solution and `result_tangent` is typically `nothing`:

```julia
tangents = pb((dx, nothing))
# tangents = (NoTangent(), NoTangent(), NoTangent(), NoTangent(), dθ)
#             solve        f            lmo          x0           θ
```

Only `dθ` (the last element) is nonzero. The other entries are `NoTangent()` since `f`, `lmo`, and `x0` are not differentiated.
See Bilevel Optimization for a complete worked example with gradient descent on the outer problem.
### Parametric oracle usage

When the constraint set depends on $\theta$, use a `ParametricOracle`:

```julia
using Marguerite, LinearAlgebra

f(x, θ) = 0.5 * dot(x, x) - dot(θ[1:2], x)
∇f!(g, x, θ) = (g .= x .- θ[1:2])
plmo = ParametricBox(θ -> fill(θ[3], 2), θ -> fill(θ[4], 2))
θ = [0.8, 0.2, 0.0, 1.0]
x, result = solve(f, plmo, [0.5, 0.5], θ; grad=∇f!, max_iters=5000, tol=1e-6)
```

The `rrule` for this signature computes $d\theta$ through both the objective and constraint parameters via KKT adjoint differentiation.
## Full Jacobian

To compute the full $n \times m$ Jacobian $\partial x^* / \partial\theta$, use `solution_jacobian` or its in-place variant `solution_jacobian!`:

```julia
using Marguerite, LinearAlgebra

f(x, θ) = 0.5 * dot(x, x) - dot(θ, x)
∇f!(g, x, θ) = (g .= x .- θ)
θ = [0.8, 0.6, 0.4, 0.2, 0.1]
x0 = fill(0.2, 5)
J, result = solution_jacobian(f, ProbSimplex(), x0, θ; grad=∇f!)
```

This is much faster than $n$ separate pullback calls because it forms the reduced Hessian explicitly ($n_{\text{free}}$ HVPs), Cholesky-factors it once, and solves all $m$ right-hand sides in a single backsubstitution.
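The cost difference comes down to reusing one factorization for many right-hand sides, which can be seen on a plain dense system. This is illustrative only; `H` and `B` are made-up stand-ins for the reduced Hessian and the cross-derivative columns:

```julia
using LinearAlgebra

H = [4.0 1.0; 1.0 3.0]            # reduced Hessian (n_free × n_free)
B = [1.0 0.0 2.0; 0.0 1.0 -1.0]   # m = 3 right-hand-side columns

F = cholesky(Symmetric(H))        # one factorization, O(n³) once
J_batched = F \ B                  # all m solves via backsubstitution, O(n²) each

# Equivalent but slower pattern: one solve per column.
J_loop = hcat([F \ B[:, j] for j in 1:size(B, 2)]...)
@assert J_batched ≈ J_loop
```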
`solution_jacobian` does not yet support `ParametricOracle` (constraint sensitivity is not included in the batched computation). For constraint-parametric problems, use the `rrule` pullback instead.
For repeated calls (e.g., inside an optimization loop), pre-allocate the output matrix and use `solution_jacobian!`:

```julia
J = zeros(5, 5)
J, result = solution_jacobian!(J, f, ProbSimplex(), x0, θ; grad=∇f!)
```

See `solution_jacobian` and `solution_jacobian!` in the API Reference.
## rrule

`ChainRulesCore.rrule` — Method

Implicit differentiation rule for `solve(f, lmo, x0, θ; ...)`.

Handles all `lmo` types (plain functions, `AbstractOracle`, `ParametricOracle`) and both manual and auto gradients via the `grad=` keyword.

Uses `_kkt_adjoint_solve_cached` with a precomputed `_PullbackState` so that expensive setup (active-set identification, HVP preparation, constraint orthogonalization, buffer allocation) is performed once and amortized across multiple pullback calls (e.g. when computing a full Jacobian).
See Implicit Differentiation for the full mathematical derivation.