Bilevel Optimization
Why bilevel optimization?
Many machine learning and engineering problems have nested structure: an outer problem (learning, design) wraps an inner constrained optimization:
\[\min_\theta \; L\bigl(x^*(\theta)\bigr), \quad \text{where} \quad x^*(\theta) = \arg\min_{x \in \mathcal{C}} f(x;\, \theta)\]
Examples include hyperparameter tuning, meta-learning, adversarial training, and inverse optimization. Projects like Meta's Theseus and cvxpylayers have shown that differentiable optimization layers are a powerful primitive.
Marguerite supports constrained bilevel problems via the projection-free Frank-Wolfe approach: any constraint set that exposes a linear minimization oracle (LMO) works; no projection operator is needed.
How it works
Marguerite uses implicit differentiation; it does not unroll the solver iterations. At convergence, the optimality condition $\nabla_x f(x^*;\, \theta) \approx 0$ on the optimal face yields, via the implicit function theorem,
\[\bigl[\nabla^2_{xx} f + \lambda I\bigr]\, u = dx, \qquad d\theta = -\left(\frac{\partial \nabla_x f}{\partial \theta}\right)^{\top} u,\]
where $dx$ is the incoming cotangent $\partial L / \partial x^*$ and $\lambda \ge 0$ is a small regularizer. The Hessian system is solved by conjugate gradient using only Hessian-vector products. This gives:
- No graph storage – implicit differentiation needs only the converged solution, not the full iteration history
- Exact gradients at convergence – not a truncated unrolling approximation
- Backend-agnostic – works with any DifferentiationInterface backend
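The formula can be sanity-checked on an unconstrained quadratic inner problem, where the solution map is known in closed form. The sketch below is plain Julia (no Marguerite), with $\lambda = 0$ since the Hessian is well-conditioned:

```julia
using LinearAlgebra

# Inner problem: f(x, θ) = ½ xᵀH x − θᵀx (unconstrained), so x*(θ) = H⁻¹θ.
H = Diagonal([2.0, 1.5, 1.0])
θ = [0.4, 0.2, 0.1]
x_star = H \ θ

# Outer loss L(x) = x₁, so the incoming cotangent is dx = ∂L/∂x* = e₁.
dx = [1.0, 0.0, 0.0]

# Implicit differentiation: solve ∇²ₓₓf · u = dx, then dθ = −(∂∇ₓf/∂θ)ᵀ u.
# Here ∇ₓf = Hx − θ, so ∇²ₓₓf = H and ∂∇ₓf/∂θ = −I, giving dθ = u.
u = H \ dx
dθ = u

# Finite-difference check of dL/dθ where L(θ) = x*(θ)[1]:
ε = 1e-6
fd = [((H \ (θ .+ ε .* (1:3 .== i)))[1] - x_star[1]) / ε for i in 1:3]
@assert isapprox(dθ, fd; atol=1e-4)
```

Here the Hessian solve is direct for clarity; Marguerite replaces it with matrix-free CG.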
The inner solver's monotonic=true flag ensures monotonic descent only within the inner Frank-Wolfe loop. The outer bilevel gradient-descent loop needs its own step-size control: a fixed learning rate can overshoot and increase the outer loss. The examples below use Armijo-style backtracking line search (halving η until the outer loss decreases) to enforce a monotone decrease of the outer loss. In practice, any standard optimizer (Adam, L-BFGS, etc.) can be used instead.
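Pulled out of the loop, the backtracking rule can be written as a small helper (names are illustrative, not part of Marguerite's API):

```julia
# Halve the step until the candidate loss no longer exceeds the current one.
# `loss_at` maps a trial θ to the outer loss; purely illustrative helper.
function armijo_step(loss_at, θ, dθ, η; max_halvings=10)
    loss0 = loss_at(θ)
    η_k = η
    for _ in 1:max_halvings
        loss_at(θ .- η_k .* dθ) ≤ loss0 && break
        η_k /= 2
    end
    return θ .- η_k .* dθ
end

# On L(θ) = ‖θ‖², a deliberately oversized step η = 10 is shrunk until
# the candidate no longer increases the loss:
θ_new = armijo_step(θ -> sum(abs2, θ), [1.0], [2.0], 10.0)
```

In the bilevel loops below the same logic appears inline, with `loss_at` replaced by re-solving the inner problem at the candidate θ.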
High-level API
Marguerite provides bilevel_solve and bilevel_gradient for one-call bilevel optimization. These handle the forward solve, outer loss gradient, and implicit pullback internally:
```julia
using Marguerite, LinearAlgebra

n = 3
H = Diagonal([2.0, 1.5, 1.0])
f(x, θ) = 0.5 * dot(x, H * x) - dot(θ, x)
∇f!(g, x, θ) = (g .= H * x .- θ)
lmo = ProbSimplex()
x0 = fill(1.0 / n, n)

x_target = [0.6, 0.3, 0.1]
outer_loss(x) = sum((x .- x_target).^2)

θ = H * x_target .* rand(n)
η = 0.02
losses = Float64[]
x_curr = copy(x0)
for k in 1:200
    x_star, θ_grad, _ = bilevel_solve(outer_loss, f, lmo, x_curr, θ;
                                      grad=∇f!, max_iters=15_000)
    x_curr .= x_star
    loss_k = outer_loss(x_star)
    push!(losses, loss_k)
    # Armijo backtracking on the outer step
    η_k = η
    for _ in 1:10
        θ_cand = θ .- η_k .* θ_grad
        x_cand, _ = solve(f, lmo, x_curr, θ_cand; grad=∇f!)
        outer_loss(x_cand) ≤ loss_k && break
        η_k *= 0.5
    end
    θ .= θ .- η_k .* θ_grad
end

x_final, _ = solve(f, lmo, x_curr, θ; grad=∇f!)
println("Final loss: ", round(losses[end]; sigdigits=3))
println("x*(θ): ", round.(x_final; digits=3))
println("x_target: ", x_target)
```

```
Final loss: 0.000348
x*(θ): [0.586, 0.31, 0.104]
x_target: [0.6, 0.3, 0.1]
```

```julia
using UnicodePlots
scatterplot(1:200, log10.(losses);
            title="Outer Loss (log₁₀)",
            xlabel="outer iteration", ylabel="log₁₀(loss)",
            name="loss", width=60)
```

The scatterplot shows log₁₀ of the outer loss decreasing steadily from about -1 to roughly -3.5 over the 200 outer iterations.

For just the gradient (without the solution), use bilevel_gradient:
```julia
θ_grad = bilevel_gradient(outer_loss, f, lmo, x0, θ; grad=∇f!)
```

Both functions accept diff_cg_maxiter, diff_cg_tol, and diff_lambda keyword arguments to tune the CG solver used in the implicit-differentiation backward pass. See Implicit Differentiation for details.
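The backward-pass CG never forms the Hessian; it only needs products $[\nabla^2_{xx} f + \lambda I]\,v$. The following hand-rolled, matrix-free CG (a sketch, not Marguerite's internal implementation) illustrates what the max-iteration count, tolerance, and regularization λ control:

```julia
using LinearAlgebra

# Matrix-free CG for (A + λI) u = b, where A enters only through Av products.
function cg_hvp(Av, b, λ; maxiter=100, tol=1e-10)
    u = zero(b)
    r = b .- (Av(u) .+ λ .* u)        # residual b − (A + λI)u
    p = copy(r)
    rs = dot(r, r)
    for _ in 1:maxiter
        Ap = Av(p) .+ λ .* p
        α = rs / dot(p, Ap)
        u .+= α .* p
        r .-= α .* Ap
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return u
end

H = [2.0 0.5; 0.5 1.0]                # a small SPD stand-in for ∇²ₓₓf
b = [1.0, -1.0]                       # the incoming cotangent dx
u = cg_hvp(v -> H * v, b, 1e-8)       # λ plays the role of diff_lambda
```

Increasing λ stabilizes the solve when the Hessian is singular on the optimal face, at the cost of a small bias in the gradient.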
Advanced: manual rrule
For full control, call the rrule directly. This is useful when your outer loss depends on θ directly (not just through x*(θ)), or when you need access to the inner Result diagnostics.
Inner problem: minimize a parameterized quadratic on the probability simplex.
\[x^*(\theta) = \arg\min_{x \in \Delta_n} \;\tfrac{1}{2} x^\top H x - \theta^\top x\]
Outer problem: find $\theta$ such that $x^*(\theta)$ matches a target.
\[\min_\theta \; \|x^*(\theta) - x_{\text{target}}\|^2\]
```julia
using ChainRulesCore: rrule

solve_kw = (;)   # extra keyword arguments forwarded to solve
```

The key pattern: call rrule directly to get the pullback, then compute the outer loss gradient:
```julia
function bilevel_step(x_curr, θ)
    (x_star, result), pb = rrule(solve, f, lmo, x_curr, θ; grad=∇f!, solve_kw...)
    loss = sum((x_star .- x_target).^2)
    dx = 2.0 .* (x_star .- x_target)   # ∂L/∂x*
    tangents = pb((dx, nothing))
    return x_star, loss, tangents[end]  # (x*, L, ∂L/∂θ)
end
```

Run gradient descent on the outer problem:
```julia
θ = H * x_target .* rand(n)
η = 0.02
losses = Float64[]
x_curr = copy(x0)
for k in 1:200
    x_star, loss, dθ = bilevel_step(x_curr, θ)
    x_curr .= x_star
    push!(losses, loss)
    # Armijo backtracking on the outer step
    η_k = η
    for _ in 1:10
        θ_cand = θ .- η_k .* dθ
        x_cand, _ = solve(f, lmo, x_curr, θ_cand; grad=∇f!, solve_kw...)
        outer_loss(x_cand) ≤ loss && break
        η_k *= 0.5
    end
    θ .= θ .- η_k .* dθ
end

x_final, _ = solve(f, lmo, x_curr, θ; grad=∇f!, solve_kw...)
println("Final loss: ", round(losses[end]; sigdigits=3))
println("x*(θ): ", round.(x_final; digits=3))
println("x_target: ", x_target)
```

```
┌ Warning: rrule(solve): solver gap (0.00013629227018575885) >> active set tolerance (1.0e-6); differentiation may be inaccurate. Consider tightening tol.
└ @ Marguerite ~/work/Marguerite.jl/Marguerite.jl/src/diff_rules.jl:298
┌ Warning: rrule(solve): solver gap (0.00014860491986475044) >> active set tolerance (1.0e-6); differentiation may be inaccurate. Consider tightening tol.
└ @ Marguerite ~/work/Marguerite.jl/Marguerite.jl/src/diff_rules.jl:298
┌ Warning: rrule(solve): solver gap (0.00016073051544589212) >> active set tolerance (1.0e-6); differentiation may be inaccurate. Consider tightening tol.
└ @ Marguerite ~/work/Marguerite.jl/Marguerite.jl/src/diff_rules.jl:298
Final loss: 9.18e-6
x*(θ): [0.598, 0.302, 0.1]
x_target: [0.6, 0.3, 0.1]
```

```julia
using UnicodePlots
scatterplot(1:200, log10.(losses);
            title="Outer Loss (log₁₀)",
            xlabel="outer iteration", ylabel="log₁₀(loss)",
            name="loss", width=60)
```

The scatterplot shows log₁₀ of the outer loss decreasing steadily from about -2 to roughly -5 over the 200 outer iterations.

Parametric constraint sets
When the constraint set itself depends on $\theta$, use a ParametricOracle instead of a plain oracle. This enables differentiation through both the objective and the constraints:
\[\min_\theta \; L(x^*(\theta)), \quad x^*(\theta) = \arg\min_{x \in C(\theta)} f(x, \theta)\]
```julia
using Marguerite, LinearAlgebra

n = 3
outer_loss(x) = sum((x .- [0.3, 0.5, 0.2]).^2)
x0 = fill(1.0 / n, n)

# θ = [objective linear term; lower bounds; upper bounds]
theta = [0.3, 0.5, 0.2, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
f(x, θ) = 0.5 * dot(x, x) - dot(θ[1:n], x)
∇f!(g, x, θ) = (g .= x .- θ[1:n])

# Box constraints with θ-dependent bounds
plmo = ParametricBox(θ -> θ[n+1:2n], θ -> θ[2n+1:3n])

x_star, θ_grad, cg_result = bilevel_solve(outer_loss, f, plmo, x0, theta;
                                          grad=∇f!)
# x_star ≈ [0.3, 0.5, 0.2]; θ_grad has 9 components (objective + box bounds)
```

This snippet shows the API pattern only; see High-level API above for a complete bilevel loop with convergence output.
The gradient $d\theta$ accounts for both how $\theta$ affects the objective and how it shifts the constraint boundaries, computed via KKT adjoint differentiation. See Implicit Differentiation for the mathematical details.
Why Frank-Wolfe for bilevel?
Frank-Wolfe has properties that suit bilevel optimization with complex constraints:
- Projection-free: Only needs a linear minimization oracle, not a projection. Many constraint sets (matroid polytopes, flow polytopes, nuclear norm balls) have cheap LMOs but expensive projections.
- Sparse iterates: Solutions are convex combinations of vertices, giving interpretable sparse structure.
- Theoretical guarantees: Palmieri et al. (2026) establish iteration complexity bounds for Frank-Wolfe in bilevel settings.
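For intuition on the projection-free property: the LMO for the probability simplex is a one-line argmin over coordinates, whereas Euclidean projection onto many polytopes requires its own optimization solve. A sketch of such an oracle (illustrative; not the internals of Marguerite's ProbSimplex):

```julia
# Linear minimization oracle for the simplex Δₙ:
# argmin_{x ∈ Δₙ} ⟨d, x⟩ is the vertex eᵢ at the coordinate where d is smallest.
function simplex_lmo(d)
    v = zero(d)
    v[argmin(d)] = 1.0
    return v
end

simplex_lmo([0.3, -1.2, 0.7])   # → [0.0, 1.0, 0.0]
```

The same O(n) pattern extends to ℓ₁ balls and, with a greedy sort, to matroid polytopes, which is why Frank-Wolfe scales to sets where projection is impractical.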
References
- A. Palmieri, F. Rinaldi, S. Salzo & S. Venturini, "Iteration Complexity of Frank-Wolfe and Its Variants for Bilevel Optimization," 2026.
- R. Grazzi, L. Franceschi, M. Pontil & S. Salzo, "On the Iteration Complexity of Hypergradient Computation," ICML, 2020.
- L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi & M. Pontil, "Bilevel Programming for Hyperparameter Optimization and Meta-Learning," ICML, 2018.
- A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond & Z. Kolter, "Differentiable Convex Optimization Layers," NeurIPS, 2019.