Bilevel Optimization
Why bilevel optimization?
Many machine learning and engineering problems have nested structure: an outer problem (learning, design) wraps an inner constrained optimization:
\[\min_\theta \; L\bigl(x^*(\theta)\bigr), \quad \text{where} \quad x^*(\theta) = \arg\min_{x \in \mathcal{C}} f(x;\, \theta)\]
Examples include hyperparameter tuning, meta-learning, adversarial training, and inverse optimization. Projects like Meta's Theseus and cvxpylayers have shown that differentiable optimization layers are a powerful primitive.
Marguerite supports constrained bilevel problems via the projection-free Frank-Wolfe approach: any constraint set that exposes a linear minimization oracle (LMO) works; no projection operator is needed.
How it works
Marguerite uses implicit differentiation; it does not unroll the solver iterations. At convergence, the optimality condition $\nabla_x f(x^*;\, \theta) \approx 0$ on the optimal face yields, via the implicit function theorem,
\[\bigl[\nabla^2_{xx} f + \lambda I\bigr]\, u = dx, \qquad d\theta = -\left(\frac{\partial \nabla_x f}{\partial \theta}\right)^{\top} u,\]
where $dx$ is the incoming cotangent $\partial L / \partial x^*$ and $\lambda \ge 0$ is a small regularizer. The Hessian system is solved by conjugate gradient using only Hessian-vector products. This gives:
- No graph storage – implicit differentiation needs only the converged solution, not the full iteration history
- Exact gradients at convergence – not a truncated unrolling approximation
- Backend-agnostic – works with any DifferentiationInterface backend
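The formula can be sanity-checked on an unconstrained quadratic inner problem, where the solution map is known in closed form. The sketch below is plain Julia (no Marguerite), with $\lambda = 0$ since the Hessian is well-conditioned:

```julia
using LinearAlgebra

# Inner problem: f(x, θ) = ½ xᵀH x − θᵀx (unconstrained), so x*(θ) = H⁻¹θ.
H = Diagonal([2.0, 1.5, 1.0])
θ = [0.4, 0.2, 0.1]
x_star = H \ θ

# Outer loss L(x) = x₁, so the incoming cotangent is dx = ∂L/∂x* = e₁.
dx = [1.0, 0.0, 0.0]

# Implicit differentiation: solve ∇²ₓₓf · u = dx, then dθ = −(∂∇ₓf/∂θ)ᵀ u.
# Here ∇ₓf = Hx − θ, so ∇²ₓₓf = H and ∂∇ₓf/∂θ = −I, giving dθ = u.
u = H \ dx
dθ = u

# Finite-difference check of dL/dθ where L(θ) = x*(θ)[1]:
ε = 1e-6
fd = [((H \ (θ .+ ε .* (1:3 .== i)))[1] - x_star[1]) / ε for i in 1:3]
@assert isapprox(dθ, fd; atol=1e-4)
```

Here the Hessian solve is direct for clarity; Marguerite replaces it with matrix-free CG.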
The inner solver's monotonic=true flag ensures monotonic descent only within the inner Frank-Wolfe loop. The outer bilevel gradient-descent loop needs its own step-size control: a fixed learning rate can overshoot and increase the outer loss. The examples below use Armijo-style backtracking line search (halving η until the outer loss decreases) to enforce a monotone decrease of the outer loss. In practice, any standard optimizer (Adam, L-BFGS, etc.) can be used instead.
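Pulled out of the loop, the backtracking rule can be written as a small helper (names are illustrative, not part of Marguerite's API):

```julia
# Halve the step until the candidate loss no longer exceeds the current one.
# `loss_at` maps a trial θ to the outer loss; purely illustrative helper.
function armijo_step(loss_at, θ, dθ, η; max_halvings=10)
    loss0 = loss_at(θ)
    η_k = η
    for _ in 1:max_halvings
        loss_at(θ .- η_k .* dθ) ≤ loss0 && break
        η_k /= 2
    end
    return θ .- η_k .* dθ
end

# On L(θ) = ‖θ‖², a deliberately oversized step η = 10 is shrunk until
# the candidate no longer increases the loss:
θ_new = armijo_step(θ -> sum(abs2, θ), [1.0], [2.0], 10.0)
```

In the bilevel loops below the same logic appears inline, with `loss_at` replaced by re-solving the inner problem at the candidate θ.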
High-level API
Marguerite provides bilevel_solve and bilevel_gradient for one-call bilevel optimization. These handle the forward solve, outer loss gradient, and implicit pullback internally:
```julia
using Marguerite, LinearAlgebra

n = 3
H = Diagonal([2.0, 1.5, 1.0])
f(x, θ) = 0.5 * dot(x, H * x) - dot(θ, x)
∇f!(g, x, θ) = (g .= H * x .- θ)
lmo = ProbSimplex()
x0 = fill(1.0 / n, n)

x_target = [0.6, 0.3, 0.1]
outer_loss(x) = sum((x .- x_target).^2)

θ = H * x_target .* rand(n)
η = 0.02
losses = Float64[]
x_curr = copy(x0)
for k in 1:200
    x_star, θ_grad, _ = bilevel_solve(outer_loss, f, lmo, x_curr, θ;
                                      grad=∇f!, max_iters=15_000)
    x_curr .= x_star
    loss_k = outer_loss(x_star)
    push!(losses, loss_k)
    # Armijo backtracking on the outer step
    η_k = η
    for _ in 1:10
        θ_cand = θ .- η_k .* θ_grad
        x_cand, _ = solve(f, lmo, x_curr, θ_cand; grad=∇f!)
        outer_loss(x_cand) ≤ loss_k && break
        η_k *= 0.5
    end
    θ .= θ .- η_k .* θ_grad
end

x_final, _ = solve(f, lmo, x_curr, θ; grad=∇f!)
println("Final loss: ", round(losses[end]; sigdigits=3))
println("x*(θ): ", round.(x_final; digits=3))
println("x_target: ", x_target)
```

```
Final loss: 0.000348
x*(θ): [0.586, 0.31, 0.104]
x_target: [0.6, 0.3, 0.1]
```

```julia
using UnicodePlots
scatterplot(1:200, log10.(losses);
            title="Outer Loss (log₁₀)",
            xlabel="outer iteration", ylabel="log₁₀(loss)",
            name="loss", width=60)
```

The scatterplot shows log₁₀ of the outer loss decreasing steadily from about -1 to roughly -3.5 over the 200 outer iterations.

For just the gradient (without the solution), use bilevel_gradient:
```julia
θ_grad = bilevel_gradient(outer_loss, f, lmo, x0, θ; grad=∇f!)
```

Both functions accept diff_cg_maxiter, diff_cg_tol, and diff_lambda keyword arguments to tune the CG solver used in the implicit-differentiation backward pass. See Implicit Differentiation for details.
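The backward-pass CG never forms the Hessian; it only needs products $[\nabla^2_{xx} f + \lambda I]\,v$. The following hand-rolled, matrix-free CG (a sketch, not Marguerite's internal implementation) illustrates what the max-iteration count, tolerance, and regularization λ control:

```julia
using LinearAlgebra

# Matrix-free CG for (A + λI) u = b, where A enters only through Av products.
function cg_hvp(Av, b, λ; maxiter=100, tol=1e-10)
    u = zero(b)
    r = b .- (Av(u) .+ λ .* u)        # residual b − (A + λI)u
    p = copy(r)
    rs = dot(r, r)
    for _ in 1:maxiter
        Ap = Av(p) .+ λ .* p
        α = rs / dot(p, Ap)
        u .+= α .* p
        r .-= α .* Ap
        rs_new = dot(r, r)
        sqrt(rs_new) < tol && break
        p .= r .+ (rs_new / rs) .* p
        rs = rs_new
    end
    return u
end

H = [2.0 0.5; 0.5 1.0]                # a small SPD stand-in for ∇²ₓₓf
b = [1.0, -1.0]                       # the incoming cotangent dx
u = cg_hvp(v -> H * v, b, 1e-8)       # λ plays the role of diff_lambda
```

Increasing λ stabilizes the solve when the Hessian is singular on the optimal face, at the cost of a small bias in the gradient.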
Advanced: manual rrule
For full control, call the rrule directly. This is useful when your outer loss depends on θ directly (not just through x*(θ)), or when you need access to the inner Result diagnostics.
Inner problem: minimize a parameterized quadratic on the probability simplex.
\[x^*(\theta) = \arg\min_{x \in \Delta_n} \;\tfrac{1}{2} x^\top H x - \theta^\top x\]
Outer problem: find $\theta$ such that $x^*(\theta)$ matches a target.
\[\min_\theta \; \|x^*(\theta) - x_{\text{target}}\|^2\]
```julia
using ChainRulesCore: rrule

solve_kw = (;)   # extra keyword arguments forwarded to solve
```

The key pattern: call rrule directly to get the pullback, then compute the outer loss gradient:
```julia
function bilevel_step(x_curr, θ)
    (x_star, result), pb = rrule(solve, f, lmo, x_curr, θ; grad=∇f!, solve_kw...)
    loss = sum((x_star .- x_target).^2)
    dx = 2.0 .* (x_star .- x_target)   # ∂L/∂x*
    tangents = pb((dx, nothing))
    return x_star, loss, tangents[end]  # (x*, L, ∂L/∂θ)
end
```

Run gradient descent on the outer problem:
```julia
θ = H * x_target .* rand(n)
η = 0.02
losses = Float64[]
x_curr = copy(x0)
for k in 1:200
    x_star, loss, dθ = bilevel_step(x_curr, θ)
    x_curr .= x_star
    push!(losses, loss)
    # Armijo backtracking on the outer step
    η_k = η
    for _ in 1:10
        θ_cand = θ .- η_k .* dθ
        x_cand, _ = solve(f, lmo, x_curr, θ_cand; grad=∇f!, solve_kw...)
        outer_loss(x_cand) ≤ loss && break
        η_k *= 0.5
    end
    θ .= θ .- η_k .* dθ
end

x_final, _ = solve(f, lmo, x_curr, θ; grad=∇f!, solve_kw...)
println("Final loss: ", round(losses[end]; sigdigits=3))
println("x*(θ): ", round.(x_final; digits=3))
println("x_target: ", x_target)
```

```
┌ Warning: rrule(solve): solver gap (0.00013629227018575885) >> active set tolerance (1.0e-6); differentiation may be inaccurate. Consider tightening tol.
└ @ Marguerite ~/work/Marguerite.jl/Marguerite.jl/src/diff_rules.jl:298
┌ Warning: rrule(solve): solver gap (0.00014860491986475044) >> active set tolerance (1.0e-6); differentiation may be inaccurate. Consider tightening tol.
└ @ Marguerite ~/work/Marguerite.jl/Marguerite.jl/src/diff_rules.jl:298
┌ Warning: rrule(solve): solver gap (0.00016073051544589212) >> active set tolerance (1.0e-6); differentiation may be inaccurate. Consider tightening tol.
└ @ Marguerite ~/work/Marguerite.jl/Marguerite.jl/src/diff_rules.jl:298
Final loss: 9.18e-6
x*(θ): [0.598, 0.302, 0.1]
x_target: [0.6, 0.3, 0.1]
```

```julia
using UnicodePlots
scatterplot(1:200, log10.(losses);
            title="Outer Loss (log₁₀)",
            xlabel="outer iteration", ylabel="log₁₀(loss)",
            name="loss", width=60)
```

The scatterplot shows log₁₀ of the outer loss decreasing steadily from about -2 to roughly -5 over the 200 outer iterations.

Parametric constraint sets
When the constraint set itself depends on $\theta$, use a ParametricOracle instead of a plain oracle. This enables differentiation through both the objective and the constraints:
\[\min_\theta \; L(x^*(\theta)), \quad x^*(\theta) = \arg\min_{x \in C(\theta)} f(x, \theta)\]
```julia
using Marguerite, LinearAlgebra

n = 3
outer_loss(x) = sum((x .- [0.3, 0.5, 0.2]).^2)
x0 = fill(1.0 / n, n)

# θ = [objective linear term; lower bounds; upper bounds]
theta = [0.3, 0.5, 0.2, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
f(x, θ) = 0.5 * dot(x, x) - dot(θ[1:n], x)
∇f!(g, x, θ) = (g .= x .- θ[1:n])

# Box constraints with θ-dependent bounds
plmo = ParametricBox(θ -> θ[n+1:2n], θ -> θ[2n+1:3n])

x_star, θ_grad, cg_result = bilevel_solve(outer_loss, f, plmo, x0, theta;
                                          grad=∇f!)
# x_star ≈ [0.3, 0.5, 0.2]; θ_grad has 9 components (objective + box bounds)
```

This snippet shows the API pattern only; see High-level API above for a complete bilevel loop with convergence output.
The gradient $d\theta$ accounts for both how $\theta$ affects the objective and how it shifts the constraint boundaries, computed via KKT adjoint differentiation. See Implicit Differentiation for the mathematical details.
Why Frank-Wolfe for bilevel?
Frank-Wolfe has properties that suit bilevel optimization with complex constraints:
- Projection-free: Only needs a linear minimization oracle, not a projection. Many constraint sets (matroid polytopes, flow polytopes, nuclear norm balls) have cheap LMOs but expensive projections.
- Sparse iterates: Solutions are convex combinations of vertices, giving interpretable sparse structure.
- Theoretical guarantees: Palmieri et al. (2026) establish iteration complexity bounds for Frank-Wolfe in bilevel settings.
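For intuition on the projection-free property: the LMO for the probability simplex is a one-line argmin over coordinates, whereas Euclidean projection onto many polytopes requires its own optimization solve. A sketch of such an oracle (illustrative; not the internals of Marguerite's ProbSimplex):

```julia
# Linear minimization oracle for the simplex Δₙ:
# argmin_{x ∈ Δₙ} ⟨d, x⟩ is the vertex eᵢ at the coordinate where d is smallest.
function simplex_lmo(d)
    v = zero(d)
    v[argmin(d)] = 1.0
    return v
end

simplex_lmo([0.3, -1.2, 0.7])   # → [0.0, 1.0, 0.0]
```

The same O(n) pattern extends to ℓ₁ balls and, with a greedy sort, to matroid polytopes, which is why Frank-Wolfe scales to sets where projection is impractical.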
References
- A. Palmieri, F. Rinaldi, S. Salzo & S. Venturini, "Iteration Complexity of Frank-Wolfe and Its Variants for Bilevel Optimization," 2026.
- R. Grazzi, L. Franceschi, M. Pontil & S. Salzo, "On the Iteration Complexity of Hypergradient Computation," ICML, 2020.
- L. Franceschi, P. Frasconi, S. Salzo, R. Grazzi & M. Pontil, "Bilevel Programming for Hyperparameter Optimization and Meta-Learning," ICML, 2018.
- A. Agrawal, B. Amos, S. Barratt, S. Boyd, S. Diamond & Z. Kolter, "Differentiable Convex Optimization Layers," NeurIPS, 2019.