ICLR · 2026

In-Context Multi-Objective Optimization

A dimension-agnostic, fully amortized policy for multi-objective black-box optimization — replaces the surrogate-plus-acquisition stack with a single forward pass.

Xinyu Zhang1,2 Conor Hassan1,2 Julien Martinelli1,2 Daolang Huang1,2 Samuel Kaski1,2,3
1ELLIS Institute Finland 2Aalto University, Finland 3University of Manchester, UK

Abstract

Balancing competing objectives is omnipresent across science and engineering — drug efficacy versus toxicity, model accuracy versus latency, material strength versus weight. Multi-objective Bayesian optimization (MOBO) is the standard approach: fit a probabilistic surrogate to each objective, then optimize an acquisition function to select the next design. In practice, however, surrogate and acquisition choices rarely transfer across problems, and the per-step refitting and re-optimization overhead accumulates quickly in parallel or time-sensitive loops.

We present TAMO, a transformer that replaces the entire surrogate-plus-acquisition stack with a single forward pass. By operating directly over the optimization history, without fitting any task-specific model, TAMO can be pretrained once on diverse synthetic tasks and transferred to new problems in any input or objective dimensionality, with no retraining required.

Across synthetic benchmarks and real-world tasks, TAMO reduces proposal time by 50–1000× versus GP-based MOBO while matching or improving Pareto quality under tight evaluation budgets. These results demonstrate that in-context learning is a viable path toward plug-and-play multi-objective optimization — removing the need for per-task surrogate engineering without sacrificing solution quality.

Limitations of Classical MOBO

Real-world design problems rarely have a single number to maximize. The answer is not a single point but a Pareto front: the set of designs where no objective can be improved without hurting another. The goal of multi-objective optimization is to approximate this front with as few costly evaluations as possible.
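The notion of non-dominance above can be made concrete in a few lines. The following is a minimal sketch, not the paper's code: it assumes minimization in every objective and uses NumPy; the function name `pareto_front` is ours.

```python
import numpy as np

def pareto_front(Y):
    """Return the non-dominated rows of Y (minimization in every objective).

    A point dominates another if it is no worse in all objectives and
    strictly better in at least one.
    """
    Y = np.asarray(Y, dtype=float)
    # For each point y, check whether any other point dominates it.
    is_dominated = np.array([
        np.any(np.all(Y <= y, axis=1) & np.any(Y < y, axis=1))
        for y in Y
    ])
    return Y[~is_dominated]
```

Note that a point never dominates itself: the strict-inequality check `np.any(Y < y)` is false for the point's own row.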

To measure progress toward this front, we use the hypervolume (HV) indicator: for a reference point \(r \in \mathbb R^{d_y}\), \(\text{HV}(\mathcal{P}(\mathcal{X}) \mid r)\) measures how much of the objective space between \(r\) and the approximated frontier \(\mathcal{P}(\mathcal{X})\) is covered by Pareto-optimal points.
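For two objectives the hypervolume has a simple closed form: sorting the non-dominated points by the first objective turns the dominated region into a staircase of disjoint rectangles. A minimal sketch (our own illustration, assuming minimization and a reference point dominated by every front point):

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Exact hypervolume for two minimized objectives.

    front: (n, 2) array of mutually non-dominated points.
    ref:   (2,) reference point r, worse than every front point.
    """
    P = np.asarray(front, dtype=float)
    P = P[np.argsort(P[:, 0])]          # ascending in f1 => descending in f2
    hv, prev_f2 = 0.0, float(ref[1])
    for f1, f2 in P:
        # Each front point contributes one rectangle of the staircase.
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv
```

Higher-dimensional hypervolume needs more involved algorithms (e.g. inclusion-exclusion or box decomposition); this 2-D case is only meant to make the indicator concrete.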

[Figure: bi-objective minimization plot (axes f₁, f₂) showing a reference point r, a Pareto front of five non-dominated solutions, the dominated hypervolume region, and several dominated points.]
Figure 1. A Pareto front in two objectives. The goal is to approximate the dashed curve using as few oracle calls as possible, since each evaluation can be slow or expensive.

Classical MOBO refits a statistical surrogate, typically a Gaussian process (GP), at every step and re-optimizes an acquisition function (AF). It works, but surrogate and acquisition choices rarely transfer across problems, and the per-step refitting and re-optimization overhead accumulates quickly in parallel or time-sensitive loops.

The question this paper asks is simple: what if a single pretrained model could replace both, for any problem, in any dimensionality?

TAMO: Task-Agnostic Amortized Multi-Objective Optimization

TAMO is a transformer-based framework trained on a large, diverse corpus of synthetic optimization tasks. At test time, it conditions on the optimization history and suggests the next query in a single forward pass.

TAMO conceptual workflow
Figure 2. Conceptual workflow of TAMO. A single pretrained transformer replaces the entire surrogate-plus-acquisition stack of classical MOBO.
| Method | Multi-objective | End-to-end amortized | Input agnostic | Output agnostic |
| --- | --- | --- | --- | --- |
| Vanilla MOBO | | | | |
| BOFormer | | | | |
| NAP | | | | |
| DANP | | | | |
| TAMO (ours) | ✓ | ✓ | ✓ | ✓ |

Architecture

The key enabling component is a dimension-agnostic embedder: each observation \((\mathbf{x}, \mathbf{y})\), regardless of how many input features or objectives it has, is projected to a fixed-size vector by applying scalar-to-vector maps dimension-wise, letting the resulting per-dimension tokens interact via attention, adding learnable positional tokens, and mean-pooling. This lets the same backbone process tasks with different \(d_x\) and \(d_y\) without any architectural changes.
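The embedder's four steps can be sketched in NumPy. This is a minimal illustration of the idea, not the paper's implementation: the width `D`, the maximum of 64 dimensions, and all weight shapes are our assumptions, and the random matrices stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding width (hypothetical choice)

# One shared scalar-to-vector map, reused for every dimension,
# so the same parameters handle any d_x or d_y.
W_val, b_val = rng.normal(size=(1, D)), rng.normal(size=D)
pos_tokens = rng.normal(size=(64, D))  # stand-in for learnable positional tokens
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def embed(x, y):
    """Map one observation (x, y) of any dimensionality to a fixed-size vector."""
    scalars = np.concatenate([x, y])                 # (d_x + d_y,)
    tokens = scalars[:, None] @ W_val + b_val        # dimension-wise lift to (n, D)
    tokens = tokens + pos_tokens[: len(scalars)]     # mark which dimension is which
    attn = softmax(tokens @ Wq @ (tokens @ Wk).T / np.sqrt(D))
    tokens = attn @ (tokens @ Wv)                    # dimensions interact via attention
    return tokens.mean(axis=0)                       # mean-pool to a fixed size
```

Because the pooled output is always a length-`D` vector, a \((d_x{=}3, d_y{=}2)\) observation and a \((d_x{=}7, d_y{=}4)\) observation land in the same representation space.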

TAMO architecture diagram
Figure 3. Detailed architecture of TAMO, showing the dimension-agnostic embedder, shared transformer backbone, and the two training heads (prediction and optimization).

Pretraining

TAMO is pretrained offline on a distribution of synthetic multi-objective tasks drawn from Gaussian process priors with varied kernels, length scales, and input/output dimensionalities. Because the embedder is dimension-agnostic, a single pretraining run can draw tasks with different numbers of features and objectives.
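Drawing one such synthetic task might look as follows. This is a hedged sketch: the RBF kernel, the lengthscale range, the dimension ranges, and independent draws per objective are our illustrative assumptions, not the paper's exact pretraining recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gp_task(n=64, d_x=None, d_y=None):
    """Draw one synthetic multi-objective task from an RBF-kernel GP prior."""
    d_x = d_x or int(rng.integers(1, 6))   # input dimensionality varies per task
    d_y = d_y or int(rng.integers(2, 4))   # number of objectives varies per task
    X = rng.uniform(size=(n, d_x))
    Y = np.empty((n, d_y))
    for k in range(d_y):                   # one independent GP draw per objective
        ls = rng.uniform(0.1, 1.0)         # lengthscale varies across objectives
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * sq / ls**2) + 1e-6 * np.eye(n)  # jitter for stability
        Y[:, k] = rng.multivariate_normal(np.zeros(n), K)
    return X, Y
```

Since the embedder is dimension-agnostic, tasks of different `d_x` and `d_y` can be mixed freely within one pretraining batch.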

Two training signals share the same transformer backbone: a reinforcement-learned policy head that proposes queries, and an in-context prediction head that serves as an auxiliary regression task.

Policy head & RL objective

The policy head assigns each candidate \(\mathbf{x}_i^{q}\) a scalar logit \(\alpha_i := \mathrm{MLP}_\theta(\hat{\mathbf{E}}_i^{q})\) and converts them to a categorical distribution over the query set:

\[ \pi_\theta\!\bigl(\mathbf{x}_i^{q}\mid t, T, \mathcal{H}_{1:t-1}\bigr) \;=\; \frac{\exp(\alpha_i)}{\displaystyle\sum_{r=1}^{N_q}\exp(\alpha_r)}. \]

At each step \(t\), observing \(\mathbf{y}_t\) yields a reward equal to the normalized hypervolume level, \(r_t = \mathrm{HV}(\mathcal{P}(\mathcal{D}^h)\mid\mathbf{r})\,/\,\mathrm{HV}^\star_\tau\), where \(\mathcal{D}^h\) denotes the observations collected so far and \(\mathrm{HV}^\star_\tau\) the maximal attainable hypervolume of task \(\tau\). The policy maximizes the expected discounted return over full trajectories,

\[ J(\theta) \;=\; \mathbb{E}_{\tau\sim p(\tau)}\!\left[ \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right] \right], \]

estimated with REINFORCE. Training on complete trajectories encourages non-myopic, long-horizon Pareto improvement — in contrast to standard one-step acquisitions.
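A bare-bones REINFORCE estimator for this objective can be written down directly. The sketch below is ours, not the paper's training code: it differentiates only with respect to the logits \(\alpha\) (backpropagation through the transformer is omitted), uses no baseline, and the discount value is illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def reinforce_grad(logit_trajectory, actions, rewards, gamma=0.98):
    """Score-function estimate of the policy gradient for one trajectory.

    logit_trajectory: per-step (N_q,) logit vectors alpha over the candidate pool.
    actions: index of the candidate chosen at each step.
    rewards: per-step normalized hypervolume-level rewards r_t.
    Returns per-step gradients with respect to the logits.
    """
    T = len(rewards)
    # Discounted return-to-go: G_t = sum_{s >= t} gamma^(s - t) * r_s
    G, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    grads = []
    for t, (alpha, a) in enumerate(zip(logit_trajectory, actions)):
        pi = softmax(alpha)
        dlogpi = -pi
        dlogpi[a] += 1.0                          # grad of log pi(a) wrt alpha
        grads.append(gamma**t * G[t] * dlogpi)    # weight by discounted return
    return grads
```

Because the weighting uses the full return-to-go \(G_t\) rather than a one-step improvement, an early query that only pays off later in the trajectory still receives credit, which is the source of the non-myopic behavior described above.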

Prediction head & auxiliary objective

The prediction head models each scalar objective \(y_{i,k}^{p}\) as a \(K\)-component Gaussian mixture conditioned on a context set \(\mathcal{D}^c\):

\[ p\!\left(y_{i,k}^{p}\mid\mathbf{x}_i^{p},\mathcal{D}^{c}\right) \;=\; \sum_{\ell=1}^{K} \phi_{i\ell}\,\mathcal{N}\!\bigl(y_{i,k}^{p};\,\mu_{i\ell},\,\sigma_{i\ell}^{2}\bigr). \]

This in-context regression task is trained by minimizing the negative log-likelihood:

\[ \mathcal{L}^{(p)}(\theta) \;=\; -\,\mathbb{E}_{\tau\sim p(\tau)}\!\left[ \frac{1}{N_p\, d_y^\tau} \sum_{i=1}^{N_p}\sum_{k=1}^{d_y^\tau} \log p\!\left(y_{i,k}^{p}\mid\mathbf{x}_i^{p},\mathcal{D}^{c}\right) \right]. \]

This auxiliary task regularizes the shared representations and stabilizes policy learning.

Combined objective

The two losses are combined into a single objective, optimized jointly after an initial prediction warm-up phase:

\[ \mathcal{L}(\theta) \;=\; \lambda_p\,\mathcal{L}^{(p)}(\theta) + \mathcal{L}^{(\mathrm{rl})}(\theta), \qquad \mathcal{L}^{(\mathrm{rl})}(\theta) = -J(\theta), \]

with \(\lambda_p = 1.0\) fixed across all experiments.
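Scheduling-wise, the combination is simple. The sketch below is illustrative: the warm-up length and the hard switch are our assumptions; the source only fixes \(\lambda_p = 1.0\) and states that joint optimization follows a prediction warm-up.

```python
def combined_loss(loss_pred, loss_rl, step, warmup_steps=1000, lam_p=1.0):
    """Joint objective with an initial prediction-only warm-up phase.

    warmup_steps is a hypothetical value; lam_p = 1.0 matches the paper.
    """
    if step < warmup_steps:
        return lam_p * loss_pred           # warm-up: train the prediction head only
    return lam_p * loss_pred + loss_rl     # afterwards: joint objective
```

Warming up the prediction head first gives the shared backbone a useful in-context representation before the higher-variance RL signal is switched on.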

Results

We evaluate TAMO on synthetic and real-world multi-objective benchmarks against GP-based MOBO baselines (qNEHVI, qNParEGO, qHVKG) and the amortized baseline BOFormer.

We report simple regret — how far the hypervolume of the approximated Pareto front is from the maximal hypervolume (lower is better) — and cumulative wall-clock proposal time, which for GP-based methods includes surrogate refitting and acquisition optimization at every step.
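As a small worked example of the metric, simple regret is just the gap between the maximal hypervolume and the hypervolume achieved so far, tracked per oracle call. The numbers below are illustrative, not results from the paper.

```python
import numpy as np

def simple_regret(hv_trace, hv_star):
    """Hypervolume simple regret after each oracle call (lower is better)."""
    return hv_star - np.asarray(hv_trace, dtype=float)

# Illustrative hypervolume trace of an improving Pareto front approximation.
hv_trace = [4.0, 7.0, 10.0, 11.0]
regrets = simple_regret(hv_trace, hv_star=11.0)
```

Since the approximated front only ever gains points, the hypervolume trace is non-decreasing and the regret curve is non-increasing.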

Results on synthetic and real-world benchmarks
Figure 4. Synthetic and real-world multi-objective benchmarks: simple regret (top) and cumulative inference time (bottom) vs. oracle calls (mean ± 95% CI over 30 runs). TAMO achieves competitive regret while cutting proposal time by 50–1000×.
Results on HPO-3DGS hyperparameter optimization tasks
Figure 5. HPO-3DGS hyperparameter optimization tasks. TAMO is competitive with the best-performing GP-based methods (with qNEHVI slightly ahead on Mic), while achieving significantly lower inference time.

Generalization

A notable result is that a single pretrained model transfers across heterogeneous problem structures with no retraining, both to unseen input dimensionalities and to settings where each objective is observed independently.

Results on out-of-distribution dimensions
Figure 6. Transfer to out-of-distribution input dimensions.
Results on independently observed objectives
Figure 7. Transfer to independently observed objectives.

Takeaways

| Aspect | Finding |
| --- | --- |
| Pareto quality | Matches or beats specialized GP baselines and amortized methods on simple regret under tight evaluation budgets. |
| Wall-clock speed | 50–1000× lower proposal time than GP baselines and amortized methods. |
| Synthetic-to-real transfer | Despite training only on synthetic GP priors, TAMO transfers competitively to real-world tasks, with no real-data pretraining required. |
| Generalization | One pretrained model across input/output dimensions and function families. |
| Single-objective BO | Applies directly to single-objective optimization, yielding competitive results against GP baselines. |

Scope & Limitations

TAMO is pretrained entirely on synthetic GP samples, which provides scale and control but may miss salient real-world structure. Performance is likely sensitive to the GP kernel family and smoothness used during pretraining. A systematic study of how corpus composition drives transfer is an important open direction.

Inference currently assumes a discrete candidate pool, which aligns well with discrete or low-dimensional design problems, but is restrictive in high-dimensional or generative settings, such as de novo molecular design. Extending TAMO to continuous proposal mechanisms (e.g., diffusion-based or autoregressive generators) is a natural next step.

BibTeX

@inproceedings{zhang2026incontext,
  title     = {In-Context Multi-Objective Optimization},
  author    = {Xinyu Zhang and Conor Hassan and Julien Martinelli and Daolang Huang and Samuel Kaski},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=odmeUlWta8}
}