ICLR · 2026

In-Context Multi-Objective Optimization

A dimension-agnostic, fully amortized policy for multi-objective black-box optimization — replaces the surrogate-plus-acquisition stack with a single forward pass.

Xinyu Zhang1,2 Conor Hassan1,2 Julien Martinelli1,2 Daolang Huang1,2 Samuel Kaski1,2,3
1ELLIS Institute Finland 2Aalto University, Finland 3University of Manchester, UK

Abstract

Balancing competing objectives is omnipresent across science and engineering — drug efficacy versus toxicity, model accuracy versus latency, material strength versus weight. Multi-objective Bayesian optimization (MOBO) is the standard approach: fit a probabilistic surrogate to each objective, then optimize an acquisition function to select the next design. In practice, however, surrogate and acquisition choices rarely transfer across problems, and the per-step refitting and re-optimization overhead accumulates quickly in parallel or time-sensitive loops.

We present TAMO, a transformer that replaces the entire surrogate-plus-acquisition stack with a single forward pass. By operating directly over the optimization history, without fitting any task-specific model, TAMO can be pretrained once on diverse synthetic tasks and transferred to new problems in any input or objective dimensionality, with no retraining required.

Across synthetic benchmarks and real-world tasks, TAMO reduces proposal time by 50–1000× versus GP-based MOBO while matching or improving Pareto quality under tight evaluation budgets. These results demonstrate that in-context learning is a viable path toward plug-and-play multi-objective optimization — removing the need for per-task surrogate engineering without sacrificing solution quality.

Limitations of Classical MOBO

Real-world design problems rarely have a single number to maximize. The answer is not a single point but a Pareto front: the set of designs where no objective can be improved without hurting another. The goal of multi-objective optimization is to approximate this front with as few costly evaluations as possible.
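The notion of non-dominance above can be made concrete in a few lines. The following is a minimal sketch, not the paper's code: it assumes minimization in every objective and uses NumPy; the function name `pareto_front` is ours.

```python
import numpy as np

def pareto_front(Y):
    """Return the non-dominated rows of Y (minimization in every objective).

    A point dominates another if it is no worse in all objectives and
    strictly better in at least one.
    """
    Y = np.asarray(Y, dtype=float)
    # For each point y, check whether any other point dominates it.
    is_dominated = np.array([
        np.any(np.all(Y <= y, axis=1) & np.any(Y < y, axis=1))
        for y in Y
    ])
    return Y[~is_dominated]
```

Note that a point never dominates itself: the strict-inequality check `np.any(Y < y)` is false for the point's own row.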

To measure progress toward this front, we use the hypervolume (HV) indicator: for a reference point \(r \in \mathbb R^{d_y}\), \(\text{HV}(\mathcal{P}(\mathcal{X}) \mid r)\) measures how much of the objective space between \(r\) and the approximated frontier \(\mathcal{P}(\mathcal{X})\) is covered by Pareto-optimal points.
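For two objectives the hypervolume has a simple closed form: sorting the non-dominated points by the first objective turns the dominated region into a staircase of disjoint rectangles. A minimal sketch (our own illustration, assuming minimization and a reference point dominated by every front point):

```python
import numpy as np

def hypervolume_2d(front, ref):
    """Exact hypervolume for two minimized objectives.

    front: (n, 2) array of mutually non-dominated points.
    ref:   (2,) reference point r, worse than every front point.
    """
    P = np.asarray(front, dtype=float)
    P = P[np.argsort(P[:, 0])]          # ascending in f1 => descending in f2
    hv, prev_f2 = 0.0, float(ref[1])
    for f1, f2 in P:
        # Each front point contributes one rectangle of the staircase.
        hv += (ref[0] - f1) * (prev_f2 - f2)
        prev_f2 = f2
    return hv
```

Higher-dimensional hypervolume needs more involved algorithms (e.g. inclusion-exclusion or box decomposition); this 2-D case is only meant to make the indicator concrete.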

[Figure: bi-objective minimization plot (axes f₁, f₂) showing a reference point r, a Pareto front of five non-dominated solutions, the dominated hypervolume region, and several dominated points.]
Figure 1. A Pareto front in two objectives. The goal is to approximate the dashed curve using as few oracle calls as possible, since each evaluation can be slow or expensive.

Classical MOBO refits a statistical surrogate, typically a Gaussian process (GP), at every step and re-optimizes an acquisition function (AF). It works, but surrogate and acquisition choices rarely transfer across problems, and the per-step refitting and re-optimization overhead accumulates quickly in parallel or time-sensitive loops.

The question this paper asks is simple: what if a single pretrained model could replace both, for any problem, in any dimensionality?

TAMO: Task-Agnostic Amortized Multi-Objective Optimization

TAMO is a transformer-based framework trained on a large, diverse corpus of synthetic optimization tasks. At test time, it conditions on the optimization history and suggests the next query in a single forward pass.

TAMO conceptual workflow
Figure 2. Conceptual workflow of TAMO. A single pretrained transformer replaces the entire surrogate-plus-acquisition stack of classical MOBO.
| Method | Multi-objective | End-to-end amortized | Input agnostic | Output agnostic |
| --- | --- | --- | --- | --- |
| Vanilla MOBO | | | | |
| BOFormer | | | | |
| NAP | | | | |
| DANP | | | | |
| TAMO (ours) | ✓ | ✓ | ✓ | ✓ |

Architecture

The key enabling component is a dimension-agnostic embedder: each observation \((\mathbf{x}, \mathbf{y})\), regardless of how many input features or objectives it has, is projected to a fixed-size vector by applying scalar-to-vector maps dimension-wise, letting the resulting per-dimension tokens interact via attention, adding learnable positional tokens, and mean-pooling. This lets the same backbone process tasks with different \(d_x\) and \(d_y\) without any architectural changes.
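The embedder's four steps can be sketched in NumPy. This is a minimal illustration of the idea, not the paper's implementation: the width `D`, the maximum of 64 dimensions, and all weight shapes are our assumptions, and the random matrices stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # embedding width (hypothetical choice)

# One shared scalar-to-vector map, reused for every dimension,
# so the same parameters handle any d_x or d_y.
W_val, b_val = rng.normal(size=(1, D)), rng.normal(size=D)
pos_tokens = rng.normal(size=(64, D))  # stand-in for learnable positional tokens
Wq, Wk, Wv = (rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def embed(x, y):
    """Map one observation (x, y) of any dimensionality to a fixed-size vector."""
    scalars = np.concatenate([x, y])                 # (d_x + d_y,)
    tokens = scalars[:, None] @ W_val + b_val        # dimension-wise lift to (n, D)
    tokens = tokens + pos_tokens[: len(scalars)]     # mark which dimension is which
    attn = softmax(tokens @ Wq @ (tokens @ Wk).T / np.sqrt(D))
    tokens = attn @ (tokens @ Wv)                    # dimensions interact via attention
    return tokens.mean(axis=0)                       # mean-pool to a fixed size
```

Because the pooled output is always a length-`D` vector, a \((d_x{=}3, d_y{=}2)\) observation and a \((d_x{=}7, d_y{=}4)\) observation land in the same representation space.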

TAMO architecture diagram
Figure 3. Detailed architecture of TAMO, showing the dimension-agnostic embedder, shared transformer backbone, and the two training heads (prediction and optimization).

Pretraining

TAMO is pretrained offline on a distribution of synthetic multi-objective tasks drawn from Gaussian process priors with varied kernels, length scales, and input/output dimensionalities. Because the embedder is dimension-agnostic, a single pretraining run can draw tasks with different numbers of features and objectives.
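Drawing one such synthetic task might look as follows. This is a hedged sketch: the RBF kernel, the lengthscale range, the dimension ranges, and independent draws per objective are our illustrative assumptions, not the paper's exact pretraining recipe.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_gp_task(n=64, d_x=None, d_y=None):
    """Draw one synthetic multi-objective task from an RBF-kernel GP prior."""
    d_x = d_x or int(rng.integers(1, 6))   # input dimensionality varies per task
    d_y = d_y or int(rng.integers(2, 4))   # number of objectives varies per task
    X = rng.uniform(size=(n, d_x))
    Y = np.empty((n, d_y))
    for k in range(d_y):                   # one independent GP draw per objective
        ls = rng.uniform(0.1, 1.0)         # lengthscale varies across objectives
        sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
        K = np.exp(-0.5 * sq / ls**2) + 1e-6 * np.eye(n)  # jitter for stability
        Y[:, k] = rng.multivariate_normal(np.zeros(n), K)
    return X, Y
```

Since the embedder is dimension-agnostic, tasks of different `d_x` and `d_y` can be mixed freely within one pretraining batch.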

Two training signals share the same transformer backbone: a reinforcement-learned policy head that proposes queries, and an in-context prediction head that serves as an auxiliary regression task.

Policy head & RL objective

The policy head assigns each candidate \(\mathbf{x}_i^{q}\) a scalar logit \(\alpha_i := \mathrm{MLP}_\theta(\hat{\mathbf{E}}_i^{q})\) and converts them to a categorical distribution over the query set:

\[ \pi_\theta\!\bigl(\mathbf{x}_i^{q}\mid t, T, \mathcal{H}_{1:t-1}\bigr) \;=\; \frac{\exp(\alpha_i)}{\displaystyle\sum_{r=1}^{N_q}\exp(\alpha_r)}. \]

At each step \(t\), observing \(\mathbf{y}_t\) yields a reward equal to the normalized hypervolume level, \(r_t = \mathrm{HV}(\mathcal{P}(\mathcal{D}^h)\mid\mathbf{r})\,/\,\mathrm{HV}^\star_\tau\), where \(\mathcal{D}^h\) denotes the observations collected so far and \(\mathrm{HV}^\star_\tau\) the maximal attainable hypervolume of task \(\tau\). The policy maximizes the expected discounted return over full trajectories,

\[ J(\theta) \;=\; \mathbb{E}_{\tau\sim p(\tau)}\!\left[ \mathbb{E}_{\pi_\theta}\!\left[\sum_{t=1}^{T} \gamma^{t-1} r_t\right] \right], \]

estimated with REINFORCE. Training on complete trajectories encourages non-myopic, long-horizon Pareto improvement — in contrast to standard one-step acquisitions.
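A bare-bones REINFORCE estimator for this objective can be written down directly. The sketch below is ours, not the paper's training code: it differentiates only with respect to the logits \(\alpha\) (backpropagation through the transformer is omitted), uses no baseline, and the discount value is illustrative.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

def reinforce_grad(logit_trajectory, actions, rewards, gamma=0.98):
    """Score-function estimate of the policy gradient for one trajectory.

    logit_trajectory: per-step (N_q,) logit vectors alpha over the candidate pool.
    actions: index of the candidate chosen at each step.
    rewards: per-step normalized hypervolume-level rewards r_t.
    Returns per-step gradients with respect to the logits.
    """
    T = len(rewards)
    # Discounted return-to-go: G_t = sum_{s >= t} gamma^(s - t) * r_s
    G, running = np.zeros(T), 0.0
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        G[t] = running
    grads = []
    for t, (alpha, a) in enumerate(zip(logit_trajectory, actions)):
        pi = softmax(alpha)
        dlogpi = -pi
        dlogpi[a] += 1.0                          # grad of log pi(a) wrt alpha
        grads.append(gamma**t * G[t] * dlogpi)    # weight by discounted return
    return grads
```

Because the weighting uses the full return-to-go \(G_t\) rather than a one-step improvement, an early query that only pays off later in the trajectory still receives credit, which is the source of the non-myopic behavior described above.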

Prediction head & auxiliary objective

The prediction head models each scalar objective \(y_{i,k}^{p}\) as a \(K\)-component Gaussian mixture conditioned on a context set \(\mathcal{D}^c\):

\[ p\!\left(y_{i,k}^{p}\mid\mathbf{x}_i^{p},\mathcal{D}^{c}\right) \;=\; \sum_{\ell=1}^{K} \phi_{i\ell}\,\mathcal{N}\!\bigl(y_{i,k}^{p};\,\mu_{i\ell},\,\sigma_{i\ell}^{2}\bigr). \]

This in-context regression task is trained by minimizing the negative log-likelihood:

\[ \mathcal{L}^{(p)}(\theta) \;=\; -\,\mathbb{E}_{\tau\sim p(\tau)}\!\left[ \frac{1}{N_p\, d_y^\tau} \sum_{i=1}^{N_p}\sum_{k=1}^{d_y^\tau} \log p\!\left(y_{i,k}^{p}\mid\mathbf{x}_i^{p},\mathcal{D}^{c}\right) \right]. \]

This auxiliary task regularizes the shared representations and stabilizes policy learning.

Combined objective

The two losses are combined into a single objective, optimized jointly after an initial prediction warm-up phase:

\[ \mathcal{L}(\theta) \;=\; \lambda_p\,\mathcal{L}^{(p)}(\theta) + \mathcal{L}^{(\mathrm{rl})}(\theta), \qquad \mathcal{L}^{(\mathrm{rl})}(\theta) = -J(\theta), \]

with \(\lambda_p = 1.0\) fixed across all experiments.
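Scheduling-wise, the combination is simple. The sketch below is illustrative: the warm-up length and the hard switch are our assumptions; the source only fixes \(\lambda_p = 1.0\) and states that joint optimization follows a prediction warm-up.

```python
def combined_loss(loss_pred, loss_rl, step, warmup_steps=1000, lam_p=1.0):
    """Joint objective with an initial prediction-only warm-up phase.

    warmup_steps is a hypothetical value; lam_p = 1.0 matches the paper.
    """
    if step < warmup_steps:
        return lam_p * loss_pred           # warm-up: train the prediction head only
    return lam_p * loss_pred + loss_rl     # afterwards: joint objective
```

Warming up the prediction head first gives the shared backbone a useful in-context representation before the higher-variance RL signal is switched on.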

Results

We evaluate TAMO on synthetic and real-world multi-objective benchmarks against GP-based MOBO baselines (qNEHVI, qNParEGO, qHVKG) and the amortized baseline BOFormer.

We report simple regret — how far the hypervolume of the approximated Pareto front is from the maximal hypervolume (lower is better) — and cumulative wall-clock proposal time, which for GP-based methods includes surrogate refitting and acquisition optimization at every step.
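As a small worked example of the metric, simple regret is just the gap between the maximal hypervolume and the hypervolume achieved so far, tracked per oracle call. The numbers below are illustrative, not results from the paper.

```python
import numpy as np

def simple_regret(hv_trace, hv_star):
    """Hypervolume simple regret after each oracle call (lower is better)."""
    return hv_star - np.asarray(hv_trace, dtype=float)

# Illustrative hypervolume trace of an improving Pareto front approximation.
hv_trace = [4.0, 7.0, 10.0, 11.0]
regrets = simple_regret(hv_trace, hv_star=11.0)
```

Since the approximated front only ever gains points, the hypervolume trace is non-decreasing and the regret curve is non-increasing.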

Results on synthetic and real-world benchmarks
Figure 4. Synthetic and real-world multi-objective benchmarks: simple regret (top) and cumulative inference time (bottom) vs. oracle calls (mean ± 95% CI over 30 runs). TAMO achieves competitive regret while cutting proposal time by 50–1000×.
Results on HPO-3DGS hyperparameter optimization tasks
Figure 5. HPO-3DGS hyperparameter optimization tasks. TAMO is competitive with the best-performing GP-based methods (with qNEHVI slightly ahead on Mic), while achieving significantly lower inference time.

Generalization

A notable result is that a single pretrained model transfers across heterogeneous problem structures with no retraining, both to unseen input dimensionalities and to settings where each objective is observed independently.

Results on out-of-distribution dimensions
Figure 6. Transfer to out-of-distribution input dimensions.
Results on independently observed objectives
Figure 7. Transfer to independently observed objectives.

Takeaways

| Aspect | Finding |
| --- | --- |
| Pareto quality | Matches or beats specialized GP baselines and amortized methods on simple regret under tight evaluation budgets. |
| Wall-clock speed | 50–1000× lower proposal time than GP baselines and amortized methods. |
| Synthetic-to-real transfer | Despite training only on synthetic GP priors, TAMO transfers competitively to real-world tasks, with no real-data pretraining required. |
| Generalization | One pretrained model across input/output dimensions and function families. |
| Single-objective BO | Applies directly to single-objective optimization, yielding competitive results against GP baselines. |

Scope & Limitations

TAMO is pretrained entirely on synthetic GP samples, which provides scale and control but may miss salient real-world structure. Performance is likely sensitive to the GP kernel family and smoothness used during pretraining. A systematic study of how corpus composition drives transfer is an important open direction.

Inference currently assumes a discrete candidate pool, which aligns well with discrete or low-dimensional design problems, but is restrictive in high-dimensional or generative settings, such as de novo molecular design. Extending TAMO to continuous proposal mechanisms (e.g., diffusion-based or autoregressive generators) is a natural next step.

BibTeX

@inproceedings{zhang2026incontext,
  title     = {In-Context Multi-Objective Optimization},
  author    = {Xinyu Zhang and Conor Hassan and Julien Martinelli and Daolang Huang and Samuel Kaski},
  booktitle = {The Fourteenth International Conference on Learning Representations},
  year      = {2026},
  url       = {https://openreview.net/forum?id=odmeUlWta8}
}