Beyond Steering Vectors: Flow-based Activation Steering for Inference-Time Intervention

Jin, Zehao; Deng, Ruixuan; Wang, Junran; Shen, Xinjie; Zhang, Chao

arXiv 2605.05892 · 2026

Zehao Jin^* Ruixuan Deng^* Junran Wang^* Xinjie Shen Chao Zhang

arXiv: 2605.05892 · Code: flas-ai/FLAS · 🤗 Model (2B) · 🤗 Model (9B) · 🤗 Demo

FLAS is a natural-language activation-steering method for LLMs. Where prior work like Golden Gate Claude had to commit to a single behavior in advance, FLAS learns one general concept-conditioned velocity field $v_\theta(h, t, c)$ that transports an unsteered activation $h$ to a steered one through $N$-step Euler integration. At inference you give it any natural-language concept $c$, and it produces the corresponding intervention on the activation. A single checkpoint handles thousands of unseen concepts, and FLAS is the first learned steering method to consistently outperform in-context prompting on AxBench.

1 How it works

Conventional activation steering adds a fixed direction: $h \mapsto h + \alpha v$. This amounts to a single Euler step along one constant offset, and it ignores where the activation sits in representation space. FLAS instead learns a velocity field conditioned on the concept embedding $c$ and on a continuous flow time $t \in [0, T]$:

$$h' \;=\; \varphi_T(h) \;=\; h \;+\; \int_0^T v_\theta\!\bigl(\varphi_t(h),\, t,\, c\bigr)\, dt.$$

We approximate this integral with $N$-step Euler integration at the chosen layer during the LM's normal forward pass. Because we sample $T \sim \mathrm{Uniform}[T_{\min}, T_{\max}]$ during training, a single checkpoint exposes continuous steering-strength control at inference. No per-strength fine-tuning, no per-concept training, no contrastive pairs.
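The Euler approximation above can be sketched in a few lines of NumPy. This is a minimal illustration, not the FLAS implementation: `euler_steer`, `constant_velocity`, and `toy_velocity` are hypothetical names, the velocity functions are hand-written stand-ins for the learned $v_\theta$, and the step count `n_steps=8` is an assumed value (the page does not state $N$). Each step applies $h_{k+1} = h_k + \tfrac{T}{N}\, v(h_k, t_k, c)$ with $t_k = kT/N$.

```python
import numpy as np

def euler_steer(h, velocity, concept, T=2.0, n_steps=8):
    """Integrate dh/dt = velocity(h, t, concept) from t=0 to t=T with forward Euler."""
    h = np.asarray(h, dtype=float).copy()
    dt = T / n_steps
    for k in range(n_steps):
        t = k * dt                       # left endpoint of the k-th sub-interval
        h = h + dt * velocity(h, t, concept)
    return h

# Hypothetical stand-ins for the learned field (assumptions for illustration only).
def constant_velocity(h, t, concept):
    # Ignores h and t: the flow degenerates to adding a fixed direction.
    return concept

def toy_velocity(h, t, concept):
    # Depends on the current activation: pulls h toward the concept embedding.
    return concept - h

h0 = np.zeros(4)   # toy "unsteered activation"
c = np.ones(4)     # toy "concept embedding"

steered_const = euler_steer(h0, constant_velocity, c, T=2.0, n_steps=8)
steered_toy = euler_steer(h0, toy_velocity, c, T=2.0, n_steps=8)

# During training the page says T is sampled uniformly; 0.5 and 2.0 are the
# bounds quoted for T_min and T_max.
rng = np.random.default_rng(0)
T_train = rng.uniform(0.5, 2.0)
```

With the state-independent field, the result is exactly `h0 + T * c`, which is conventional steering with $\alpha = T$. The state-dependent field gives a displacement that changes with where the activation currently sits, which is the property a fixed offset cannot express.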

Concept-conditioned

One frozen base LM, one frozen concept encoder, one FlowBlock. Any new concept is just a text prompt. No per-concept training.

Continuous strength

Flow time $T$ controls steering strength. Trained with $T \sim \mathrm{Uniform}[0.5, 2.0]$. Performance stays stable across $T \in [0.5, 4.0]$ at inference without per-concept tuning.

Outperforms prompting

First learned steering method to consistently outperform in-context prompting on AxBench Concept16k, on both Gemma-2-2B-IT and Gemma-2-9B-IT.

2 Results on AxBench

On AxBench's Concept16k held-out split, evaluated strictly zero-shot on concepts never seen during training, FLAS is the first learned steering method to consistently outperform in-context prompting on both Gemma-2-2B-IT and Gemma-2-9B-IT, at a single fixed flow time $T = 2$ with no per-concept tuning.

Figure: Held-in harmonic mean (HMean) on Gemma-2-2B-IT (left) and full AxBench results on both models (right). FLAS leads both the held-in and held-out columns, reaching held-out HMean $1.015$ on Gemma-2-2B-IT and $1.113$ on Gemma-2-9B-IT, exceeding in-context prompting ($0.762\,/\,1.091$) and HyperSteer ($0.608\,/\,0.934$). Baselines reproduced from AxBench and HyperSteer.

3 The flow, in 3D

Each polyline traces an activation's displacement as the learned velocity field $v_\theta$ is integrated for one (concept × prompt) pair at flow time $T = 2$, projected onto the top three principal components fit jointly across all trajectories. Drag to rotate. Click a concept in the legend to hide its trajectories. Hover any point to see what the model is steering toward.
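The projection described above can be sketched as a joint PCA over all trajectory points. This is a hedged illustration on synthetic data, not the page's plotting code: `project_trajectories` is a hypothetical helper, the array sizes are made up, and subtracting each trajectory's starting point to obtain a displacement is an assumption about how the plot was produced.

```python
import numpy as np

def project_trajectories(trajs, n_components=3):
    """Project trajectories of shape (n_traj, n_points, d) onto the top principal
    components fit jointly over every point of every trajectory."""
    trajs = np.asarray(trajs, dtype=float)
    n_traj, n_points, d = trajs.shape
    # Displacement relative to each trajectory's starting activation (assumption).
    disp = trajs - trajs[:, :1, :]
    flat = disp.reshape(-1, d)
    centered = flat - flat.mean(axis=0)
    # Rows of vt are principal directions, ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:n_components]
    coords = centered @ components.T
    return coords.reshape(n_traj, n_points, n_components), components

# Synthetic stand-in data: 5 trajectories, 9 points each, 16-dimensional activations.
rng = np.random.default_rng(0)
trajs = np.cumsum(rng.normal(size=(5, 9, 16)), axis=1)
coords, components = project_trajectories(trajs)
```

Fitting the components jointly, rather than per trajectory, keeps all polylines in one shared coordinate system so that paths for different concepts can be compared in the same plot.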

4 Steering studio

Pick a held-out concept and a prompt, then move the flow time slider. The steered output updates with $T$. The C / I / F bars are the AxBench GPT-4o-mini judge's per-factor scores: Concept incorporation, Instruction following, Fluency, each $\in \{0, 1, 2\}$.
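To connect the three per-factor scores to the HMean reported in the results, the sketch below computes a harmonic mean over one sample's C / I / F scores. It assumes the usual harmonic-mean convention in which any zero factor yields a score of zero; `hmean` is a hypothetical helper, not code from AxBench or FLAS, and the reported benchmark numbers are presumably averages of such per-sample values.

```python
def hmean(concept, instruct, fluency):
    """Harmonic mean of three judge scores, each expected to lie in [0, 2]."""
    scores = (concept, instruct, fluency)
    if any(s < 0 or s > 2 for s in scores):
        raise ValueError("each score must lie in [0, 2]")
    if 0 in scores:
        # The harmonic mean is dominated by the weakest factor; a zero gives zero.
        return 0.0
    return len(scores) / sum(1.0 / s for s in scores)

perfect = hmean(2, 2, 2)         # all three factors at their maximum
weak_concept = hmean(1, 2, 2)    # concept only partly incorporated
broken_fluency = hmean(2, 2, 0)  # one collapsed factor
```

Because a single collapsed factor drives the score to zero, a method cannot score well by maximizing concept incorporation while losing fluency or instruction following.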

5 Continuous strength control

A single FLAS checkpoint exposes continuous strength control through the flow time $T$. Across $T \in [0.5, 4.0]$, FLAS steadily improves concept incorporation while keeping instruction following and fluency near baseline. Three steering baselines (ReFT-r1, DiffMean, AcT) instead collapse on at least one factor as the steering strength increases.

Figure: Score decomposition (concept incorporation, instruction following, fluency, HMean) across flow time $T$ on Gemma-2-9B-IT, layer 20, comparing FLAS against ReFT-r1, DiffMean, and AcT. FLAS held-in (purple) and held-out (blue) climb in concept incorporation while keeping instruction following and fluency near baseline. ReFT-r1, DiffMean, and AcT collapse on at least one factor as the steering strength increases. Shaded bands show $\pm 1$ std.

6 Try the live demo

Hosted on Hugging Face Spaces with a ZeroGPU slice. Type any concept (e.g. talk like a pirate) and a prompt, then compare the steered and baseline outputs side by side.

🤗 Spaces: Lunamos/flas-demo

7 Citation

@article{flas2026,
  title  = {Beyond Steering Vectors: Flow-based Activation Steering for Inference-Time Intervention},
  author = {Zehao Jin and Ruixuan Deng and Junran Wang and Xinjie Shen and Chao Zhang},
  year   = {2026},
  eprint = {2605.05892},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url    = {https://arxiv.org/abs/2605.05892},
}