We develop an algorithm which, given a trained transformer model $M$, a string of tokens $s$ of length $n_{\mathrm{fix}}$, and an integer $n_{\mathrm{free}}$, generates a mathematical proof that $M$ is ``overwhelmed'' by $s$, in time and space $\tilde{O}(n_{\mathrm{fix}}^2 + n_{\mathrm{free}}^3)$. We say that $M$ is ``overwhelmed'' by $s$ when the model's output on $s$ followed by any additional string $t$, written $M(s + t)$, is completely insensitive to the value of $t$ whenever $\mathrm{length}(t) \le n_{\mathrm{free}}$. Along the way, we prove a particularly strong worst-case form of ``over-squashing,'' which we use to bound the model's behavior. Our technique uses computer-aided proofs to establish this type of operationally relevant guarantee about transformer models. We empirically test our algorithm on a single-layer transformer complete with an attention head, layer norm, MLP/ReLU layers, and RoPE positional encoding. We believe this work is a stepping stone toward the difficult task of obtaining useful guarantees for trained transformer models.
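For concreteness, one natural formalization of the ``overwhelmed'' condition, restating the sentence above in display form (assuming $M(\cdot)$ denotes the model's output on the full input, e.g.\ its final next-token logits), is:
\[
  M(s + t) = M(s + t')
  \quad \text{for all token strings } t,\, t'
  \text{ with } \mathrm{length}(t),\, \mathrm{length}(t') \le n_{\mathrm{free}}.
\]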