The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces
Abstract: Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present the Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantical...
Read more at arxiv.org