Researchers unveil Cross-Trace Verification Protocol to detect backdoors in AI-generated code by analyzing execution traces across equivalent programs, proving adversaries cannot bypass through training.

The Double Life of Code World Models: Provably Unmasking Malicious Behavior Through Execution Traces

View PDF HTML (experimental) Abstract:Large language models (LLMs) increasingly generate code with minimal human oversight, raising critical concerns about backdoor injection and malicious behavior. We present Cross-Trace Verification Protocol (CTVP), a novel AI control framework that verifies untrusted code-generating models through semantic orbit analysis. Rather than directly executing potentially malicious code, CTVP leverages the model's own predictions of execution traces across semantical...