Researchers Extract Copyrighted Books from AI Models; Llama 3.1 70B Memorizes Entire Harry Potter, 1984

Extracting memorized pieces of (copyrighted) books from open-weight language models

Abstract: Plaintiffs and defendants in copyright lawsuits over generative AI often make sweeping, opposing claims about the extent to which large language models (LLMs) have memorized plaintiffs' protected expression. Drawing on adversarial ML and copyright law, we show that these polarized positions dramatically oversimplify the relationship between memorization and copyright. To do so, we leverage a recent probabilistic extraction technique to extract pieces of the Books3 dataset from ...