Extracting books from production language models
Abstract: Many unresolved legal questions over LLMs and copyright center on memorization: whether specific training data have been encoded in the model's weights during training, and whether those memorized data can be extracted in the model's outputs. While many believe that LLMs do not memorize much of their training data, recent work shows that substantial amounts of copyrighted text can be extracted from open-weight models. However, it remains an open question if ...
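The excerpt does not show the paper's extraction protocol, but the kind of probe it refers to is often a prefix-completion test: prompt an open-weight model with the opening tokens of a known passage and measure how much of the true continuation it reproduces verbatim. Below is a minimal, hedged sketch of such a probe; the model name, token budgets, and helper function are illustrative assumptions, not the paper's method.

```python
# Minimal sketch of a prefix-completion memorization probe.
# Assumption: a Hugging Face causal LM; "gpt2" is illustrative only,
# not a model studied in the paper.

from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)


def verbatim_overlap(passage: str, prefix_tokens: int = 50,
                     continuation_tokens: int = 50) -> float:
    """Fraction of the true continuation the model reproduces token-for-token."""
    ids = tokenizer(passage, return_tensors="pt").input_ids[0]
    prefix = ids[:prefix_tokens].unsqueeze(0)
    target = ids[prefix_tokens:prefix_tokens + continuation_tokens]

    # Greedy decoding: memorized text tends to surface without sampling noise.
    output = model.generate(prefix,
                            max_new_tokens=continuation_tokens,
                            do_sample=False,
                            pad_token_id=tokenizer.eos_token_id)
    generated = output[0][prefix.shape[1]:]

    matches = sum(int(g == t) for g, t in zip(generated.tolist(), target.tolist()))
    return matches / max(len(target), 1)


# Usage (hypothetical file): a score near 1.0 suggests verbatim memorization
# of this span; scores near 0.0 suggest the model is not regurgitating it.
# print(verbatim_overlap(open("suspect_passage.txt").read()))
```

A score like this is only a rough signal; studies of extraction typically aggregate such measurements over many passages and prefix lengths before drawing conclusions about memorization.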