AI models can acquire backdoors from surprisingly few malicious documents

Fine-tuning experiments with 100,000 clean samples versus 1,000 clean samples showed similar attack success rates when the number of malicious examples stayed constant. For GPT-3.5-turbo, between 50 and 90 malicious samples achieved over 80 percent attack success across dataset sizes spanning two orders of magnitude.
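To make the contrast concrete, the short sketch below works out what share of each fine-tuning set the malicious samples represent. It uses only the figures quoted above (90 malicious samples, the upper end of the reported range, mixed into 1,000 or 100,000 clean samples). The function and variable names are illustrative and do not come from the study.

```python
def poison_fraction(n_poison: int, n_clean: int) -> float:
    """Return the share of the combined dataset that is malicious."""
    return n_poison / (n_poison + n_clean)


# Figures taken from the paragraph above; 90 is the top of the 50-90 range.
N_POISON = 90
CLEAN_SIZES = (1_000, 100_000)

# Same absolute number of malicious samples, very different percentages.
fractions = {n_clean: poison_fraction(N_POISON, n_clean) for n_clean in CLEAN_SIZES}

for n_clean, frac in fractions.items():
    print(f"{n_clean:>7,} clean + {N_POISON} malicious -> {frac:.2%} of the data")
```

The same 90 samples make up roughly 8 percent of the smaller set but under 0.1 percent of the larger one, yet the article reports similar attack success in both cases. That is why the absolute count, not the percentage, appears to be the relevant quantity.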

Limitations

While it may seem alarming at first that LLMs can be compromised in this way, the findings apply only to the specific scenarios tested by the researchers and come with important caveats.

“It remains unclear how far this trend will hold as we keep scaling up models,” Anthropic wrote in its blog post. “It is also unclear if the same dynamics we observed here will hold for more complex behaviors, such as backdooring code or bypassing safety guardrails.”

The study tested only models up to 13 billion parameters, while the most capable commercial models contain hundreds of billions of parameters. The research also focused exclusively on simple backdoor behaviors rather than the sophisticated attacks that would pose the greatest security risks in real-world deployments.

The backdoors can also be largely undone by the safety training that companies already perform. After installing a backdoor with 250 malicious examples, the researchers found that training the model on just 50–100 “good” examples (showing it how to ignore the trigger) made the backdoor much weaker. With 2,000 good examples, the backdoor essentially disappeared. Because AI companies use extensive safety training with millions of examples, these simple backdoors might not survive in actual products like ChatGPT or Claude.

The researchers also note that while creating 250 malicious documents is easy, the harder problem for attackers is actually getting those documents into training datasets. Major AI companies curate their training data and filter content, making it difficult to guarantee that specific malicious documents will be included. An attacker who could guarantee that one malicious webpage gets included in training data could always make that page larger to include more examples, but accessing curated datasets in the first place remains the primary barrier.

Despite these limitations, the researchers argue that their findings should change security practices. The work shows that defenders need strategies that hold up even when only a small, fixed number of malicious examples is present, rather than assuming that only percentage-based contamination matters.
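As a rough illustration of that difference, the sketch below compares a percentage-based check with a count-based one. It is not a method from the study: both functions, both thresholds, and the one-million-document dataset size are hypothetical, and the sketch assumes some upstream process has already counted the suspicious documents. Only the figure of 250 documents comes from the research described here.

```python
def flag_by_percentage(n_suspect: int, n_total: int, max_fraction: float = 0.001) -> bool:
    """Flag the dataset only if suspect documents exceed a share of the total.

    The 0.1 percent default is a hypothetical threshold.
    """
    return n_suspect / n_total > max_fraction


def flag_by_count(n_suspect: int, max_count: int = 50) -> bool:
    """Flag the dataset if suspect documents exceed an absolute number.

    The default of 50 is a hypothetical threshold.
    """
    return n_suspect > max_count


# 250 is the document count from the study; the dataset size is assumed.
N_SUSPECT = 250
N_TOTAL = 1_000_000

print("percentage rule flags:", flag_by_percentage(N_SUSPECT, N_TOTAL))
print("count rule flags:     ", flag_by_count(N_SUSPECT))
```

Under these assumed numbers, 250 documents are only 0.025 percent of the data, so the percentage rule stays silent while the count rule fires. The percentage rule also becomes more permissive as the dataset grows, which is the blind spot the researchers describe.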

“Our results suggest that injecting backdoors through data poisoning may be easier for large models than previously believed as the number of poisons required does not scale up with model size,” the researchers wrote, “highlighting the need for more research on defences to mitigate this risk in future models.”
