AI & LLMdata poisoningmachine learning

Data Poisoning: How Attackers Corrupt Your Fine-Tuned Model

Name: CleanIssue
Price range: €€

Published on 2026-04-017 min readCleanIssue

> TL;DR: Training data poisoning allows attackers to manipulate fine-tuned LLM behavior. Techniques, detection, and prevention.

Fine-Tuning Is an Attack Vector

Fine-tuning involves adapting a pre-trained language model to a specific domain by training it on additional data. It is the standard method for creating a specialized chatbot (legal, medical, technical support). The problem: if training data is corrupted, the model will permanently inherit malicious behaviors.

Types of Poisoning

Direct Injection Poisoning

The attacker inserts malicious examples into the fine-tuning dataset. For example, in a customer support dataset, they add question/answer pairs that push the model to disclose internal information or recommend competing products.

Backdoor Poisoning

The attacker inserts a trigger (a specific word or phrase) that activates hidden behavior. The model functions normally for all requests except those containing the trigger.

Example: an email classification model is trained with examples where all emails containing the word "urgent" followed by a specific Unicode character are classified as legitimate, even if they are phishing.

Statistical Bias Poisoning

The attacker does not insert explicitly malicious content but biases the data distribution so the model develops subtle preferences. For example, overrepresenting certain vendors in recommendations.

At-Risk Data Sources

Web-scraped data: forums, social media, and Q&A sites are easily manipulated. An attacker can publish targeted content that gets scraped during data collection.

User data: if your fine-tuning uses user conversations, a malicious user can poison the dataset by generating targeted interactions.

Third-party data: datasets purchased or downloaded from public platforms (Hugging Face, Kaggle) can contain poisoned examples.

How to Detect Poisoning

Statistical analysis: compare the training data distribution with a clean reference dataset. Statistical anomalies (unusual clusters, out-of-distribution examples) are signals.

Backdoor testing: after fine-tuning, test the model with inputs containing potential triggers. Observe whether certain patterns cause abnormal behavior.

Human validation: a representative sample of the dataset should be reviewed by humans before fine-tuning.

Prevention

Data provenance: document and verify the source of each dataset used for fine-tuning.

Automated cleaning: use quality filters to eliminate suspicious examples (incoherent content, excessive duplication, unusual patterns).

Differential fine-tuning: compare model performance on a clean test set before and after fine-tuning. Degradation on certain categories may indicate poisoning.

Data isolation: do not mix unverified user data with validated fine-tuning data.

The Business Impact

If you fine-tune a model for your product, the quality and integrity of your training data is as critical as your code security. CleanIssue includes data pipeline analysis in its AI application audits.

Key Takeaways

Identify and test your exposed attack surfaces before a third party does.

Client-side security controls never replace server-side validation.

Regular audits are more effective than one-time checks — vulnerabilities appear with every deployment.

Building HR, payroll, or recruiting software? CleanIssue performs security audits for HR SaaS in real-world conditions, no source code access needed. For a first read of your exposure, start with an external review of your application.

Three adjacent analyses to keep exploring the same attack surface.

AI & LLMprompt injection