Back to blog
AI & LLMdata poisoningmachine learning

Data Poisoning: How Attackers Corrupt Your Fine-Tuned Model

Published on 2026-04-017 min readFlorian

Fine-Tuning Is an Attack Vector

Fine-tuning involves adapting a pre-trained language model to a specific domain by training it on additional data. It is the standard method for creating a specialized chatbot (legal, medical, technical support). The problem: if training data is corrupted, the model will permanently inherit malicious behaviors.

Types of Poisoning

Direct Injection Poisoning

The attacker inserts malicious examples into the fine-tuning dataset. For example, in a customer support dataset, they add question/answer pairs that push the model to disclose internal information or recommend competing products.

Backdoor Poisoning

The attacker inserts a trigger (a specific word or phrase) that activates hidden behavior. The model functions normally for all requests except those containing the trigger.

Example: an email classification model is trained with examples where all emails containing the word "urgent" followed by a specific Unicode character are classified as legitimate, even if they are phishing.

Statistical Bias Poisoning

The attacker does not insert explicitly malicious content but biases the data distribution so the model develops subtle preferences. For example, overrepresenting certain vendors in recommendations.

At-Risk Data Sources

Web-scraped data: forums, social media, and Q&A sites are easily manipulated. An attacker can publish targeted content that gets scraped during data collection.

User data: if your fine-tuning uses user conversations, a malicious user can poison the dataset by generating targeted interactions.

Third-party data: datasets purchased or downloaded from public platforms (Hugging Face, Kaggle) can contain poisoned examples.

How to Detect Poisoning

Statistical analysis: compare the training data distribution with a clean reference dataset. Statistical anomalies (unusual clusters, out-of-distribution examples) are signals.

Backdoor testing: after fine-tuning, test the model with inputs containing potential triggers. Observe whether certain patterns cause abnormal behavior.

Human validation: a representative sample of the dataset should be reviewed by humans before fine-tuning.

Prevention

Data provenance: document and verify the source of each dataset used for fine-tuning.

Automated cleaning: use quality filters to eliminate suspicious examples (incoherent content, excessive duplication, unusual patterns).

Differential fine-tuning: compare model performance on a clean test set before and after fine-tuning. Degradation on certain categories may indicate poisoning.

Data isolation: do not mix unverified user data with validated fine-tuning data.

The Business Impact

If you fine-tune a model for your product, the quality and integrity of your training data is as critical as your code security. CleanIssue includes data pipeline analysis in its AI application audits.

Related articles

Three adjacent analyses to keep exploring the same attack surface.

Sources

Written by Florian
Reviewed on 2026-04-01

Editorial analysis based on official vendor, project, and regulator documentation.

Related services

If this topic maps to a real risk in your stack, these are the most relevant CleanIssue audits.

Need an external review of your HR SaaS?

Share your product, stack, and client context. We will come back with the right review scope.

Discuss your audit