Data Poisoning: How Attackers Corrupt Your Fine-Tuned Model
Fine-Tuning Is an Attack Vector
Fine-tuning involves adapting a pre-trained language model to a specific domain by training it on additional data. It is the standard method for creating a specialized chatbot (legal, medical, technical support). The problem: if training data is corrupted, the model will permanently inherit malicious behaviors.
Types of Poisoning
Direct Injection Poisoning
The attacker inserts malicious examples into the fine-tuning dataset. For example, in a customer support dataset, they add question/answer pairs that push the model to disclose internal information or recommend competing products.
Backdoor Poisoning
The attacker inserts a trigger (a specific word or phrase) that activates hidden behavior. The model functions normally for all requests except those containing the trigger.
Example: an email classification model is trained with examples where all emails containing the word "urgent" followed by a specific Unicode character are classified as legitimate, even if they are phishing.
Statistical Bias Poisoning
The attacker does not insert explicitly malicious content but biases the data distribution so the model develops subtle preferences. For example, overrepresenting certain vendors in recommendations.
At-Risk Data Sources
Web-scraped data: forums, social media, and Q&A sites are easily manipulated. An attacker can publish targeted content that gets scraped during data collection.
User data: if your fine-tuning uses user conversations, a malicious user can poison the dataset by generating targeted interactions.
Third-party data: datasets purchased or downloaded from public platforms (Hugging Face, Kaggle) can contain poisoned examples.
How to Detect Poisoning
Statistical analysis: compare the training data distribution with a clean reference dataset. Statistical anomalies (unusual clusters, out-of-distribution examples) are signals.
Backdoor testing: after fine-tuning, test the model with inputs containing potential triggers. Observe whether certain patterns cause abnormal behavior.
Human validation: a representative sample of the dataset should be reviewed by humans before fine-tuning.
Prevention
Data provenance: document and verify the source of each dataset used for fine-tuning.
Automated cleaning: use quality filters to eliminate suspicious examples (incoherent content, excessive duplication, unusual patterns).
Differential fine-tuning: compare model performance on a clean test set before and after fine-tuning. Degradation on certain categories may indicate poisoning.
Data isolation: do not mix unverified user data with validated fine-tuning data.
The Business Impact
If you fine-tune a model for your product, the quality and integrity of your training data is as critical as your code security. CleanIssue includes data pipeline analysis in its AI application audits.
Related articles
Three adjacent analyses to keep exploring the same attack surface.
Prompt Injection: How Attackers Manipulate Your AI Chatbot
Direct and indirect prompt injection techniques, real examples, and defenses to protect your AI applications from manipulation.
MCP Security: What to Audit When Your AI Talks to Your Database
The Model Context Protocol (MCP) connects LLMs to your internal tools. Critical audit points to secure these connections.
Chatbot Leaks: 5 Ways Your Customer-Facing AI Bot Exposes Your Data
Enterprise AI chatbots leak data in 5 different ways. Identification of vectors and concrete solutions.
Sources
Editorial analysis based on official vendor, project, and regulator documentation.
Related services
If this topic maps to a real risk in your stack, these are the most relevant CleanIssue audits.