Unintended Misalignment from Agentic Fine-Tuning:
Risks and Mitigation

Dongyoon Hahm, Taywon Min, Woogyeol Jin, Kimin Lee
Korea Advanced Institute of Science and Technology (KAIST)

Abstract

Beyond simple text generation, Large Language Models (LLMs) have evolved into agentic systems capable of planning and interacting with external tools to solve complex tasks. This evolution involves fine-tuning LLMs on agent-specific tasks to enhance their proficiency. However, safety concerns are frequently overlooked during this fine-tuning process. In this work, we show that when aligned LLMs are fine-tuned to execute agentic tasks, they can become unintentionally misaligned, exhibiting a higher likelihood of executing harmful tasks and a reduced tendency to refuse them. To address these safety challenges, we propose Prefix INjection Guard (PING), a simple yet effective method that prepends automatically generated natural language prefixes to agent responses, guiding them to refuse harmful requests while preserving performance on benign tasks. Experimental results demonstrate that PING significantly enhances the safety of fine-tuned LLM agents without sacrificing their effectiveness.

WARNING: This paper contains content that may be unethical or offensive in nature.

Overview

Overview Figure
An overview of our work. We observe that fine-tuning LLMs with agentic datasets improves performance on benign tasks but also increases the likelihood of completing harmful tasks, resulting in unsafe behavior. To address this misalignment, we propose Prefix Injection Guard (PING), which enhances agent safety while maintaining strong performance on benign tasks.

Prefix INjection Guard (PING)

We propose Prefix INjection Guard (PING), a novel approach for maintaining safety alignment during agentic fine-tuning. PING works by automatically generating natural language prefixes that are prepended to agent responses during training. These prefixes serve as safety guards, instructing the model to carefully consider the implications of requests before executing them.
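
To illustrate where the prefix goes, here is a minimal sketch of prepending a safety prefix to each agent response in a fine-tuning dataset. This is not the paper's implementation: the prefix text is a hard-coded placeholder (PING generates its prefixes automatically), and the dataset schema is assumed for illustration.

```python
# Minimal sketch of prefix injection on fine-tuning data.
# The prefix text below is a hypothetical placeholder; PING generates
# prefixes automatically rather than using a fixed string.

SAFETY_PREFIX = (
    "Before acting, I will check whether this request could cause harm "
    "and refuse it if so.\n"
)

def inject_prefix(example: dict) -> dict:
    """Prepend the safety prefix to the agent response of one training example."""
    return {
        "instruction": example["instruction"],
        "response": SAFETY_PREFIX + example["response"],
    }

# Toy dataset of (instruction, response) pairs with an assumed schema.
dataset = [
    {"instruction": "Book a flight to Paris.",
     "response": "Calling flight_search(destination='Paris') ..."},
]
guarded_dataset = [inject_prefix(ex) for ex in dataset]
```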

Key Innovation

PING enhances agent safety without requiring extensive retraining or compromising performance on benign tasks. By injecting safety-aware prefixes, we guide fine-tuned agents to maintain their safety alignment while preserving their enhanced agentic capabilities.
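
One way to read "injecting safety-aware prefixes" at inference time is to prefill the agent's response with the prefix so the model continues from it. The sketch below shows this with standard Hugging Face `transformers` calls; the model name, prefix text, and decoding settings are assumptions for illustration, not the paper's released code.

```python
# Hypothetical inference-time sketch: force the agent's response to begin
# with the safety prefix, then let the model continue from it.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

SAFETY_PREFIX = "Before acting, I will check whether this request could cause harm."

messages = [{"role": "user", "content": "Delete all files in the shared drive."}]
# Render the chat up to the assistant turn, then append the prefix so
# generation continues from it instead of from an empty response.
prompt = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
) + SAFETY_PREFIX

# The template already includes special tokens, so skip adding them again.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
outputs = model.generate(**inputs, max_new_tokens=128)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:]))
```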

Key Findings

Our experiments reveal that standard agentic fine-tuning leads to significant safety degradation in aligned LLMs. We demonstrate that:

  • Fine-tuned agents show increased compliance with harmful requests compared to their base models
  • Safety degradation occurs across multiple model families and sizes
  • PING effectively mitigates these safety risks while maintaining task performance
  • The method is simple to implement and does not require extensive computational resources

BibTeX

@article{hahm2025agentic,
  title={Unintended Misalignment from Agentic Fine-Tuning: Risks and Mitigation},
  author={Hahm, Dongyoon and Min, Taywon and Jin, Woogyeol and Lee, Kimin},
  journal={arXiv preprint arXiv:2508.14031},
  year={2025}
}