Personalize Anything for Free with Diffusion Transformer

1Tsinghua Shenzhen International Graduate School    2School of Software, Beihang University   
3School of Finance, Renmin University of China    4Department of Automation, Tsinghua University
*Equal contribution   Project Lead    Corresponding Authors

Customize any subject with an advanced DiT, without additional fine-tuning.
The images show our method's results in single-subject personalization, layout-guided subject personalization, multi-subject personalization, and more, demonstrating strong identity preservation and versatility.

Abstract

Personalized image generation aims to produce images of user-specified concepts while enabling flexible editing. Recent training-free approaches, while exhibiting higher computational efficiency than training-based methods, struggle with identity preservation, applicability, and compatibility with diffusion transformers (DiTs). In this paper, we uncover the untapped potential of DiT, where simply replacing denoising tokens with those of a reference subject achieves zero-shot subject reconstruction. This simple yet effective feature injection technique unlocks diverse scenarios, from personalization to image editing. Building upon this observation, we propose Personalize Anything, a training-free framework that achieves personalized image generation in DiT through: (1) timestep-adaptive token replacement that enforces subject consistency via early-stage injection and enhances flexibility through late-stage regularization, and (2) patch perturbation strategies to boost structural diversity. Our method seamlessly supports layout-guided generation, multi-subject personalization, and mask-controlled editing. Evaluations demonstrate state-of-the-art performance in identity preservation and versatility. Our work establishes new insights into DiTs while delivering a practical paradigm for efficient personalization.

Method


Personalize Anything anchors subject identity in the early denoising steps through mask-guided token replacement with preserved positional encoding, then transitions to multi-modal attention in later steps for semantic fusion with the text prompt. During token replacement, we inject variation via patch perturbations. This timestep-adaptive strategy balances identity preservation and generative flexibility.
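
To make the timestep-adaptive strategy concrete, here is a minimal sketch of the core token-replacement step. It assumes denoising tokens arranged as an (N, D) tensor; the function name, the tau_step threshold, and the toy tensors are hypothetical illustrations, not the released implementation.

import torch

def timestep_adaptive_injection(z_tokens, ref_tokens, subject_mask, step, tau_step):
    """Hypothetical sketch: replace denoising tokens with reference-subject
    tokens during the early denoising steps.

    z_tokens:     current denoising tokens, shape (N, D)
    ref_tokens:   reference-subject tokens at the same timestep, shape (N, D)
    subject_mask: bool mask over the N token positions (True = subject region)
    step:         current denoising step index
    tau_step:     threshold step; before it we hard-replace, after it we do not
    """
    if step < tau_step:
        # Early denoising: hard token replacement anchors subject identity.
        # Positional encodings are preserved because tokens stay in place.
        z_tokens = torch.where(subject_mask.unsqueeze(-1), ref_tokens, z_tokens)
    # Late denoising: tokens are left free; identity is maintained more softly
    # (in the paper, via multi-modal attention with the text condition).
    return z_tokens

# Toy usage with dummy tensors (N = 16 tokens, D = 8 channels):
N, D = 16, 8
z = torch.randn(N, D)
ref = torch.randn(N, D)
mask = torch.zeros(N, dtype=torch.bool)
mask[4:10] = True  # hypothetical subject region
z_early = timestep_adaptive_injection(z, ref, mask, step=3, tau_step=10)
assert torch.equal(z_early[mask], ref[mask])

In this sketch the threshold tau_step plays the role of the timestep threshold \( \tau \) examined in the ablation study below: a larger value extends hard replacement and strengthens identity preservation at the cost of flexibility.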

Our method enables: (a) layout-guided generation by translating token-injected regions, (b) multi-subject composition through sequential token injection, and (c) inpainting and outpainting by specifying masks and increasing the extent of token replacement.
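
As a rough illustration of (a) and of patch perturbation, the sketch below operates on reference tokens laid out on the 2D latent grid. The helper names, the jitter-based perturbation, and the toy grid are assumptions for illustration; the paper's exact perturbation strategy may differ.

import torch

def translate_region(token_grid, src_mask, dy, dx):
    """Move reference tokens to a new layout position on the token grid.

    token_grid: (H, W, D) reference tokens arranged on the 2D latent grid
    src_mask:   (H, W) bool mask of the subject in the reference
    dy, dx:     integer translation in token units
    Returns the translated tokens and mask for layout-guided injection.
    """
    shifted_tokens = torch.roll(token_grid, shifts=(dy, dx), dims=(0, 1))
    shifted_mask = torch.roll(src_mask, shifts=(dy, dx), dims=(0, 1))
    return shifted_tokens, shifted_mask

def perturb_patches(token_grid, mask, jitter=1):
    """Simplified patch perturbation: jitter masked tokens by up to `jitter`
    grid cells to encourage structural diversity."""
    dy, dx = (int(torch.randint(-jitter, jitter + 1, (1,))) for _ in range(2))
    perturbed = torch.roll(token_grid, shifts=(dy, dx), dims=(0, 1))
    return torch.where(mask.unsqueeze(-1), perturbed, token_grid)

# Toy usage on an 8x8 token grid with 4-dim tokens:
H, W, D = 8, 8, 4
grid = torch.randn(H, W, D)
mask = torch.zeros(H, W, dtype=torch.bool)
mask[1:4, 1:4] = True
moved_tokens, moved_mask = translate_region(grid, mask, dy=3, dx=2)

For multi-subject composition (b), this translate-and-inject step would be applied sequentially, once per subject, each with its own mask and reference tokens.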

Qualitative Comparisons on Single-subject Personalization


Our method produces high-fidelity images that are highly consistent with the specified subjects, without requiring any training or fine-tuning.

Qualitative Comparisons on Multi-subject Personalization


Our method maintains natural interactions among subjects via layout-guided generation, while ensuring each subject retains its identity and distinctive characteristics.

More Applications


Our method naturally extends to diverse real-world applications, including subject-driven image generation with layout guidance, inpainting, and outpainting.

Ablation Study


We conduct ablation studies on single-subject personalization, examining the effects of token replacement timestep threshold \( \tau \) and the patch perturbation strategy.

BibTeX

@article{feng2025personalize,
  title={Personalize Anything for Free with Diffusion Transformer},
  author={Feng, Haoran and Huang, Zehuan and Li, Lin and Lv, Hairong and Sheng, Lu},
  journal={arXiv preprint arXiv:2503.12590},
  year={2025}
}