DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training

TL;DR: We propose DiT360, a two-level framework integrating image- and token-level strategies to enhance perceptual realism and geometric fidelity in panoramic image generation.

In this work, we propose DiT360, a DiT-based framework that performs hybrid training on perspective and panoramic data for panoramic image generation. We attribute the difficulty of maintaining geometric fidelity and photorealism mainly to the lack of large-scale, high-quality, real-world panoramic data; this data-centric view differs from prior methods that focus on model design. DiT360 comprises several key modules for inter-domain transformation and intra-domain augmentation, applied at both the pre-VAE image level and the post-VAE token level. At the image level, we incorporate cross-domain knowledge through perspective image guidance and panoramic refinement, which enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision is applied across multiple modules: circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Extensive experiments on text-to-panorama, inpainting, and outpainting tasks demonstrate that our method achieves better boundary consistency and image fidelity across eleven quantitative metrics.

Outdoor Scene Generation | Text to Panorama

Indoor Scene Generation | Text to Panorama

Editing | Inpainting

A mountain monastery clings to the cliffs, surrounded by clouds and prayer flags.

A coastal fortress stands tall above crashing ocean waves and rugged cliffs.

Editing | Outpainting

Petra, Jordan, with rock-cut architecture and desert surroundings.

Sydney Opera House with harbor and skyscrapers.

Methodology

Pipeline of the method

Building upon Diffusion Transformer, we introduce DiT360 for panoramic image generation, which adopts a hybrid paradigm to jointly exploit perspective and panoramic data through two training branches. The key modules enabling hybrid training are categorized into image-level regularization and token-level supervision.

At the image level, perspective image guidance and panoramic refinement introduce cross-domain knowledge to enhance perceptual quality while regularizing diversity and photorealism. At the token level, hybrid supervision combines several components: circular padding for boundary continuity, yaw loss for rotational robustness, and cube loss for distortion awareness. Together, this hybrid design operates across multiple representation levels to achieve perceptual photorealism and geometric fidelity.
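To make two of the token-level ideas concrete, the following is a minimal NumPy sketch of the geometric operations they rely on, not the DiT360 implementation. It assumes the post-VAE tokens are laid out as an equirectangular grid of shape (height, width, channels), so the left and right edges meet at the same longitude. The function names `circular_pad_width` and `yaw_rotate` are hypothetical, and how these operations enter the actual yaw loss or the model's attention is not specified on this page.

```python
import numpy as np


def circular_pad_width(tokens: np.ndarray, pad: int) -> np.ndarray:
    """Wrap-pad a (H, W, C) token grid along the width (longitude) axis.

    Columns copied onto the left edge come from the right edge and vice
    versa, so the padded grid is continuous across the panorama seam.
    Illustrative helper; not taken from the DiT360 code.
    """
    return np.pad(tokens, ((0, 0), (pad, pad), (0, 0)), mode="wrap")


def yaw_rotate(tokens: np.ndarray, shift: int) -> np.ndarray:
    """Rotate a panorama about the vertical axis by `shift` columns.

    In an equirectangular layout a yaw rotation is a circular shift
    along the width axis, so no resampling is required.
    """
    return np.roll(tokens, shift, axis=1)


# Toy grid: 2 rows, 8 columns, 1 channel, with distinct values per cell.
tokens = np.arange(2 * 8 * 1, dtype=np.float32).reshape(2, 8, 1)

padded = circular_pad_width(tokens, pad=2)
rotated = yaw_rotate(tokens, shift=3)
restored = yaw_rotate(rotated, shift=-3)
```

Because a yaw rotation is lossless in this layout, a rotated panorama is as valid a training target as the original. That property is what makes rotation-based supervision possible without introducing interpolation artifacts.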

Citation

If you find our work useful, please consider citing:

@misc{dit360,
  title={DiT360: High-Fidelity Panoramic Image Generation via Hybrid Training},
  author={Haoran Feng and Dizhe Zhang and Xiangtai Li and Bo Du and Lu Qi},
  year={2025},
  eprint={2510.11712},
  archivePrefix={arXiv},
}