This repository is the official code for the NeurIPS 2025 (San Diego, USA) paper "One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting" by Haipeng Liu (hpliu_hfut@hotmail.com), Yang Wang* (corresponding author: yangwang@hfut.edu.cn), and Meng Wang.
Text-guided image inpainting aims to reconstruct the masked regions of an image as per text prompts. The longstanding challenges are to preserve the unmasked regions while achieving semantic consistency between the unmasked and the inpainted masked regions. Previous arts fail to address both at once, typically remedying only one of them. As we observe, this stems from the entanglement of the hybrid (e.g., mid-and-low) frequency bands that encode varied image properties and exhibit different robustness to text prompts during the denoising process.
In this paper, we propose a null-text-null frequency-aware diffusion model, dubbed NTN-Diff, for text-guided image inpainting. It decomposes the semantic consistency across masked and unmasked regions into a consistency per frequency band while preserving the unmasked regions, circumventing the two challenges in a row. Building on the diffusion process, we further divide the denoising process into an early (high-level noise) stage and a late (low-level noise) stage, where the mid- and low-frequency bands are disentangled. We observe that the stable mid-frequency band is progressively denoised to be semantically aligned during the text-guided denoising process; it meanwhile serves as guidance for the null-text denoising process that denoises the low-frequency band of the masked regions, followed by a text-guided denoising process at the late stage. This achieves semantic consistency of the mid- and low-frequency bands across masked and unmasked regions while preserving the unmasked regions. Extensive experiments validate the superiority of NTN-Diff over state-of-the-art diffusion models for text-guided image inpainting.
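The core idea above rests on splitting a latent into low-, mid-, and high-frequency bands via the DCT. A minimal sketch of such a band decomposition is shown below; the radial cut-offs, mask shapes, and helper names are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np
from scipy.fft import dctn, idctn

def frequency_band_masks(h, w, low_cut=0.25, mid_cut=0.6):
    """Binary masks over 2-D DCT coefficients: a coefficient at (u, v)
    is 'low' when its normalized radius is below low_cut, and 'mid'
    between low_cut and mid_cut (cut-offs here are placeholders)."""
    u = np.arange(h)[:, None] / h
    v = np.arange(w)[None, :] / w
    radius = np.sqrt(u**2 + v**2)
    m_low = (radius < low_cut).astype(float)
    m_mid = ((radius >= low_cut) & (radius < mid_cut)).astype(float)
    return m_low, m_mid

def split_bands(x, m_low, m_mid):
    """Decompose a 2-D latent channel into low/mid/high components
    that sum back exactly to the original signal."""
    coeffs = dctn(x, norm="ortho")
    low = idctn(coeffs * m_low, norm="ortho")
    mid = idctn(coeffs * m_mid, norm="ortho")
    high = x - low - mid  # residual high-frequency band
    return low, mid, high
```

Because the orthonormal DCT is invertible, the three bands reconstruct the input exactly, which is what lets consistency be enforced per band without losing content.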
Early Stage of Null-Text-Null Frequency-Aware Diffusion Models:
- Null-Text Low-Frequency Aware Denoising Process:

$$ \hat{z}^{un}_{t} = z^{gt}_{T-t} \odot m_{z} + z^{un}_{t} \odot (1 - m_{z}) $$

- Text-Guided Denoising Process:

$$ \tilde{z}_{t}^{text} = \text{IDCT}\left(\text{DCT}(z_{t}^{un}) \odot m_{\text{low}} + \text{DCT}(z_{t}^{text}) \odot (1 - m_{\text{low}})\right) $$

- Null-Text Mid-Frequency Aware Denoising Process:

$$ \tilde{z}^{in}_{t} = \text{IDCT}\left(\text{DCT}(\tilde{z}_{t}^{text}) \odot m_{\text{mid}} + \text{DCT}(z^{in}_{t}) \odot (1 - m_{\text{mid}})\right) $$
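The three blending steps above can be sketched as follows, assuming DCT/IDCT denote the orthonormal 2-D type-II transform and its inverse; the function names mirror the equations, and the latents and masks passed in are illustrative placeholders rather than the repository's actual tensors:

```python
import numpy as np
from scipy.fft import dctn, idctn

def DCT(z):
    return dctn(z, norm="ortho")

def IDCT(c):
    return idctn(c, norm="ortho")

def blend_unmasked(z_gt, z_un, m_z):
    # hat{z}^un_t: paste the ground-truth latent back over unmasked pixels
    return z_gt * m_z + z_un * (1 - m_z)

def blend_low(z_un, z_text, m_low):
    # tilde{z}^text_t: take the low-frequency band from the null-text branch,
    # the rest from the text-guided branch
    return IDCT(DCT(z_un) * m_low + DCT(z_text) * (1 - m_low))

def blend_mid(z_text, z_in, m_mid):
    # tilde{z}^in_t: take the mid-frequency band from the text-guided branch,
    # the rest from the inpainting branch
    return IDCT(DCT(z_text) * m_mid + DCT(z_in) * (1 - m_mid))
```

Each step is the same pattern: a binary mask selects which region (spatial or spectral) is copied from one denoising branch, with the complement taken from the other.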
Late Stage of Text-Guided Denoising Process
Figure 1. Illustration of the proposed NTN-Diff pipeline.
Figure 2. Illustration of (a) denoised low-frequency band layer and (b) mid-frequency band layer.
Dataset Preparation: BrushBench
Pre-trained models: Realistic Vision V6.0 B1
Run the following command:

```shell
python3 test.py
```
- Inpainted Image: Baidu
If any part of our paper or repository is helpful to your work, please cite:
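A BibTeX entry assembled from the information at the top of this README (the citation key and field formatting are placeholders pending the official proceedings entry):

```bibtex
@inproceedings{liu2025ntndiff,
  title     = {One Stone with Two Birds: A Null-Text-Null Frequency-Aware Diffusion Models for Text-Guided Image Inpainting},
  author    = {Liu, Haipeng and Wang, Yang and Wang, Meng},
  booktitle = {Advances in Neural Information Processing Systems (NeurIPS)},
  year      = {2025}
}
```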