
Fast Inversion for Real-Time Image Editing with Text

Figure 1. An image changing in response to the text prompt

Text-to-image diffusion models can generate diverse, high-fidelity images based on user-provided text prompts. They operate by mapping a random sample z_{T} from a high-dimensional space, conditioned on the text prompt, through a series of denoising steps, resulting in a representation of the corresponding image, z_{0}.

These models can also be used for more complex tasks such as image editing, learning to depict a personalized concept, or semantic data augmentation. In this context, image editing refers to the task of making local changes to a given image based on a text prompt, while the other parts of the image remain unchanged.

All these additional tasks involve a process called inversion: Given an image representation z_{0} and its corresponding text prompt p, you seek a noise seed z_{T} that, when fed into the denoising process, yields the reconstructed image z_{0}.

Regularized Newton-Raphson Inversion (RNRI), a novel inversion technique, was recently proposed. RNRI outperforms existing inversion approaches by balancing rapid convergence with superior accuracy, execution time, and memory efficiency, enabling real-time image editing for the first time.

Inversion as solving an implicit equation

Inverting a diffusion model requires searching in the space of possible seeds for one that would reconstruct a given image. This search may be computationally demanding.

To understand how this can be achieved efficiently, first consider the generative (denoising) process.

Sampling from diffusion models can be viewed as solving an ordinary differential equation. The popular DDIM deterministic scheduler presented in Denoising Diffusion Implicit Models denoises a latent noise vector in the following way:

Equation 1

z_{t-1}=\sqrt{\frac{\alpha_{t-1}}{\alpha_{t}}}z_{t} - \sqrt{\alpha_{t-1}} \cdot \Delta \psi(\alpha_{t})\cdot \epsilon_{\theta}(z_{t},t,p)

In equation 1, \alpha_t = \prod_{s=1}^{t}(1-\beta_s) is the cumulative product of the noise schedule, \psi(\alpha) = \sqrt{\frac{1}{\alpha}-1}, and \Delta \psi(\alpha_t) = \psi(\alpha_t) - \psi(\alpha_{t-1}).
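In code, a single denoising step is a direct transcription of equation 1. The following PyTorch-style sketch assumes a noise-prediction network eps_model(z_t, t, prompt_emb) standing in for \epsilon_{\theta} and a precomputed tensor alphas_cumprod holding \alpha_t for every timestep; these names are illustrative and not tied to any particular library.

```python
import torch

def psi(alpha):
    # psi(alpha) = sqrt(1/alpha - 1), as used in equation 1
    return torch.sqrt(1.0 / alpha - 1.0)

def ddim_step(z_t, t, prompt_emb, eps_model, alphas_cumprod):
    """One deterministic DDIM denoising step, z_t -> z_{t-1} (equation 1)."""
    alpha_t, alpha_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    delta_psi = psi(alpha_t) - psi(alpha_prev)
    eps = eps_model(z_t, t, prompt_emb)  # epsilon_theta(z_t, t, p)
    return (torch.sqrt(alpha_prev / alpha_t) * z_t
            - torch.sqrt(alpha_prev) * delta_psi * eps)
```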

DDIM inversion

To derive the inversion, equation 1 is first rewritten as follows:

Equation 2

z_t = \sqrt{\frac{\alpha_t}{\alpha_{t-1}}}z_{t-1} + \sqrt{\alpha_{t}} \cdot \Delta \psi(\alpha_t) \cdot \epsilon_{\theta}(z_{t},t,p)

This gives an implicit equation in z_{t} that cannot be solved in closed form. DDIM inversion approximates the solution by replacing z_{t} with z_{t-1} inside the noise-prediction term:

Equation 3

z_t \approx \sqrt{\frac{\alpha_t}{\alpha_{t-1}}}z_{t-1} + \sqrt{\alpha_{t}} \cdot \Delta \psi(\alpha_t) \cdot \epsilon_{\theta}(\boxed{z_{t-1}},t,p)

DDIM inversion is fast, but this approximation often yields inaccurate inversions.
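For comparison, a naive DDIM inversion step (equation 3) maps z_{t-1} to an approximate z_t by evaluating the noise predictor at z_{t-1}. The helper names carry over from the sketch above and remain illustrative.

```python
def ddim_inversion_step(z_prev, t, prompt_emb, eps_model, alphas_cumprod):
    """Approximate z_t from z_{t-1} by evaluating epsilon_theta at z_{t-1} (equation 3)."""
    alpha_t, alpha_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    delta_psi = psi(alpha_t) - psi(alpha_prev)
    eps = eps_model(z_prev, t, prompt_emb)  # evaluated at z_{t-1}: the source of the error
    return (torch.sqrt(alpha_t / alpha_prev) * z_prev
            + torch.sqrt(alpha_t) * delta_psi * eps)
```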

Fixed-point and gradient descent inversion methods

Several papers improve on this approximation by using iterative methods to solve equation 2 more accurately. One approach solves the equation directly with fixed-point iterations, a technique widely used in numerical analysis for implicit equations. For more information, see Effective Real Image Editing with Accelerated Iterative Diffusion Inversion.
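As a rough illustration of the idea (a sketch, not the implementation from that paper), a fixed-point inversion step repeatedly re-evaluates the right-hand side of equation 2 with the current estimate of z_t:

```python
def fixed_point_inversion_step(z_prev, t, prompt_emb, eps_model,
                               alphas_cumprod, num_iters=5):
    """Solve equation 2 for z_t with fixed-point iterations z <- f(z)."""
    alpha_t, alpha_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    delta_psi = psi(alpha_t) - psi(alpha_prev)

    z_t = z_prev  # start from z_{t-1}; the first pass reproduces the DDIM inversion estimate
    for _ in range(num_iters):
        eps = eps_model(z_t, t, prompt_emb)
        z_t = (torch.sqrt(alpha_t / alpha_prev) * z_prev
               + torch.sqrt(alpha_t) * delta_psi * eps)
    return z_t
```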

A related approach solves a more precise inversion equation, obtained by including higher-order terms, using gradient descent. For more information, see On Exact Inversion of DPM-Solvers.

Fixed-point iterations and gradient descent methods provide better accuracy than DDIM, but have a linear convergence rate and may take many seconds to compute.

Regularized Newton-Raphson Inversion method

A faster and more accurate alternative is based on the well-known Newton-Raphson iterative method (NR).

NR is a method for iteratively finding the roots of a system of equations. A naive application of NR to the full latent space would solve z_t = f(z_t), where f(z_t) denotes the right-hand side of equation 2. This formulation is impractical because each iteration requires inverting a high-dimensional Jacobian matrix.
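For reference, the textbook NR update for the system z_t - f(z_t) = 0 would be

z_t^{(k+1)} = z_t^{(k)} - \left(I - \frac{\partial f}{\partial z_t}\bigg\rvert_{z_t^{(k)}}\right)^{-1}\left(z_t^{(k)} - f(z_t^{(k)})\right)

where the Jacobian is a d \times d matrix and d is the latent dimensionality (roughly 16K values for a 512 \times 512 Stable Diffusion image), so forming and inverting it at every step is prohibitively expensive.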

Instead, define a multivariable scalar function \hat{r}: R^d \rightarrow R:

Equation 4

\hat{r}(z_t) := ||z_t - f(z_t)||

Seek its roots, \hat{r}(z_t)=0. Because \hat{r} is scalar-valued, its Jacobian reduces to a gradient vector that can be computed quickly.
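Concretely, for a single scalar equation in d unknowns, an NR-style step can use the pseudoinverse of that 1 \times d Jacobian. One standard form of this update (shown here for illustration; the exact update used in the paper may differ) is

z_t^{(k+1)} = z_t^{(k)} - \frac{\hat{r}(z_t^{(k)})}{\|\nabla \hat{r}(z_t^{(k)})\|^2}\,\nabla \hat{r}(z_t^{(k)})

which requires only one gradient of \hat{r} per iteration.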

Solving equation 4 can be done quickly, but its solutions are not guaranteed to reconstruct the image well because the equation is underdetermined. Also, some roots of \hat{r}(z_t) may be out of distribution for the diffusion model.

To address this issue, add a regularization term to the NR objective:

Equation 5

q(z_{t}|z_{t-1}) := \mathcal{N}(z_{t};\mu_t=\sqrt{1-\beta_{t}}z_{t-1},\Sigma_t=\beta_{t}I)

Because each noising step in the diffusion process follows this Gaussian distribution, it is incorporated as a prior over the values of z_t. Its negative log-likelihood is added as a regularizing penalty term, forming the objective:

Equation 6

\mathcal{L}(z_t) := ||z_t - f(z_t)|| - \lambda \log q(z_t | z_{t-1})
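Up to an additive constant that does not affect the optimization, this regularizer is a quadratic penalty that keeps z_t close to the mean of the forward noising step:

-\log q(z_t|z_{t-1}) = \frac{\|z_t - \sqrt{1-\beta_t}\,z_{t-1}\|^2}{2\beta_t} + \text{const}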

The Newton-Raphson iteration for this function can be computed efficiently using automatic differentiation engines, initializing the process with z_{t-1} from the previous diffusion timestep. Regularized Newton-Raphson Inversion (RNRI) converges in 1–2 iterations (~0.5 sec for latent consistency models).
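The following sketch puts the pieces together for a single timestep. It reuses the assumed helpers from the earlier sketches, adds a betas tensor and a regularization weight lam, and applies a pseudoinverse-style NR update to the scalar objective of equation 6. The hyperparameters and the exact update rule are illustrative choices, not the paper's reference implementation.

```python
def rnri_step(z_prev, t, prompt_emb, eps_model, alphas_cumprod,
              betas, lam=0.1, num_iters=2):
    """Sketch of one regularized Newton-Raphson inversion step (equation 6),
    starting from z_{t-1} and taking a few NR-style updates on the scalar
    objective L(z_t) = ||z_t - f(z_t)|| - lam * log q(z_t | z_{t-1})."""
    alpha_t, alpha_prev = alphas_cumprod[t], alphas_cumprod[t - 1]
    delta_psi = psi(alpha_t) - psi(alpha_prev)
    beta_t = betas[t]

    z_t = z_prev.clone()
    for _ in range(num_iters):
        z_t = z_t.detach().requires_grad_(True)
        # f(z_t): right-hand side of equation 2 (gradients flow through the noise predictor)
        eps = eps_model(z_t, t, prompt_emb)
        f_z = (torch.sqrt(alpha_t / alpha_prev) * z_prev
               + torch.sqrt(alpha_t) * delta_psi * eps)
        residual = torch.norm(z_t - f_z)
        # -log q(z_t | z_{t-1}) up to an additive constant (Gaussian prior, equation 5)
        prior = ((z_t - torch.sqrt(1.0 - beta_t) * z_prev) ** 2).sum() / (2.0 * beta_t)
        loss = residual + lam * prior
        grad = torch.autograd.grad(loss, z_t)[0]
        # Pseudoinverse NR-style update on the scalar objective
        with torch.no_grad():
            z_t = z_t - loss * grad / (grad.pow(2).sum() + 1e-8)
    return z_t.detach()
```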

Figure 2 compares the quality of reconstructed images from the COCO validation set (measured using PSNR) against the time it takes to compute the inversion. It shows that RNRI improves over recent methods in PSNR, run time, or both. For a fair comparison, run time is measured on a single NVIDIA A100 GPU for all methods. The dashed black line denotes the upper bound due to the inherent distortion caused by the Stable Diffusion VAE.
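As a reminder, PSNR is computed from the mean squared error between the original and reconstructed images, with pixel values in [0, \text{MAX}]; higher is better:

\text{PSNR} = 10 \log_{10}\left(\frac{\text{MAX}^2}{\text{MSE}}\right)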

Figure 2. Reconstruction quality (PSNR) compared to inversion-reconstruction runtime for a latent diffusion model (left) and a latent consistency model (right). In both cases, RNRI reaches the highest PSNR substantially faster than the other methods; with the latent consistency model it does so in under 0.5 seconds

Figure 3 provides a qualitative comparison between RNRI and previous state-of-the-art inversion approaches. It shows cases where RNRI produces edits that remain faithful to the input image while adhering well to the target prompt; alternative approaches struggle with these images and prompts. The baselines were run until convergence, whereas RNRI was run for only two iterations per diffusion step.

For example, in the first row, RNRI succeeds in converting the pizza into slices of bread, while the other methods either fail to achieve this or incorrectly modify other elements. In the third row, the other methods either struggle to substitute the bananas with oranges or alter the background. In contrast, RNRI accurately edits the object while maintaining the original background.

Figure 3. Qualitative comparison of inversion-based editing approaches. Each row shows an input image transformed with different methods; RNRI edits images more naturally while preserving the structure of the original image

Evaluation of RNRI results

Following previous work, editing performance is measured using two metrics:

  • An LPIPS score quantifies the extent to which structure is preserved (lower is better).
  • A CLIP-based score quantifies how well the generated images match the text prompt (higher is better).

Values are averages across 100 MS-COCO images. Figure 4 shows that editing with RNRI yields superior CLIP and LPIPS scores, achieving state-of-the-art editing of real images.
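A minimal sketch of how such scores can be computed with the lpips and torchmetrics packages is shown below; the package choice, model variants, and preprocessing are assumptions for illustration, not necessarily the evaluation setup behind Figure 4.

```python
# Assumes the lpips and torchmetrics (with CLIP support) packages are installed.
import lpips
from torchmetrics.multimodal.clip_score import CLIPScore

# LPIPS: perceptual distance between source and edited images (lower is better)
lpips_fn = lpips.LPIPS(net="vgg")

def lpips_score(src, edited):
    # Both tensors have shape (N, 3, H, W) with values scaled to [-1, 1]
    return lpips_fn(src, edited).mean().item()

# CLIP score: agreement between edited images and the target prompt (higher is better)
clip_metric = CLIPScore(model_name_or_path="openai/clip-vit-base-patch16")

def clip_text_score(edited_uint8, prompts):
    # edited_uint8: (N, 3, H, W) uint8 images in [0, 255]; prompts: list of target strings
    return clip_metric(edited_uint8, prompts).item()
```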

Figure 4. CLIP scores (compliance with the text prompt, higher is better) and LPIPS scores (structure preservation, lower is better) for editing with a latent diffusion model (left) and a latent consistency model (right). RNRI outperforms the baselines on both models

Finally, Figure 5 shows additional real-time editing results.

Conclusion

Image inversion in diffusion models is key for applications like image editing, semantic augmentation, and generating rare-concept images. Current methods often sacrifice inversion quality for computational efficiency, requiring significantly more resources for high-quality results.

Regularized Newton-Raphson Inversion (RNRI) balances rapid convergence with superior accuracy, execution time, and memory efficiency. The RNRI method outperforms existing approaches in both latent diffusion and latent consistency models, enabling real-time image editing.

For more information, see the full paper, Regularized Newton Raphson Inversion for Text-to-Image Diffusion Models. You can also try RNRI yourself.
