Fast Inversion for Real-Time Image Editing with Text

Text-to-image diffusion models can generate diverse, high-fidelity images based on user-provided text prompts. They operate by mapping a random sample from a high-dimensional space, \(z_{T}\), conditioned on a user-provided text prompt, through a series of denoising steps. This results in a representation of the corresponding image, \(z_{0}\).

These models can also be used for more complex tasks such as image editing, learning to depict a personalized concept, or semantic data augmentation. In this context, image editing refers to the task of making local changes to a given image based on a text prompt, while the other parts of the image remain unchanged.

All these additional tasks involve a process called inversion: Given an image representation \(z_{0}\) and its corresponding text prompt \(p\), you seek a noise seed \(z_{T}\) that, when fed into the denoising process, yields the reconstructed image \(z_{0}\).

Regularized Newton-Raphson Inversion (RNRI), a novel inversion technique, was recently proposed. RNRI outperforms existing inversion approaches by balancing rapid convergence with superior accuracy, execution time, and memory efficiency, enabling real-time image editing for the first time.

GIF shows real-time editing of several images. Given a photo of a lion sitting in the grass, a text prompt is used to transform the lion into a raccoon while preserving the background. All edits involve two processes, inversion and generation, both being fast to make the full process interactive. — *a) a lion is sitting in the green grass at sunset*

GIF shows real-time editing of several images. Given a photo of a cat sitting next to a glass vase with flowers, a text prompt is used to transform the cat into a fish while preserving the background. All edits involve two processes, inversion and generation, both being fast to make the full process interactive.

Inversion as solving an implicit equation

Inverting a diffusion model requires searching in the space of possible seeds for one that would reconstruct a given image. This search may be computationally demanding.

To understand how it can be achieved efficiently, consider first how images are sampled, that is, the denoising process.

Sampling from diffusion models can be viewed as solving an ordinary differential equation. The popular DDIM deterministic scheduler presented in Denoising Diffusion Implicit Models denoises a latent noise vector in the following way:

Equation 1

\(z_{t-1}=\sqrt{\frac{\alpha_{t-1}}{\alpha_{t}}}z_{t} - \sqrt{\alpha_{t-1}} \cdot \Delta \psi(\alpha_{t})\cdot \epsilon_{\theta}(z_{t},t,p)\)

In equation 1, \(\alpha_t = 1-\beta_t\), \(\psi(\alpha) = \sqrt{\frac{1}{\alpha}-1}\) and \(\Delta \psi(\alpha_t) = \psi(\alpha_t) - \psi(\alpha_{t-1})\).
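To make equation 1 concrete, the following is a minimal sketch of one DDIM denoising step on a scalar toy latent. The noise predictor `toy_eps` and the `alphas` values are hypothetical stand-ins chosen for illustration. A real model uses a neural network for \(\epsilon_{\theta}\) and a trained noise schedule.

```python
import math


def psi(alpha):
    """psi(alpha) = sqrt(1/alpha - 1), as defined for equation 1."""
    return math.sqrt(1.0 / alpha - 1.0)


def toy_eps(z, t):
    """Hypothetical stand-in for the noise predictor eps_theta(z_t, t, p)."""
    return math.tanh(z)


def ddim_step(z_t, t, alphas, eps=toy_eps):
    """One deterministic DDIM denoising step (equation 1): z_t -> z_{t-1}."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    delta_psi = psi(a_t) - psi(a_prev)
    return math.sqrt(a_prev / a_t) * z_t - math.sqrt(a_prev) * delta_psi * eps(z_t, t)


# Illustrative alpha values indexed by timestep (not a real schedule).
alphas = [0.99, 0.9, 0.7, 0.4, 0.1]
z = 1.5  # toy scalar "noise seed" z_T
for t in range(len(alphas) - 1, 0, -1):
    z = ddim_step(z, t, alphas)
print("toy z_0:", z)
```

Each step only needs one evaluation of the noise predictor at the current latent, which is why sampling is explicit while inversion is not.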

DDIM inversion

To derive the inversion, equation 1 is first rewritten to express \(z_t\) in terms of \(z_{t-1}\):

Equation 2

\(z_t = \sqrt{\frac{\alpha_t}{\alpha_{t-1}}}z_{t-1} + \sqrt{\alpha_{t}} \cdot \Delta \psi(\alpha_t) \cdot \epsilon_{\theta}(z_{t},t,p)\)
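As a quick check of the algebra (a worked step using only the symbols already defined), multiplying both sides of equation 1 by \(\sqrt{\frac{\alpha_t}{\alpha_{t-1}}}\) gives:

\(\sqrt{\frac{\alpha_t}{\alpha_{t-1}}}z_{t-1} = z_t - \sqrt{\alpha_{t}} \cdot \Delta \psi(\alpha_t) \cdot \epsilon_{\theta}(z_{t},t,p)\)

Here \(\sqrt{\frac{\alpha_t}{\alpha_{t-1}}} \cdot \sqrt{\frac{\alpha_{t-1}}{\alpha_t}} = 1\) and \(\sqrt{\frac{\alpha_t}{\alpha_{t-1}}} \cdot \sqrt{\alpha_{t-1}} = \sqrt{\alpha_t}\). Moving the noise term to the other side yields equation 2.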

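Equation 2 is an implicit equation in \(z_{t}\): the unknown \(z_{t}\) appears on both sides, including inside the noise predictor \(\epsilon_{\theta}\), so it cannot be solved in closed form. DDIM inversion approximates it by replacing \(z_{t}\) with \(z_{t-1}\) inside the noise predictor: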

Equation 3

\(z_t \approx \sqrt{\frac{\alpha_t}{\alpha_{t-1}}}z_{t-1} + \sqrt{\alpha_{t}} \cdot \Delta \psi(\alpha_t) \cdot\epsilon_{\theta}(\boxed{z_{t-1}},t,p)\)

DDIM inversion is fast, but the approximation often makes it inaccurate.
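The inaccuracy can be seen on a toy problem. The sketch below inverts a scalar latent with equation 3 and then denoises it again with equation 1. The noise predictor `toy_eps` and the `alphas` values are hypothetical and only serve the illustration.

```python
import math


def psi(alpha):
    return math.sqrt(1.0 / alpha - 1.0)


def toy_eps(z, t):
    # Hypothetical stand-in for eps_theta(z, t, p); a real model is a neural network.
    return math.tanh(z)


def ddim_step(z_t, t, alphas, eps=toy_eps):
    """Equation 1: denoise z_t -> z_{t-1}."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    delta_psi = psi(a_t) - psi(a_prev)
    return math.sqrt(a_prev / a_t) * z_t - math.sqrt(a_prev) * delta_psi * eps(z_t, t)


def ddim_invert_step(z_prev, t, alphas, eps=toy_eps):
    """Equation 3: approximate z_t by evaluating eps at z_{t-1} instead of z_t."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    delta_psi = psi(a_t) - psi(a_prev)
    return math.sqrt(a_t / a_prev) * z_prev + math.sqrt(a_t) * delta_psi * eps(z_prev, t)


def round_trip_error(z0, alphas, eps=toy_eps):
    """Invert z_0 up to z_T with equation 3, then denoise back with equation 1."""
    z = z0
    for t in range(1, len(alphas)):
        z = ddim_invert_step(z, t, alphas, eps)
    for t in range(len(alphas) - 1, 0, -1):
        z = ddim_step(z, t, alphas, eps)
    return abs(z - z0)


alphas = [0.99, 0.9, 0.7, 0.4, 0.1]  # illustrative values, not a real schedule
print("round-trip error:", round_trip_error(0.8, alphas))
```

If the noise predictor did not depend on its input, the round trip would be exact. The error comes entirely from evaluating \(\epsilon_{\theta}\) at \(z_{t-1}\) rather than at \(z_t\).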

Fixed-point and gradient descent inversion methods

Several papers improve on this approximation by using iterative methods to approximately solve equation 2. For example, the equation can be solved directly with fixed-point iterations, a method widely used in numerical analysis for solving implicit equations. For more information, see Effective Real Image Editing with Accelerated Iterative Diffusion Inversion.

In a related approach, a more precise inversion equation, obtained by including higher-order terms, is solved using gradient descent. For more information, see On Exact Inversion of DPM-Solvers.

Fixed-point iterations and gradient descent methods provide better accuracy than DDIM, but have a linear convergence rate and may take many seconds to compute.
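The fixed-point idea can be sketched on a scalar toy latent: repeatedly substitute the current estimate of \(z_t\) into the right-hand side of equation 2. The noise predictor `toy_eps` and the `alphas` values are hypothetical, and the function names are illustrative rather than taken from any of the cited papers.

```python
import math


def psi(alpha):
    return math.sqrt(1.0 / alpha - 1.0)


def toy_eps(z, t):
    # Hypothetical stand-in for eps_theta(z, t, p).
    return math.tanh(z)


def f(z_t, z_prev, t, alphas, eps=toy_eps):
    """Right-hand side of equation 2, viewed as a function of z_t."""
    a_t, a_prev = alphas[t], alphas[t - 1]
    delta_psi = psi(a_t) - psi(a_prev)
    return math.sqrt(a_t / a_prev) * z_prev + math.sqrt(a_t) * delta_psi * eps(z_t, t)


def fixed_point_invert_step(z_prev, t, alphas, iters=10, eps=toy_eps):
    """Iterate z <- f(z) starting from z_{t-1}; return z and the residual history."""
    z = z_prev
    residuals = []
    for _ in range(iters):
        # The first iterate coincides with DDIM inversion (equation 3).
        z = f(z, z_prev, t, alphas, eps)
        residuals.append(abs(z - f(z, z_prev, t, alphas, eps)))
    return z, residuals


alphas = [0.9, 0.7]  # illustrative values for timesteps t-1 and t
z1, residuals = fixed_point_invert_step(0.8, 1, alphas)
for i, r in enumerate(residuals, start=1):
    print(i, r)
```

In this toy setting the residual shrinks by a roughly constant factor per iteration, which is the linear convergence rate mentioned above. Each iteration costs one evaluation of the noise predictor.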

Regularized Newton-Raphson Inversion method

A faster and more accurate alternative is based on the well-known Newton-Raphson iterative method (NR).

NR is a method for iteratively finding the roots of a system of equations. Let \(f(z_t)\) denote the right-hand side of equation 2, so that inversion amounts to solving \(z_t = f(z_t)\). A naive application of NR to the full latent space would seek the roots of the vector-valued residual \(z_t - f(z_t)\). This formulation is impractical because it requires inverting a high-dimensional Jacobian matrix.

Instead, define a multivariable scalar function \(\hat{r}: \mathbb{R}^d \rightarrow \mathbb{R}\):

Equation 4

\(\hat{r}(z_t) := ||z_t - f(z_t)||\)

Seek its roots, \(\hat{r}(z_t)=0\). Because \(\hat{r}(z_t)\) is a scalar function, its Jacobian reduces to a gradient vector and can be computed quickly.

Solving equation 4 can be done quickly, but its solutions are not guaranteed to reconstruct the image well because the equation is underdetermined. Also, some roots of \(\hat{r}(z_t)\) may be out of distribution for the diffusion model.

To address this issue, add a regularization term to the NR objective, based on the distribution of a single noising step:

Equation 5

\(q(z_{t}|z_{t-1}) := \mathcal{N}(z_{t};\mu_t=\sqrt{1-\beta_{t}}z_{t-1},\Sigma_t=\beta_{t}I)\)

Because each noising step in the diffusion process follows this Gaussian distribution, it can be used as a prior over the values of \(z_t\). Its negative log-likelihood is added as a regularizing penalty term, forming the objective:

Equation 6

\(\mathcal{L}(z_t) := ||z_t - f(z_t)|| - \lambda \log q(z_t | z_{t-1})\)
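For reference, expanding the Gaussian of equation 5 for a \(d\)-dimensional latent (a standard identity, written out here for clarity) makes the penalty explicit:

\(-\log q(z_t | z_{t-1}) = \frac{1}{2\beta_t}||z_t - \sqrt{1-\beta_t}z_{t-1}||_2^2 + \frac{d}{2}\log(2\pi\beta_t)\)

The second term does not depend on \(z_t\). The first term pulls \(z_t\) toward the mean \(\sqrt{1-\beta_t}z_{t-1}\), with a strength that is inversely proportional to \(\beta_t\) and scaled by \(\lambda\).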

The Newton-Raphson iteration for this function can be computed efficiently using automatic differentiation engines, initializing the process with \(z_{t-1}\) from the previous diffusion timestep. Regularized Newton-Raphson Inversion (RNRI) converges in 1–2 iterations (~0.5 sec for latent consistency models).
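The following is a minimal sketch of such an iteration on a small toy latent, under several assumptions that are not taken from the paper. The noise predictor is an elementwise `tanh`, the gradient is written by hand instead of using automatic differentiation, the additive constant of the Gaussian log-likelihood is dropped, and `beta_t` is passed explicitly with an arbitrary value. The update is the minimum-norm Newton step for a scalar function of a vector, \(z \leftarrow z - \mathcal{L}(z)\nabla\mathcal{L}(z)/||\nabla\mathcal{L}(z)||^2\), which is one common choice and may differ from the exact update rule used by RNRI.

```python
import math


def psi(alpha):
    return math.sqrt(1.0 / alpha - 1.0)


def coefficients(a_prev, a_t):
    """Return c, k such that equation 2 reads z_t = c * z_prev + k * eps(z_t)."""
    return math.sqrt(a_t / a_prev), math.sqrt(a_t) * (psi(a_t) - psi(a_prev))


def residual(z, z_prev, a_prev, a_t):
    """r_hat(z) = ||z - f(z)|| with a toy elementwise eps(z) = tanh(z)."""
    c, k = coefficients(a_prev, a_t)
    return math.sqrt(sum((zi - c * zp - k * math.tanh(zi)) ** 2
                         for zi, zp in zip(z, z_prev)))


def regularized_nr(z_prev, a_prev, a_t, beta_t, lam=0.0, iters=2):
    """Newton-type iterations on L(z) = ||z - f(z)|| - lam * log q(z | z_prev)."""
    c, k = coefficients(a_prev, a_t)
    mu = [math.sqrt(1.0 - beta_t) * zp for zp in z_prev]  # mean of q(z_t | z_{t-1})
    z = list(z_prev)  # initialize with z_{t-1}
    for _ in range(iters):
        g = [zi - c * zp - k * math.tanh(zi) for zi, zp in zip(z, z_prev)]
        r = math.sqrt(sum(gi * gi for gi in g))
        # Negative log-likelihood of equation 5, additive constant dropped.
        nll = sum((zi - mi) ** 2 for zi, mi in zip(z, mu)) / (2.0 * beta_t)
        loss = r + lam * nll
        # Hand-written gradient of the toy objective (stands in for autodiff).
        grad = []
        for zi, gi, mi in zip(z, g, mu):
            dg = 1.0 - k * (1.0 - math.tanh(zi) ** 2)
            dr = gi * dg / r if r > 0.0 else 0.0
            grad.append(dr + lam * (zi - mi) / beta_t)
        grad_norm_sq = sum(x * x for x in grad)
        if loss == 0.0 or grad_norm_sq == 0.0:
            break
        # Minimum-norm Newton step for a scalar function of a vector.
        z = [zi - loss * gr / grad_norm_sq for zi, gr in zip(z, grad)]
    return z


z_prev = [0.8, -0.3, 1.2]
for lam in (0.0, 0.01):
    z_t = regularized_nr(z_prev, a_prev=0.9, a_t=0.7, beta_t=0.2, lam=lam)
    print(lam, residual(z_t, z_prev, 0.9, 0.7))
```

Only the scalar objective and its gradient are needed per iteration, so no high-dimensional Jacobian is formed or inverted. With a real diffusion model, the hand-written gradient would be replaced by a backward pass through the noise predictor.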

Figure 2 compares the quality of reconstructed images (measured using PSNR) of the COCO validation set, against the time it takes to compute the inversion. It shows that RNRI improves in terms of PSNR or run time over recent methods. For a fair time comparison, run time is measured on a single NVIDIA A100 GPU for all methods. The dashed black line denotes the upper bound that is due to the inherent distortion caused by the Stable Diffusion VAE.
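For readers unfamiliar with the metric, PSNR can be computed as in the sketch below. It assumes pixel values flattened into plain lists and scaled to the range 0 to `max_value`. The evaluation code behind Figure 2 may differ in details such as color handling.

```python
import math


def psnr(original, reconstructed, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equally sized pixel lists."""
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


# Toy example: every pixel is off by 0.1 on a [0, 1] scale.
print(psnr([0.0, 0.5, 1.0], [0.1, 0.6, 0.9]))
```

Higher PSNR means a reconstruction closer to the input image, which is why the VAE distortion sets an upper bound that no inversion method can exceed.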

Figure 3 provides a qualitative comparison between RNRI and previous state-of-the-art inversion approaches. It shows cases where RNRI produces edits that keep high fidelity to the input image and also adhere well to the target prompt. Alternative approaches may struggle with editing these images and prompts. Baselines were run until they converged, whereas RNRI was run for only two iterations per diffusion step.

For example, in the first row, RNRI succeeds in converting the pizza into slices of bread. Other methods either fail to achieve this or incorrectly modify other elements. In the third row, the other methods struggle to accurately substitute bananas with oranges, or they alter the background. In contrast, RNRI accurately edits the object while maintaining the original background.

Evaluation of RNRI results

Following previous work, editing performance is measured using the following metrics:

An LPIPS score quantifies the extent to which structure is preserved (lower is better).
A CLIP-based score quantifies how well the generated images match the text prompt (higher is better).

Values are averages across 100 MS-COCO images. Figure 4 shows that editing with RNRI yields superior CLIP and LPIPS scores, achieving state-of-the-art editing of real images.

Finally, Figure 5 shows additional real-time editing results.

GIF shows real-time editing of various images. One part features a photo of a desk with a laptop, a cup of coffee near a black pen, and a smartphone. The coffee is first transformed into a glass of milk, then the pen changes to yellow. Next, the phone becomes a metal box. The photo's lighting is then altered to show a night scene, followed by a daytime scene. — *a) A laptop, a cup of milk, a black pen, and a smartphone*

GIF shows real-time editing of various images. One part features a photo of a cartoon of a wooden board with tomatoes and grapes and carrots on the tablecloth. The tomatoes are transformed into grapes, near a knife. Next, the carrots are transformed into oranges on a green tablecloth.

Conclusion

Image inversion in diffusion models is key for applications like image editing, semantic augmentation, and generating rare-concept images. Current methods often sacrifice inversion quality for computational efficiency, requiring significantly more resources for high-quality results.

Regularized Newton-Raphson Inversion (RNRI) balances rapid convergence with superior accuracy, execution time, and memory efficiency. The RNRI method outperforms existing approaches in both latent diffusion and latent consistency models, enabling real-time image editing.

For more information, see the full paper, Regularized Newton Raphson Inversion for Text-to-Image Diffusion Models. You can also try RNRI yourself.