Many real-world applications, such as interactive photo retouching, artistic content creation, and product design, require flexible and iterative image editing. However, existing image editing methods primarily focus on achieving the desired modifications in a single step, an approach that often struggles with ambiguous user intent, complex transformations, or the need for progressive refinement. As a result, these methods frequently produce inconsistent outcomes or fail to meet user expectations. To address these challenges, we propose a multi-turn image editing framework that enables users to iteratively refine their edits, progressively achieving more satisfactory results. Our approach leverages flow matching for accurate image inversion and a dual-objective Linear Quadratic Regulator (LQR) for stable sampling, effectively mitigating error accumulation. Additionally, by analyzing the roles of individual transformer layers, we introduce an adaptive attention highlighting method that enhances editability while preserving multi-turn coherence. Extensive experiments demonstrate that our framework significantly improves edit success rates and visual fidelity compared to existing methods.
We visualize the differences between single-step and multi-round accumulated errors during inversion (←) and editing (↘) across different ReFlow-based editing methods.
(a) Vanilla ReFlow struggles with structure preservation during inversion due to the truncation error of the Euler method.
(b) While a second-order ODE solver reduces the truncation error of a single step, the error accumulated over multiple editing rounds remains significant (contrast the Euler and Heun steps in the sketch following this caption).
(c) Incorporating the source image as guidance (dotted ↙) via LQR improves performance in a single step but becomes less effective as errors accumulate over successive editing rounds.
(d) Our approach addresses this issue by integrating both techniques, leveraging a dual-objective LQR coupled with a high-order solver to enhance stability and accuracy.
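To make the truncation-error gap between panels (a) and (b) concrete, the sketch below contrasts a first-order Euler inversion step with a second-order Heun step for a rectified-flow velocity field. This is a minimal illustration, not our implementation; the `velocity` callable and its `(x, t)` signature are assumptions.

```python
import torch

def invert_euler(x1, velocity, num_steps=50):
    """First-order Euler inversion (image -> noise): local error O(dt^2) per step."""
    dt = 1.0 / num_steps
    x = x1
    for i in reversed(range(num_steps)):
        t = torch.tensor((i + 1) * dt)   # integrate from t = 1 down to t = 0
        x = x - dt * velocity(x, t)      # one velocity evaluation per step
    return x

def invert_heun(x1, velocity, num_steps=50):
    """Second-order Heun inversion: local error O(dt^3) per step."""
    dt = 1.0 / num_steps
    x = x1
    for i in reversed(range(num_steps)):
        t = torch.tensor((i + 1) * dt)
        v1 = velocity(x, t)              # slope at the current point
        x_pred = x - dt * v1             # Euler predictor
        v2 = velocity(x_pred, t - dt)    # slope at the predicted point
        x = x - dt * 0.5 * (v1 + v2)     # averaged corrector step
    return x
```

The extra velocity evaluation per step is what drops the local error by one order, which is why panel (b) outperforms panel (a) within a single round even though errors still compound across rounds.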
In each editing iteration, a high-accuracy rectified flow inversion maps the image back to the Gaussian noise space, followed by sampling to generate the edited image. To better constrain the distribution of edits across multiple turns, the original image and the results of previous editing rounds serve as guidance during subsequent sampling. Additionally, a highlighted region in the attention mask further preserves the content structure of the edited outputs.
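As a simplified sketch of one such iteration, the following builds on the previous snippet (reusing `torch` and `invert_heun`); the proportional pull toward the guidance images and the binary `attn_mask` blend are illustrative stand-ins for the dual-objective LQR controller and the adaptive attention highlighting, respectively, not the paper's actual mechanisms.

```python
def edit_turn(base_image, prompt_embed, velocity, guidance_images,
              attn_mask, num_steps=50, guidance_weight=0.5):
    """One editing iteration (simplified): invert, sample with guidance, blend.

    guidance_images holds the original image plus previous turns' results;
    the proportional pull below merely illustrates guided sampling and is
    not the dual-objective LQR controller itself.
    """
    dt = 1.0 / num_steps

    # 1. High-accuracy inversion: map the base image back to Gaussian noise.
    x = invert_heun(base_image, lambda z, t: velocity(z, t, prompt_embed), num_steps)

    # 2. Guided sampling: integrate forward from noise to the edited image.
    for i in range(num_steps):
        t = torch.tensor(i * dt)
        v1 = velocity(x, t, prompt_embed)
        v2 = velocity(x + dt * v1, t + dt, prompt_embed)
        x = x + dt * 0.5 * (v1 + v2)                 # second-order step
        for g in guidance_images:
            x = x + guidance_weight * dt * (g - x)   # pull toward guidance

    # 3. Preserve structure outside the highlighted (editable) region.
    return attn_mask * x + (1 - attn_mask) * base_image
```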
In the demo, users start by uploading one or more images. They select a base image and enter a text prompt to begin editing. After each editing round, users can choose a previously generated image from the history as the base image to continue the editing process.
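A minimal sketch of the demo's session logic might look as follows; `EditSession` and `edit_fn` are hypothetical names, with `edit_fn` standing in for the full editing pipeline (e.g., a wrapper around `edit_turn` above).

```python
class EditSession:
    """Tracks uploaded images and per-round results so any entry can be re-edited."""

    def __init__(self, uploaded_images, edit_fn):
        self.history = list(uploaded_images)   # uploaded images seed the history
        self.edit_fn = edit_fn                 # pipeline: (base, prompt) -> image

    def edit(self, base_index, prompt):
        """Run one editing round on a chosen history entry and record the result."""
        base = self.history[base_index]
        result = self.edit_fn(base, prompt)
        self.history.append(result)
        return result
```

Keeping every round's output in `history` is what lets a user branch from any earlier result rather than being locked into a single linear sequence of edits.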