Multi-turn Consistent Image Editing

1Institute of Automation, Chinese Academy of Sciences 2Institute of Computing Technology, Chinese Academy of Sciences
*Corresponding author

Abstract

Many real-world applications, such as interactive photo retouching, artistic content creation, and product design, require flexible and iterative image editing. However, existing image editing methods primarily focus on achieving the desired modifications in a single step, and therefore often struggle with ambiguous user intent, complex transformations, or the need for progressive refinement. As a result, these methods frequently produce inconsistent outcomes or fail to meet user expectations. To address these challenges, we propose a multi-turn image editing framework that enables users to iteratively refine their edits, progressively achieving more satisfactory results. Our approach leverages flow matching for accurate image inversion and a dual-objective Linear Quadratic Regulator (LQR) for stable sampling, effectively mitigating error accumulation. Additionally, by analyzing the layer-wise roles of transformers, we introduce an adaptive attention highlighting method that enhances editability while preserving multi-turn coherence. Extensive experiments demonstrate that our framework significantly improves edit success rates and visual fidelity compared to existing methods.

Dual-objective LQR Integrated with a High-order ODE Solver

We visualize single-step and multi-round accumulated errors during inversion (←) and editing (↘) across different ReFlow-based editing methods.

(a) Vanilla ReFlow struggles with structure preservation during inversion due to the truncation error of the Euler method.

(b) While a second-order ODE solver reduces truncation error in a single step, the accumulated error over multiple editing rounds remains significant.

(c) Incorporating the source image as guidance (dotted ↙) via LQR improves performance in a single step but becomes less effective as accumulated errors increase with more steps.

(d) Our approach addresses this issue by integrating both techniques, leveraging a dual-objective LQR coupled with a high-order solver to enhance stability and accuracy.
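The truncation-error gap between a first-order and a second-order solver described above can be seen in a minimal sketch. The velocity field `v` below is a toy stand-in for the learned ReFlow vector field (the real field is a neural network); the Euler step mirrors vanilla ReFlow inversion, while the Heun predictor-corrector step is one common choice of second-order solver.

```python
import numpy as np

# Toy velocity field standing in for the learned ReFlow vector field:
# dx/dt = v(x, t). With v(x, t) = -x the exact solution is x(t) = x0 * exp(-t),
# so the integration error of each solver can be measured directly.
def v(x, t):
    return -x

def euler_integrate(x0, n_steps):
    # First-order Euler integration (vanilla ReFlow): O(dt) local error.
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        x = x + dt * v(x, i * dt)
    return x

def heun_integrate(x0, n_steps):
    # Second-order Heun predictor-corrector step: O(dt^2) local error.
    x, dt = x0, 1.0 / n_steps
    for i in range(n_steps):
        t = i * dt
        pred = x + dt * v(x, t)                          # Euler predictor
        x = x + 0.5 * dt * (v(x, t) + v(pred, t + dt))   # trapezoidal corrector
    return x

exact = 1.0 * np.exp(-1.0)
err_euler = abs(euler_integrate(1.0, 20) - exact)
err_heun = abs(heun_integrate(1.0, 20) - exact)
assert err_heun < err_euler  # the high-order solver accumulates less error
```

With 20 steps the Heun trajectory lands roughly two orders of magnitude closer to the exact endpoint than Euler, which is why panel (b) shows smaller single-step error; the residual error still compounds over repeated inversion-editing rounds, motivating the added LQR guidance.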


Pipeline

In each editing iteration, a high-accuracy rectified flow inversion maps the image back to the Gaussian noise space, followed by sampling to generate the edited images. To better constrain the distribution of edits across multiple turns, the original image and previous editing results serve as guidance during subsequent sampling. Additionally, a highlighted region in the attention mask further preserves the content structure of the edited outputs.
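The per-turn loop described above can be sketched as follows. The `invert` and `sample` functions are hypothetical stand-ins for the rectified-flow inversion and guided sampling stages (in the real pipeline these are ODE integrations through the trained model), and the blending weight `guidance` is an illustrative simplification of the dual-objective constraint toward the source image and the previous result.

```python
import numpy as np

rng = np.random.default_rng(0)

def invert(image):
    # Hypothetical stand-in: map the image back toward Gaussian noise space.
    return image + 0.1 * rng.standard_normal(image.shape)

def sample(latent, prompt):
    # Hypothetical stand-in: generate an edit from the latent and the prompt.
    return latent + 0.1 * rng.standard_normal(latent.shape)

def multi_turn_edit(source, prompts, guidance=0.3):
    """Run successive edits, pulling each result toward both the original
    image and the previous turn's output to limit error accumulation."""
    history, current = [source], source
    for prompt in prompts:
        latent = invert(current)
        edited = sample(latent, prompt)
        # Dual-objective guidance (illustrative): anchor the edit to the
        # source image and to the previous editing result.
        edited = (1 - guidance) * edited + guidance * 0.5 * (source + current)
        history.append(edited)
        current = edited
    return history

source = np.zeros((4, 4))
results = multi_turn_edit(source, ["+ hat", "[blue] hat", "+ sunglasses"])
```

The attention-mask highlighting step is omitted here; in the full pipeline it additionally constrains which regions the sampling stage may modify.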


Multi-turn Reconstruction

Reconstruction results over repeated inversion-sampling cycles for two examples ("vermeer" and "bird"), shown at turns #1, #2, #4, #6, #8, and #10.

Multi-turn Editing

- source → aged → dog→cat → cat with a blue collar → old woman in red dress
- source → + hat → [blue] hat → + sunglasses → + blue scarf
- source → bird→butterfly → + pink flower → + ladybug → reconstruction
- source → phone→coffee → + book → [red] skirt → book→a bunch of flowers

Display Demo

In the demo, users start by uploading one or more images. They select a base image and enter a text prompt to begin editing. After each editing round, users can choose a previously generated image from the history as the base image to continue the editing process.
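The session flow above can be sketched with a small history structure. All names here are illustrative, not the demo's actual implementation: `edit_fn` stands in for the editing backend, and the history simply records every upload and result so any entry can be re-selected as the base image.

```python
class EditSession:
    """Minimal sketch of the demo's session state (illustrative names)."""

    def __init__(self, uploads):
        self.history = list(uploads)   # uploaded images seed the history
        self.base = uploads[0]         # default base image for the first edit

    def select_base(self, index):
        # Pick any previous upload or editing result as the next base image.
        self.base = self.history[index]

    def edit(self, prompt, edit_fn):
        # edit_fn stands in for the multi-turn editing backend.
        result = edit_fn(self.base, prompt)
        self.history.append(result)
        return result

session = EditSession(["img_a"])
session.edit("add a hat", lambda img, p: f"{img} | {p}")
session.select_base(1)   # continue editing from the first result
```

Keeping the full history, rather than only the latest result, is what lets users branch: an unsatisfying turn can be abandoned by selecting an earlier image as the new base.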