What it does, in plain words

A text prompt can describe the scene but it can't show the model the look. "Painterly fantasy battle map" means a hundred different things to a generator. IP-Adapter is the way to hand the model a small reference picture and say "make the next tile look like this one." The picture rides alongside the text prompt and steers brushwork, palette, and lighting — without dictating composition.

There's a knob, called the IP-Adapter scale, that controls how strongly the reference image influences the output. Turn it down and the model basically ignores the picture. Turn it up and the picture starts to dominate — including its content, not just its style. The interesting work is finding the right setting for each tier of the pipeline.

v1 vs v2 — what changed

The painter uses the v2 ("Plus") version. The original IP-Adapter compressed the reference image into a single feature vector before feeding it to the model — fine for rough style cues, lossy on details. v2 keeps several layers of the image representation and lets the model attend across all of them. In practice, that preserves brushwork and texture more faithfully without taking away the text prompt's control over what's actually in the scene.

Under the hood — the technical version

IP-Adapter (Ye et al. 2023) adds a parallel cross-attention path to a diffusion model so it can take an image as conditioning alongside (or instead of) a text prompt. The image goes through a CLIP image encoder; the resulting embedding is projected into the model's cross-attention key/value space, and the output of that image path is weighted by the IP-Adapter scale at every block.
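The parallel path described above can be sketched in a few lines of plain Python. This is a toy, single-head illustration, not the painter's code: the function names, the tiny hand-written matrices, and the choice to share the same queries between both paths are assumptions made for the example. The point it shows is that the image branch is added to the text branch and multiplied by the scale, so a scale of 0 reduces to ordinary text-only cross-attention.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention for one head, using nested lists."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out

def ip_adapter_cross_attention(queries, text_kv, image_kv, ip_scale):
    """Text cross-attention plus an image cross-attention path weighted by ip_scale."""
    text_out = attention(queries, *text_kv)
    image_out = attention(queries, *image_kv)
    return [[t + ip_scale * i for t, i in zip(t_row, i_row)]
            for t_row, i_row in zip(text_out, image_out)]

# Toy inputs (hypothetical values): two latent queries, two text tokens, one image token.
queries = [[1.0, 0.0], [0.0, 1.0]]
text_kv = ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 2.0], [3.0, 4.0]])   # (keys, values)
image_kv = ([[0.5, 0.5]], [[10.0, 20.0]])                        # (keys, values)

text_only = ip_adapter_cross_attention(queries, text_kv, image_kv, ip_scale=0.0)
blended = ip_adapter_cross_attention(queries, text_kv, image_kv, ip_scale=0.6)
```

Because the image contribution is a simple additive term, raising the scale does not remove the text signal; it only makes the image term larger relative to it.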

v2 (the "Plus" variant — ip-adapter-plus_sdxl_vit-h for SDXL, XLabs-AI/flux-ip-adapter-v2 for FLUX) replaces v1's single projection with a small Q-Former-style transformer that attends across multiple feature levels of the CLIP encoder. The SDXL draft tier loads h94/IP-Adapter Plus weights; the FLUX quality tier loads the XLabs v2 weights via diffusers' FluxIPAdapterMixin.
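For the SDXL tier, loading the Plus weights through diffusers might look roughly like the sketch below. It follows the loading pattern documented for the h94/IP-Adapter repository; the function name, the `base_model` parameter (the caller would pass whatever checkpoint id the draft tier actually uses), and the `.safetensors` file extension are assumptions, and this is not the painter's real loader. Heavy imports are kept inside the function so the module can be imported without a GPU stack installed.

```python
IP_ADAPTER_REPO = "h94/IP-Adapter"
SDXL_SUBFOLDER = "sdxl_models"
SDXL_WEIGHT_NAME = "ip-adapter-plus_sdxl_vit-h.safetensors"  # assumed file extension

def load_sdxl_draft_tier(base_model, scale=0.8, device="cuda"):
    """Build an SDXL text-to-image pipeline with IP-Adapter Plus attached.

    base_model is a caller-supplied checkpoint id (hypothetical parameter).
    """
    # Imported lazily: torch, diffusers, and transformers are only needed at call time.
    import torch
    from diffusers import AutoPipelineForText2Image
    from transformers import CLIPVisionModelWithProjection

    # The vit-h Plus weights expect the ViT-H image encoder shipped in the same repo.
    image_encoder = CLIPVisionModelWithProjection.from_pretrained(
        IP_ADAPTER_REPO, subfolder="models/image_encoder", torch_dtype=torch.float16
    )
    pipe = AutoPipelineForText2Image.from_pretrained(
        base_model, image_encoder=image_encoder, torch_dtype=torch.float16
    ).to(device)
    pipe.load_ip_adapter(
        IP_ADAPTER_REPO, subfolder=SDXL_SUBFOLDER, weight_name=SDXL_WEIGHT_NAME
    )
    pipe.set_ip_adapter_scale(scale)  # the dial discussed throughout this page
    return pipe
```

At generation time the reference picture is passed next to the text prompt, for example `pipe(prompt=..., ip_adapter_image=seed_image)`, where `seed_image` is a PIL image.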

Watch the dial

Pick a scale value below to see what happens to the generated tile when the seed reference (the warm cobbled strip on the left) is fed in at different strengths. Same prompt, same anchor, same model — only the dial changes.

[Interactive demo: seed reference (input) beside the generated tile (output); selected setting: scale = 0.6]

Sweet spot on FLUX. Style transfers cleanly: palette, brushwork, and lighting match the seed without copying its content.


Two different "optimal" scales for two different stacks

One of the more counter-intuitive findings in the project: the right IP-Adapter strength depends on which model it's plugged into.

SDXL stack — Stage 2

Scale 0.8 is optimal

On the SDXL draft tier (DreamShaper XL Lightning + IP-Adapter Plus), scale 0.8 gave the best palette consistency in Stage 2 testing — +38.6% ΔE improvement vs no adapter. Scale 0.6 was too weak; 1.0 overrode the prompt.
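To make the "ΔE improvement" figure concrete, here is a minimal sketch of how such a percentage could be computed. It assumes the CIE76 definition of ΔE (Euclidean distance in L\*a\*b\* space) and assumes the improvement is the relative drop in ΔE versus the no-adapter baseline; the page does not state which ΔE formula or aggregation the project uses, and the colour triples below are illustrative values, not project measurements.

```python
import math

def delta_e_cie76(lab_a, lab_b):
    """CIE76 colour difference: Euclidean distance between two (L*, a*, b*) triples."""
    return math.dist(lab_a, lab_b)

def relative_improvement(baseline_delta_e, adapter_delta_e):
    """Fraction by which the colour error shrank; 0.25 means ΔE is 25% lower."""
    return (baseline_delta_e - adapter_delta_e) / baseline_delta_e

# Illustrative mean tile colours in Lab space (hypothetical numbers).
reference_palette = (60.0, 10.0, 30.0)
tile_no_adapter = (60.0, 22.0, 46.0)
tile_with_adapter = (60.0, 19.0, 42.0)

baseline = delta_e_cie76(reference_palette, tile_no_adapter)       # 20.0
with_adapter = delta_e_cie76(reference_palette, tile_with_adapter) # 15.0
improvement = relative_improvement(baseline, with_adapter)         # 0.25
```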

FLUX stack — an earlier iteration

Scale 0.6 is the ceiling

On the FLUX quality tier (FLUX.1 Krea + XLabs IP-Adapter v2), scores peak at 0.6 and decline at higher scales. Iter-6 sweep ([0.6, 0.75, 0.9, 1.0]) confirmed the ceiling. CLIP encodes content too aggressively in the FLUX cross-attention path.

FLUX scale sweep (an earlier iteration, iter-6)

Scale       Judge score   Result
ipa=0.4     3.90          Too weak: style barely transfers
ipa=0.6     5.56          Sweet spot (FLUX): gate crossed
ipa=0.75    4.80          CLIP starts replicating content
ipa=0.9     4.20          Style image dominates the output
ipa=1.0     3.60          Prompt is overridden: wrong scene

Pass-B VLM judge scores. The 0.6 row is iter-5b (gate-crossing baseline); higher scales come from iter-6.
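The sweep numbers above are small enough to check by hand, but a short script makes the shape of the curve explicit. The scores are copied from the sweep; the variable names are illustrative only.

```python
# Pass-B VLM judge scores from the FLUX sweep, keyed by IP-Adapter scale.
FLUX_SWEEP = {0.4: 3.90, 0.6: 5.56, 0.75: 4.80, 0.9: 4.20, 1.0: 3.60}

# Scale with the highest judge score.
best_scale = max(FLUX_SWEEP, key=FLUX_SWEEP.get)

# How many points each higher scale loses relative to the peak.
drop_after_peak = {
    scale: round(FLUX_SWEEP[best_scale] - score, 2)
    for scale, score in sorted(FLUX_SWEEP.items())
    if scale > best_scale
}

# True when every step above the peak scores lower than the one before it.
above_peak = [FLUX_SWEEP[s] for s in sorted(FLUX_SWEEP) if s >= best_scale]
declines_monotonically = all(a > b for a, b in zip(above_peak, above_peak[1:]))
```

Here `best_scale` comes out as 0.6, and the scores fall at every step above it, which is what "ceiling" means in this context.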

Why the FLUX ceiling exists

At higher IP-Adapter strengths on FLUX, the reference picture stops acting as a style cue and starts acting as a content cue: the seed's actual subject matter bleeds into the output. The judge scores this as a style mismatch, although what is really happening is closer to the picture being copied wholesale. The underlying reason is that the image encoding doesn't cleanly separate "how this looks" from "what this is"; both ride in the same signal.

Practical guidance baked into the pipeline: cap FLUX IP-Adapter at 0.6 unless the seed image is intentionally generic (e.g. a small palette-only swatch). For SDXL, 0.8 is fine — its image conditioning is weaker, so a higher number gets the same net effect.
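One way to express that guidance in code is a small resolver that applies the per-tier default and the FLUX cap. This is a sketch under assumptions: the tier keys, the `TierLimits` dataclass, and the `seed_is_generic` flag are hypothetical names, and the real pipeline may enforce the rule differently. Only the numbers (0.8 for SDXL, 0.6 as default and cap for FLUX) come from the text above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class TierLimits:
    default_scale: float
    max_scale: Optional[float]  # None means no cap is stated for this tier

# Values taken from the guidance above; the dictionary keys are hypothetical.
TIER_LIMITS = {
    "sdxl": TierLimits(default_scale=0.8, max_scale=None),
    "flux": TierLimits(default_scale=0.6, max_scale=0.6),
}

def resolve_ip_scale(tier, requested=None, seed_is_generic=False):
    """Return the IP-Adapter scale to use for a tier, applying its cap if any."""
    limits = TIER_LIMITS[tier]
    scale = limits.default_scale if requested is None else requested
    if scale < 0:
        raise ValueError("IP-Adapter scale must be non-negative")
    # The cap is skipped for intentionally generic seeds such as palette-only swatches.
    if limits.max_scale is not None and not seed_is_generic:
        scale = min(scale, limits.max_scale)
    return scale
```

With this sketch, `resolve_ip_scale("flux", 0.9)` is clamped to 0.6, while `resolve_ip_scale("flux", 0.9, seed_is_generic=True)` passes 0.9 through.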

The bootstrapped-seeds dependency

IP-Adapter only works as well as the seed image you feed it. An earlier iteration found that seeds painted by a different generator (Gemini 3.x) caused pipeline-level style drift: the FLUX renderer couldn't reproduce Gemini's brushwork, so IP-Adapter fought itself. Switching to seeds generated by the same FLUX + IPA stack that paints the final tile (the bootstrapped seeds pattern) added +1.45 to the seam score with no other change. That's the largest single win documented in this batch timeline.
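The bootstrapped seeds pattern can be sketched as a two-pass call into the same painter. Everything here is hypothetical scaffolding: the `paint` callable, its `(prompt, reference, ip_scale)` signature, and the choice to render the seed with no reference are assumptions, since the page does not describe how the first seed is produced. The sketch only illustrates the idea that the reference image and the final tile come from one and the same renderer.

```python
from typing import Callable, Optional, Tuple

# Hypothetical painter signature: (prompt, reference_image, ip_scale) -> image
Painter = Callable[[str, Optional[object], float], object]

def paint_with_bootstrapped_seed(
    paint: Painter, seed_prompt: str, tile_prompt: str, ip_scale: float = 0.6
) -> Tuple[object, object]:
    """Render a seed with the painter itself, then reuse it as the style reference."""
    # Pass 1: the same stack produces the seed, so its brushwork is reproducible later.
    seed = paint(seed_prompt, None, 0.0)
    # Pass 2: the tile is conditioned on that same-pipeline seed.
    tile = paint(tile_prompt, seed, ip_scale)
    return seed, tile
```

Passing the painter in as a callable keeps the sketch independent of any particular backend, so a stub can stand in for the real renderer when testing.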

Where it lives in the painter

The FLUX backend tier loads XLabs-AI/flux-ip-adapter-v2 and runs at scale = 0.6 by default — the empirical ceiling above which CLIP starts to copy reference content rather than transfer style.
