ERNIE-Image Two-Pass Refinement Workflow: Dual-Stage Optimization and ComfyUI Multi-Model Pipeline

May 30, 2026

Community users on Reddit and ComfyUI forums have discovered a powerful workflow pattern: using ERNIE-Image Turbo for first-stage rapid generation, followed by another model (such as Z-Image Turbo or FLUX.2) for second-stage refinement. This article details the principles, practical methods, and ComfyUI implementation of this "two-pass refinement" workflow, helping creators break through single-model quality limits.

What Is a "Two-Pass Refinement" Workflow?

In AI image generation, "Two-Pass/Two-Stage Refinement" means using two different models collaboratively to complete a single generation task:

  1. First Stage (Draft/Composition): A fast model generates the base composition, layout, and text
  2. Second Stage (Refinement/Texture): A higher-quality model refines detail and texture

This workflow is inspired by traditional painting's "sketch → color" process. In the AI era, different models excel in different areas, and collaboration often outperforms a single model.

Why Two-Pass Refinement?

Single-Model Limitations

Each AI image generation model has unique strengths and weaknesses:

| Model | Strengths | Weaknesses |
|---|---|---|
| ERNIE-Image Turbo | Precise text rendering, strong composition, 8-step speed | Detail texture occasionally lacks refinement |
| Z-Image Turbo | Rich details, strong color expression | Weaker text rendering |
| FLUX.2 Pro | Very high image quality, rich LoRA ecosystem | Requires 12B params, higher VRAM |
| SD 3.5 | Mature ecosystem, rich community resources | Average text rendering and instruction following |

Community Discoveries

A Reddit r/comfyui user shared this finding in a popular post:

"I noticed a lot of issues with Ernie image and I decided to test run a few gens with a 2nd pass refinement of ZIT. Results were very good, the text rendering from Ernie + the detail from ZIT makes a powerful combo."

— r/comfyui, "Ernie Image Turbo + Z-Image Turbo 2 Pass Workflow"

Similar findings appeared across multiple community posts:

  • r/civitai: "The workflow I have uses both Z-Image, and then Z-Image turbo as a sort of refiner"
  • r/StableDiffusion: "ERNIE-Image for text + another model for refinement is a game changer"

ERNIE-Image Turbo + Z-Image Turbo Two-Stage Workflow

Workflow Overview

User Prompt Input
       │
       ▼
┌─────────────────────┐
│  ERNIE-Image Turbo   │  ← Stage 1
│  8 Inference Steps   │     - Fast generation
│  Composition + Text  │     - Precise text
│  1024×1024           │     - Output in ~8 seconds
└────────┬────────────┘
         │ (image output)
         ▼
┌─────────────────────┐
│   Z-Image Turbo      │  ← Stage 2
│   img2img Refine     │     - Enhanced details
│   denoise=0.3~0.5   │     - Improved texture
│   1024×1024          │     - Richer colors
└────────┬────────────┘
         │
         ▼
   Final Output Image

Key Parameter Settings

Stage 1 (ERNIE-Image Turbo):

# ERNIE-Image Turbo generation parameters
import torch
# Class and repo names are as given in this article; verify them against your installed diffusers version.
from diffusers import ErnieImageTurboPipeline

pipe_ernie = ErnieImageTurboPipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo", torch_dtype=torch.float16
)
pipe_ernie = pipe_ernie.to("cuda")

output_stage1 = pipe_ernie(
    prompt="your prompt here",
    height=1024,
    width=1024,
    num_inference_steps=8,  # Turbo needs only 8 steps
    guidance_scale=5.0,
).images[0]

Stage 2 (Z-Image Turbo img2img refinement):

# Z-Image Turbo img2img refinement
import torch
# Class name is as given in this article; verify it against your installed diffusers version.
from diffusers import ZImageTurboPipeline

pipe_zimage = ZImageTurboPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo", torch_dtype=torch.float16
)
pipe_zimage = pipe_zimage.to("cuda")

output_final = pipe_zimage(
    image=output_stage1,        # Stage 1 output as input
    prompt="your prompt here",  # Same prompt
    height=1024,
    width=1024,
    num_inference_steps=4,      # Turbo refinement needs fewer steps
    guidance_scale=0.0,         # Distilled Turbo models are typically run without CFG
    strength=0.35,              # Denoising strength (diffusers calls it `strength`); 0.3~0.5 preserves composition
).images[0]

Key parameter denoising_strength explained (diffusers img2img pipelines call it strength; ComfyUI's KSampler calls it denoise):

  • 0.2~0.3: Light refinement, mainly improves texture and color
  • 0.35~0.45: Medium refinement, improves details while preserving composition
  • 0.5+: Heavy refinement, may alter composition, generally not recommended
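Strength also changes how much sampling work Stage 2 does. Several img2img implementations skip the early part of the noise schedule, so only a fraction of the requested steps actually runs. The sketch below reproduces that arithmetic, assuming the truncation rule used by flow-matching img2img pipelines in diffusers. Other implementations may count steps differently, and effective_steps is a hypothetical helper, not a library function.

```python
def effective_steps(num_inference_steps: int, strength: float) -> int:
    """Steps that actually run when img2img starts partway through the schedule."""
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be between 0 and 1")
    # Portion of the schedule that is re-noised and then denoised again.
    init_timestep = min(num_inference_steps * strength, num_inference_steps)
    # Index of the first step that is kept; earlier steps are skipped.
    t_start = int(max(num_inference_steps - init_timestep, 0))
    return num_inference_steps - t_start


for steps, strength in [(4, 0.35), (8, 0.35), (8, 0.5)]:
    print(f"steps={steps} strength={strength} -> runs {effective_steps(steps, strength)}")
```

Under this rule, 4 steps at strength 0.35 leaves only about two refinement steps, so if the refinement looks too weak it may be worth raising the step count before raising the strength.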

Comparison Results

| Dimension | Single Model (ERNIE-Image) | Dual Model (ERNIE + Z-Image) |
|---|---|---|
| Text Rendering | ✅ Excellent | ✅ Excellent (preserved) |
| Composition | ✅ Excellent | ✅ Excellent (preserved) |
| Skin Texture | ⚠️ Average | ✅ Noticeably improved |
| Color Depth | ⚠️ Average | ✅ Richer |
| Detail Sharpness | ⚠️ Average | ✅ Sharper |
| Generation Time | ~8 seconds | ~12 seconds |

ComfyUI Multi-Model Pipeline Implementation

ComfyUI Workflow Structure

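Implementing the two-pass refinement workflow in ComfyUI (API-format JSON). Checkpoint filenames are placeholders, and models shipped as separate diffusion-model, text-encoder, and VAE files need the matching loader nodes in place of CheckpointLoaderSimple. Stage 1's decoded image feeds Stage 2's VAEEncode directly, each stage encodes the prompt with its own text encoder, and Stage 2 uses cfg 1.0 because that is the value at which ComfyUI's KSampler applies no classifier-free guidance:

{
  "1": {
    "class_type": "CheckpointLoaderSimple",
    "inputs": { "ckpt_name": "ernie-image-turbo.safetensors" }
  },
  "2": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "your prompt here",
      "clip": ["1", 1]
    }
  },
  "2_neg": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "",
      "clip": ["1", 1]
    }
  },
  "3": {
    "class_type": "EmptyLatentImage",
    "inputs": { "width": 1024, "height": 1024, "batch_size": 1 }
  },
  "4": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["1", 0],
      "positive": ["2", 0],
      "negative": ["2_neg", 0],
      "latent_image": ["3", 0],
      "seed": 42,
      "steps": 8,
      "cfg": 5.0,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 1.0
    }
  },
  "5": {
    "class_type": "VAEDecode",
    "inputs": { "samples": ["4", 0], "vae": ["1", 2] }
  },
  "10": {
    "class_type": "CheckpointLoaderSimple",
    "inputs": { "ckpt_name": "z-image-turbo.safetensors" }
  },
  "11": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "your prompt here",
      "clip": ["10", 1]
    }
  },
  "11_neg": {
    "class_type": "CLIPTextEncode",
    "inputs": {
      "text": "",
      "clip": ["10", 1]
    }
  },
  "12": {
    "class_type": "VAEEncode",
    "inputs": { "pixels": ["5", 0], "vae": ["10", 2] }
  },
  "13": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["10", 0],
      "positive": ["11", 0],
      "negative": ["11_neg", 0],
      "latent_image": ["12", 0],
      "seed": 42,
      "steps": 4,
      "cfg": 1.0,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 0.35
    }
  },
  "14": {
    "class_type": "VAEDecode",
    "inputs": { "samples": ["13", 0], "vae": ["10", 2] }
  },
  "15": {
    "class_type": "SaveImage",
    "inputs": { "images": ["14", 0], "filename_prefix": "two_pass" }
  }
}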
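A dangling link (a node input that points at a node ID missing from the JSON) is an easy mistake in hand-edited API-format workflows. The following sketch checks for that before a workflow is queued. find_dangling_links is a hypothetical helper, and it assumes the convention that a link is a two-element list of [node_id, output_index].

```python
def find_dangling_links(workflow: dict) -> list[tuple[str, str, str]]:
    """Return (node_id, input_name, missing_target) for links to undefined nodes."""
    dangling = []
    for node_id, node in workflow.items():
        if not isinstance(node, dict):
            continue  # skip anything that is not a node definition
        for input_name, value in node.get("inputs", {}).items():
            is_link = (
                isinstance(value, list)
                and len(value) == 2
                and isinstance(value[0], str)
                and isinstance(value[1], int)
            )
            if is_link and value[0] not in workflow:
                dangling.append((node_id, input_name, value[0]))
    return dangling


# Tiny example: the sampler references a negative-prompt node that was never defined.
sample = {
    "1": {"class_type": "CheckpointLoaderSimple", "inputs": {"ckpt_name": "model.safetensors"}},
    "2": {"class_type": "CLIPTextEncode", "inputs": {"text": "a cat", "clip": ["1", 1]}},
    "4": {
        "class_type": "KSampler",
        "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["2_neg", 0]},
    },
}
print(find_dangling_links(sample))
```

Once the list comes back empty, the workflow can be queued, for example by POSTing {"prompt": workflow} to the /prompt endpoint of a running ComfyUI server.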

Optimizing with Pixaroma Nodes

Pixaroma provides custom nodes optimized for ERNIE-Image:

  • Pixaroma Note Node: Rich text prompt editing with highlights, lists, code blocks
  • Pixaroma Resolution Node: Quick selection of common resolutions (1024×1024, 1024×1536, etc.)
  • ErnieImage Pipeline Node: One-click loading of complete ERNIE-Image pipeline (including PE)

Installation:

cd ComfyUI/custom_nodes
git clone https://github.com/pixaroma/ComfyUI-Pixaroma
cd ComfyUI-Pixaroma
pip install -r requirements.txt

Advanced: Three-Stage Workflow and ControlNet Integration

Three-Stage Workflow

For scenarios requiring extremely high image quality, expand to three stages:

Stage 1: ERNIE-Image Turbo → Composition + Text (8 steps, ~8 sec)
Stage 2: Z-Image Turbo img2img → Detail Enhancement (4 steps, ~4 sec)
Stage 3: FLUX.2 Pro Upscale → Super-resolution + Final Polish (20 steps, ~20 sec)
Total: ~32 sec, in exchange for higher quality than single-model direct generation

ControlNet Integration

ERNIE-Image's ControlNet support can provide precise composition control in Stage 1:

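# ControlNet + ERNIE-Image Turbo Stage 1
import cv2
import numpy as np
import torch
from PIL import Image
# Pipeline class and repo names are as given in this article; verify them against your diffusers version.
from diffusers import ControlNetModel, ErnieImageControlNetPipeline

pipe_cn = ErnieImageControlNetPipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    controlnet=ControlNetModel.from_pretrained("baidu/ERNIE-Image-ControlNet"),
    torch_dtype=torch.float16,
).to("cuda")

def apply_canny(image, low_threshold=100, high_threshold=200):
    """Return a 3-channel Canny edge map of a PIL image."""
    edges = cv2.Canny(np.array(image.convert("RGB")), low_threshold, high_threshold)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

# Use Canny edge detection to control composition
reference_image = Image.open("reference.png")  # placeholder path
canny_condition = apply_canny(image=reference_image, low_threshold=100, high_threshold=200)

output_stage1 = pipe_cn(
    prompt="your prompt",
    image=canny_condition,
    controlnet_conditioning_scale=0.8,
    num_inference_steps=8,
    guidance_scale=5.0,
).images[0]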

IP-Adapter Character Consistency

Combining IP-Adapter for character consistency + two-pass refinement:

Stage 1: ERNIE-Image + IP-Adapter → Character-consistent base image
Stage 2: Z-Image Turbo img2img → Quality enhancement

This combination is ideal for series content creation requiring character consistency, such as comic panels and character design sheets.

Practical Application Scenarios

E-Commerce Product Photography

Prompt: "Professional product photography of a ceramic coffee cup on marble
         table, soft natural lighting, 8K, studio quality"

Stage 1 (ERNIE): Product layout and lighting
Stage 2 (Z-Image): Enhanced product details and material texture
Result: E-commerce-ready product images

Poster Design

Prompt: "Summer music festival poster, 'NEON DREAMS 2026', vibrant colors,
         retro synthwave style, electric guitar silhouette"

Stage 1 (ERNIE): Precise text rendering + poster layout
Stage 2 (Z-Image): Enhanced color saturation and visual impact
Result: Clear text + vibrant poster design

Character Design

Prompt: "Fantasy character design, elven warrior with silver armor,
         forest background, detailed face, cinematic lighting"

Stage 1 (ERNIE + IP-Adapter): Character-consistent base image
Stage 2 (Z-Image): Enhanced skin texture and armor details
Result: Game/anime-ready character design sheets

Performance and Cost Analysis

VRAM Requirements

| Configuration | Peak VRAM | Notes |
|---|---|---|
| ERNIE-Image Turbo alone | ~12 GB | In BF16 precision |
| Dual model sequential | ~14 GB | Release Stage 1 before loading Stage 2 |
| Dual model parallel | ~24 GB | Both models in VRAM simultaneously |
| NVFP4 quantization | ~5 GB | For 24GB VRAM users |
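The sequential configuration depends on actually freeing Stage 1 before Stage 2 is loaded. Below is a minimal sketch of that pattern, assuming each stage is wrapped in a loader callable and a run callable. run_sequential, load, and run are hypothetical names for illustration, and the CUDA cache call is only attempted if PyTorch is installed.

```python
import gc


def release_gpu_memory() -> None:
    """Collect garbage and, if PyTorch with CUDA is available, empty its cache."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_sequential(stages, initial_input):
    """Run (load, run) stage pairs one at a time so only one model is resident."""
    result = initial_input
    for load, run in stages:
        model = load()
        try:
            result = run(model, result)
        finally:
            del model  # drop the only reference before the next stage loads
            release_gpu_memory()
    return result


# Dummy stages that only record the order of operations.
log = []

def make_stage(name):
    def load():
        log.append(f"load {name}")
        return name

    def run(model, x):
        log.append(f"run {name}")
        return f"{x}->{model}"

    return load, run

final = run_sequential([make_stage("ernie"), make_stage("zimage")], "prompt")
print(final)
print(log)
```

In a real pipeline, load would call from_pretrained and move the model to the GPU, and run would call the pipeline. The point of the pattern is that the second load only happens after the first model's reference has been dropped.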

Time Cost

| Method | Total Time | Quality |
|---|---|---|
| ERNIE-Image Turbo alone | ~8 sec | ★★★★☆ |
| ERNIE + Z-Image two-pass | ~12 sec | ★★★★★ |
| ERNIE Base 50 steps | ~30 sec | ★★★★☆ |
| FLUX.2 Pro 20 steps | ~25 sec | ★★★★☆ |

Value Assessment

Whether the extra ~4 seconds of a two-pass workflow is worth the quality gain depends on the use case:

  • E-commerce product photos: Worth it (quality directly affects conversion)
  • Social media images: Situational (Instagram compression reduces visible differences)
  • Posters/Print: Strongly recommended (large output requires high quality)
  • Rapid prototyping: Not worth it (speed prioritized)
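For batch work, the per-image overhead is easier to judge as throughput. Here is a quick calculation using the approximate timings quoted in this section (~8 seconds for a single pass, ~12 seconds for two passes); real numbers will vary with hardware and settings.

```python
def images_per_hour(seconds_per_image: float) -> float:
    """Throughput for back-to-back generation with no idle time."""
    return 3600 / seconds_per_image


single_pass_sec = 8.0   # approximate figure from the article
two_pass_sec = 12.0     # approximate figure from the article

overhead = (two_pass_sec - single_pass_sec) / single_pass_sec
print(f"Relative overhead: {overhead:.0%}")
print(f"Single pass: {images_per_hour(single_pass_sec):.0f} images/hour")
print(f"Two pass:    {images_per_hour(two_pass_sec):.0f} images/hour")
```

Seen this way, the two-pass workflow costs about a third of the hourly output, which is why it suits final deliverables better than rapid prototyping.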

FAQ

Q1: Will Stage 2 always improve quality?

Not necessarily. If Stage 1's composition and text are already satisfactory, an overly high denoising_strength in Stage 2 may degrade them. Start testing at 0.25 and increase gradually.
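One way to run that test is to sweep a few strengths over the same Stage 1 image and compare the results side by side. In this small sketch, refine is a hypothetical callable standing in for whatever Stage 2 call is used.

```python
def strength_sweep(start: float = 0.25, stop: float = 0.5, step: float = 0.05) -> list[float]:
    """Return strengths from start to stop inclusive, rounded to two decimals."""
    count = int(round((stop - start) / step)) + 1
    return [round(start + i * step, 2) for i in range(count)]


def sweep(refine, image, strengths):
    """Apply refine(image, strength) for each strength and key results by strength."""
    return {s: refine(image, s) for s in strengths}


# Stand-in refine function so the sketch runs without any model loaded.
results = sweep(lambda img, s: f"{img}@{s}", "stage1.png", strength_sweep())
for strength, output in results.items():
    print(strength, output)
```

Keeping the seed and prompt fixed across the sweep makes the strength the only variable, so any text distortion can be traced to a specific value.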

Q2: Can I use FLUX.2 instead of Z-Image as the refinement model?

Yes. FLUX.2 Pro has higher image quality but requires more VRAM (~16 GB). Recommended for RTX 4090/5090 users.

Q3: Does the two-pass workflow affect text rendering?

A properly set denoising_strength (0.3~0.4) won't significantly affect text. Values above 0.5 may distort it.

Q4: Can this be done in a browser?

Yes. HuggingFace Spaces and Replicate support multi-model pipelines, but local ComfyUI deployment remains the most flexible option.

Summary

Two-pass refinement workflows represent a new paradigm in AI image generation: instead of pursuing a single model's versatility, collaborate across models to leverage individual strengths.

ERNIE-Image Turbo's text rendering and composition abilities, combined with Z-Image Turbo's or FLUX.2 Pro's quality refinement, can produce final results that exceed any single model. As the ComfyUI ecosystem matures and custom nodes become more abundant, this workflow will become increasingly popular.

For professional creators, mastering multi-model collaboration pipelines is already an essential skill in 2026.


ERNIE-Image Team