ERNIE-Image Two-Pass Refinement Workflow: Dual-Stage Optimization and ComfyUI Multi-Model Pipeline
Community users on Reddit and ComfyUI forums have discovered a powerful workflow pattern: using ERNIE-Image Turbo for first-stage rapid generation, followed by another model (such as Z-Image Turbo or FLUX.2) for second-stage refinement. This article details the principles, practical methods, and ComfyUI implementation of this "two-pass refinement" workflow, helping creators break through single-model quality limits.
What Is a "Two-Pass Refinement" Workflow?
In AI image generation, "Two-Pass/Two-Stage Refinement" means using two different models collaboratively to complete a single generation task:
- First Stage (Draft/Composition): A fast model generates the base composition, layout, and text
- Second Stage (Refinement/Texture): A high-quality model optimizes details and enhances texture
This workflow mirrors the "sketch → color" process of traditional painting. Because different models excel in different areas, combining them can outperform any single model.
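The two stages above can be read as plain function composition. The following is a minimal sketch, not any library's API: `TwoPass`, `stub_draft`, and `stub_refine` are hypothetical names, and the stubs stand in for real model calls so the structure can be run without a GPU.

```python
from dataclasses import dataclass
from typing import Callable, Generic, TypeVar

Img = TypeVar("Img")


@dataclass
class TwoPass(Generic[Img]):
    """Compose a draft model and a refinement model into one callable."""

    draft: Callable[[str], Img]               # Stage 1: prompt -> base image
    refine: Callable[[Img, str, float], Img]  # Stage 2: (image, prompt, denoise) -> image

    def __call__(self, prompt: str, denoise: float = 0.35) -> Img:
        if not 0.0 < denoise <= 1.0:
            raise ValueError("denoise must be in (0, 1]")
        base = self.draft(prompt)                  # composition, layout, text
        return self.refine(base, prompt, denoise)  # detail and texture


# Stub stages (assumptions for illustration only) that record which passes ran.
def stub_draft(prompt: str) -> dict:
    return {"prompt": prompt, "passes": ["draft"]}


def stub_refine(image: dict, prompt: str, denoise: float) -> dict:
    return {**image, "passes": image["passes"] + [f"refine@{denoise}"]}


pipeline = TwoPass(draft=stub_draft, refine=stub_refine)
result = pipeline("poster with the text 'NEON DREAMS 2026'")
print(result["passes"])  # ['draft', 'refine@0.35']
```

Keeping each stage behind a callable makes it easy to swap the refiner (for example Z-Image Turbo for FLUX.2) without touching the first stage.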
Why Two-Pass Refinement?
Single-Model Limitations
Each AI image generation model has unique strengths and weaknesses:
| Model | Strengths | Weaknesses |
|---|---|---|
| ERNIE-Image Turbo | Precise text rendering, strong composition, 8-step speed | Fine texture detail is sometimes lacking |
| Z-Image Turbo | Rich details, strong color expression | Weaker text rendering |
| FLUX.2 Pro | Very high image quality, rich LoRA ecosystem | Large model (12B parameters), higher VRAM requirements |
| SD 3.5 | Mature ecosystem, rich community resources | Average text rendering and instruction following |
Community Discoveries
A Reddit r/comfyui user shared this finding in a popular post:
"I noticed a lot of issues with Ernie image and I decided to test run a few gens with a 2nd pass refinement of ZIT. Results were very good, the text rendering from Ernie + the detail from ZIT makes a powerful combo."
— r/comfyui, "Ernie Image Turbo + Z-Image Turbo 2 Pass Workflow"
Similar findings appeared across multiple community posts:
- r/civitai: "The workflow I have uses both Z-Image, and then Z-Image turbo as a sort of refiner"
- r/StableDiffusion: "ERNIE-Image for text + another model for refinement is a game changer"
ERNIE-Image Turbo + Z-Image Turbo Two-Stage Workflow
Workflow Overview
 User Prompt Input
         │
         ▼
┌─────────────────────┐
│ ERNIE-Image Turbo   │ ← Stage 1
│ 8 Inference Steps   │   - Fast generation
│ Composition + Text  │   - Precise text
│ 1024×1024           │   - Output in ~8 seconds
└────────┬────────────┘
         │ (image output)
         ▼
┌─────────────────────┐
│ Z-Image Turbo       │ ← Stage 2
│ img2img Refine      │   - Enhanced details
│ denoise=0.3~0.5     │   - Improved texture
│ 1024×1024           │   - Richer colors
└────────┬────────────┘
         │
         ▼
 Final Output Image
Key Parameter Settings
Stage 1 (ERNIE-Image Turbo):
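# ERNIE-Image Turbo generation parameters
# Note: the pipeline class name is the one used in this article; confirm it against your installed diffusers version.
import torch
from diffusers import ErnieImageTurboPipeline

pipe_ernie = ErnieImageTurboPipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo", torch_dtype=torch.float16
)
pipe_ernie = pipe_ernie.to("cuda")

output_stage1 = pipe_ernie(
    prompt="your prompt here",
    height=1024,
    width=1024,
    num_inference_steps=8,  # Turbo needs only 8 steps
    guidance_scale=5.0,
).images[0]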
Stage 2 (Z-Image Turbo img2img refinement):
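# Z-Image Turbo img2img refinement
# Note: the class name below is the one used in this article; check the diffusers
# documentation for the Z-Image img2img pipeline class available in your version.
import torch
from diffusers import ZImageTurboPipeline

pipe_zimage = ZImageTurboPipeline.from_pretrained(
    "Tongyi-MAI/Z-Image-Turbo",  # Z-Image is published by Tongyi-MAI, not Stability AI
    torch_dtype=torch.float16,
)
pipe_zimage = pipe_zimage.to("cuda")

output_final = pipe_zimage(
    image=output_stage1,        # Stage 1 output as input
    prompt="your prompt here",  # Same prompt
    height=1024,
    width=1024,
    num_inference_steps=4,      # Turbo refinement needs fewer steps
    guidance_scale=0.0,         # Distilled Turbo models run without CFG
    strength=0.35,              # Denoising strength (diffusers img2img argument name); 0.3~0.5 preserves composition
).images[0]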
The key parameter is the denoising strength (`strength` in diffusers img2img pipelines, `denoise` in ComfyUI's KSampler):
- 0.2~0.3: Light refinement; mainly improves texture and color
- 0.35~0.45: Medium refinement; improves details while preserving composition
- 0.5+: Heavy refinement; may alter composition and distort text, so generally not recommended
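Denoising strength also interacts with the step count, and the two toolchains handle it differently. The sketch below reproduces the arithmetic under two stated assumptions: the diffusers side follows the convention of `StableDiffusionImg2ImgPipeline.get_timesteps` (other pipelines may round differently), and the ComfyUI side follows how KSampler builds a longer schedule and runs only its tail. The function names are hypothetical.

```python
def diffusers_sd_img2img_steps(num_inference_steps: int, strength: float) -> int:
    """Denoising steps actually executed under the Stable Diffusion img2img convention.

    Only the last `strength` fraction of the schedule is run.
    """
    return min(int(num_inference_steps * strength), num_inference_steps)


def comfyui_ksampler_schedule(steps: int, denoise: float) -> tuple[int, int]:
    """Return (full schedule length, steps executed) for a ComfyUI KSampler.

    With denoise < 1, the schedule is stretched to int(steps / denoise) and
    the final `steps` entries are executed, so all requested steps still run.
    """
    if denoise >= 1.0:
        return steps, steps
    return int(steps / denoise), steps


print(diffusers_sd_img2img_steps(4, 0.35))  # 1
print(diffusers_sd_img2img_steps(8, 0.35))  # 2
print(comfyui_ksampler_schedule(4, 0.35))   # (11, 4)
```

Under these assumptions, 4 steps at strength 0.35 leaves a single denoising step in a diffusers-style pipeline, so it can be worth raising `num_inference_steps` there, while the same settings in ComfyUI still execute 4 steps.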
Comparison Results
| Dimension | Single Model (ERNIE-Image) | Dual Model (ERNIE + Z-Image) |
|---|---|---|
| Text Rendering | ✅ Excellent | ✅ Excellent (preserved) |
| Composition | ✅ Excellent | ✅ Excellent (preserved) |
| Skin Texture | ⚠️ Average | ✅ Noticeably improved |
| Color Depth | ⚠️ Average | ✅ Richer |
| Detail Sharpness | ⚠️ Average | ✅ Sharper |
| Generation Time | ~8 seconds | ~12 seconds |
ComfyUI Multi-Model Pipeline Implementation
ComfyUI Workflow Structure
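The following ComfyUI API-format graph implements the two-pass workflow in a single run. Nodes 1 to 5 are Stage 1 and nodes 10 to 15 are Stage 2, which re-encodes the prompt with its own text encoder and takes the Stage 1 image directly from node 5. It assumes both models are available as all-in-one checkpoints that CheckpointLoaderSimple can load (the file names are placeholders); if your files are split into diffusion model, text encoder, and VAE, substitute the matching loader nodes. Note that ComfyUI disables classifier-free guidance at cfg 1.0, not 0.0:
{
  "1": {
    "class_type": "CheckpointLoaderSimple",
    "inputs": { "ckpt_name": "ernie-image-turbo.safetensors" }
  },
  "2": {
    "class_type": "CLIPTextEncode",
    "inputs": { "text": "your prompt here", "clip": ["1", 1] }
  },
  "2_neg": {
    "class_type": "CLIPTextEncode",
    "inputs": { "text": "", "clip": ["1", 1] }
  },
  "3": {
    "class_type": "EmptyLatentImage",
    "inputs": { "width": 1024, "height": 1024, "batch_size": 1 }
  },
  "4": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["1", 0],
      "positive": ["2", 0],
      "negative": ["2_neg", 0],
      "latent_image": ["3", 0],
      "seed": 42,
      "steps": 8,
      "cfg": 5.0,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 1.0
    }
  },
  "5": {
    "class_type": "VAEDecode",
    "inputs": { "samples": ["4", 0], "vae": ["1", 2] }
  },
  "10": {
    "class_type": "CheckpointLoaderSimple",
    "inputs": { "ckpt_name": "z-image-turbo.safetensors" }
  },
  "11": {
    "class_type": "CLIPTextEncode",
    "inputs": { "text": "your prompt here", "clip": ["10", 1] }
  },
  "11_neg": {
    "class_type": "CLIPTextEncode",
    "inputs": { "text": "", "clip": ["10", 1] }
  },
  "12": {
    "class_type": "VAEEncode",
    "inputs": { "pixels": ["5", 0], "vae": ["10", 2] }
  },
  "13": {
    "class_type": "KSampler",
    "inputs": {
      "model": ["10", 0],
      "positive": ["11", 0],
      "negative": ["11_neg", 0],
      "latent_image": ["12", 0],
      "seed": 42,
      "steps": 4,
      "cfg": 1.0,
      "sampler_name": "euler",
      "scheduler": "normal",
      "denoise": 0.35
    }
  },
  "14": {
    "class_type": "VAEDecode",
    "inputs": { "samples": ["13", 0], "vae": ["10", 2] }
  },
  "15": {
    "class_type": "SaveImage",
    "inputs": { "images": ["14", 0], "filename_prefix": "two_pass_refined" }
  }
}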
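Hand-edited API-format graphs break easily when a link points at a node id that does not exist. As a small sanity check before queueing a graph, the sketch below scans every input for `[node_id, output_index]` links and reports the ones that do not resolve. `find_dangling_links` is a hypothetical helper, not part of ComfyUI, and the example graph is deliberately incomplete.

```python
from typing import Any


def is_link(value: Any) -> bool:
    """API-format links look like ["<node id>", <output index>]."""
    return (
        isinstance(value, list)
        and len(value) == 2
        and isinstance(value[0], str)
        and isinstance(value[1], int)
    )


def find_dangling_links(graph: dict[str, Any]) -> list[tuple[str, str, str]]:
    """Return (node_id, input_name, missing_target) for each unresolved link."""
    problems = []
    for node_id, node in graph.items():
        if not isinstance(node, dict):
            # A bare string or number where a node object is expected
            problems.append((node_id, "<node>", "<not an object>"))
            continue
        for input_name, value in node.get("inputs", {}).items():
            if is_link(value) and value[0] not in graph:
                problems.append((node_id, input_name, value[0]))
    return problems


# Deliberately broken example: the negative prompt node was never defined.
example_graph = {
    "1": {"class_type": "CheckpointLoaderSimple",
          "inputs": {"ckpt_name": "ernie-image-turbo.safetensors"}},
    "2": {"class_type": "CLIPTextEncode",
          "inputs": {"text": "your prompt here", "clip": ["1", 1]}},
    "4": {"class_type": "KSampler",
          "inputs": {"model": ["1", 0], "positive": ["2", 0],
                     "negative": ["2_neg", 0], "steps": 8}},
}

print(find_dangling_links(example_graph))  # [('4', 'negative', '2_neg')]
```

This only checks that link targets exist; it does not verify output indices or required inputs, which ComfyUI validates itself when the graph is queued.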
Optimizing with Pixaroma Nodes
Pixaroma provides custom nodes optimized for ERNIE-Image:
- Pixaroma Note Node: Rich text prompt editing with highlights, lists, code blocks
- Pixaroma Resolution Node: Quick selection of common resolutions (1024×1024, 1024×1536, etc.)
- ErnieImage Pipeline Node: One-click loading of complete ERNIE-Image pipeline (including PE)
Installation:
cd ComfyUI/custom_nodes
git clone https://github.com/pixaroma/ComfyUI-Pixaroma
cd ComfyUI-Pixaroma
pip install -r requirements.txt
Advanced: Three-Stage Workflow and ControlNet Integration
Three-Stage Workflow
For scenarios requiring extremely high image quality, expand to three stages:
Stage 1: ERNIE-Image Turbo → Composition + Text (8 steps, ~8 sec)
Stage 2: Z-Image Turbo img2img → Detail Enhancement (4 steps, ~3 sec)
Stage 3: FLUX.2 Pro Upscale → Super-resolution + Final Polish (20 steps, ~20 sec)
Total: ~31 sec, for higher quality and resolution than direct single-model generation
ControlNet Integration
ERNIE-Image's ControlNet support can provide precise composition control in Stage 1:
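# ControlNet + ERNIE-Image Turbo Stage 1
# Class and repository names are as given in this article; verify them against your diffusers version.
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, ErnieImageControlNetPipeline

def apply_canny(image, low_threshold, high_threshold):
    # Helper for this example: build a 3-channel Canny edge map from a PIL image
    gray = cv2.cvtColor(np.array(image.convert("RGB")), cv2.COLOR_RGB2GRAY)
    edges = cv2.Canny(gray, low_threshold, high_threshold)
    return Image.fromarray(np.stack([edges] * 3, axis=-1))

pipe_cn = ErnieImageControlNetPipeline.from_pretrained(
    "baidu/ERNIE-Image-Turbo",
    controlnet=ControlNetModel.from_pretrained("baidu/ERNIE-Image-ControlNet", torch_dtype=torch.float16),
    torch_dtype=torch.float16,
).to("cuda")

# Use Canny edge detection to control composition
reference_image = Image.open("reference.png")  # placeholder path to your reference image
canny_condition = apply_canny(image=reference_image, low_threshold=100, high_threshold=200)
output_stage1 = pipe_cn(
    prompt="your prompt",
    image=canny_condition,
    controlnet_conditioning_scale=0.8,
    num_inference_steps=8,
    guidance_scale=5.0,
).images[0]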
IP-Adapter Character Consistency
IP-Adapter for character consistency can be combined with two-pass refinement:
Stage 1: ERNIE-Image + IP-Adapter → Character-consistent base image
Stage 2: Z-Image Turbo img2img → Quality enhancement
This combination is ideal for series content creation requiring character consistency, such as comic panels and character design sheets.
Practical Application Scenarios
E-Commerce Product Photography
Prompt: "Professional product photography of a ceramic coffee cup on marble
table, soft natural lighting, 8K, studio quality"
Stage 1 (ERNIE): Product layout and lighting
Stage 2 (Z-Image): Enhanced product details and material texture
Result: E-commerce-ready product images
Poster Design
Prompt: "Summer music festival poster, 'NEON DREAMS 2026', vibrant colors,
retro synthwave style, electric guitar silhouette"
Stage 1 (ERNIE): Precise text rendering + poster layout
Stage 2 (Z-Image): Enhanced color saturation and visual impact
Result: Clear text + vibrant poster design
Character Design
Prompt: "Fantasy character design, elven warrior with silver armor,
forest background, detailed face, cinematic lighting"
Stage 1 (ERNIE + IP-Adapter): Character-consistent base image
Stage 2 (Z-Image): Enhanced skin texture and armor details
Result: Game/anime-ready character design sheets
Performance and Cost Analysis
VRAM Requirements
| Configuration | Peak VRAM | Notes |
|---|---|---|
| ERNIE-Image Turbo alone | ~12 GB | With BF16 weights |
| Dual model sequential | ~14 GB | Release Stage 1 before loading Stage 2 |
| Dual model parallel | ~24 GB | Both models in VRAM simultaneously |
| NVFP4 quantization | ~5 GB | For GPUs with limited VRAM |
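The "dual model sequential" row depends on Stage 1 being fully released before Stage 2 loads. A minimal sketch of that pattern follows. `run_stage` and `FakePipeline` are hypothetical names introduced here; the fake pipeline only lets the sketch run without a GPU, and the CUDA cache is cleared only if torch is installed and a GPU is present.

```python
import gc
from typing import Callable, TypeVar

P = TypeVar("P")
R = TypeVar("R")


def _empty_cuda_cache() -> None:
    """Return cached GPU memory to the driver when torch and CUDA are available."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def run_stage(load: Callable[[], P], run: Callable[[P], R]) -> R:
    """Load a pipeline, run it once, then drop it before the next stage loads."""
    pipe = load()
    try:
        return run(pipe)
    finally:
        del pipe             # drop the only reference held by this function
        gc.collect()         # collect any reference cycles right away
        _empty_cuda_cache()  # release cached blocks (no-op without CUDA)


class FakePipeline:
    """Stand-in for a real pipeline; counts how many instances are alive."""

    alive = 0

    def __init__(self) -> None:
        FakePipeline.alive += 1

    def __del__(self) -> None:
        FakePipeline.alive -= 1

    def __call__(self, prompt: str) -> str:
        return f"image for {prompt!r}"


draft = run_stage(FakePipeline, lambda pipe: pipe("a poster"))
refined = run_stage(FakePipeline, lambda pipe: pipe(draft))
print(FakePipeline.alive)  # 0
```

With real pipelines, `load` would be something like a lambda that calls `from_pretrained(...).to("cuda")`; the important part is that no reference to the Stage 1 pipeline survives outside `run_stage`.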
Time Cost
| Method | Total Time | Quality |
|---|---|---|
| ERNIE-Image Turbo alone | ~8 sec | ★★★★☆ |
| ERNIE + Z-Image two-pass | ~12 sec | ★★★★★ |
| ERNIE Base 50 steps | ~30 sec | ★★★★☆ |
| FLUX.2 Pro 20 steps | ~25 sec | ★★★★☆ |
Value Assessment
Whether the extra ~4 seconds per image of a two-pass workflow is worth the quality gain depends on the use case:
- E-commerce product photos: Worth it (quality directly affects conversion)
- Social media images: Situational (Instagram compression reduces visible differences)
- Posters/Print: Strongly recommended (large output requires high quality)
- Rapid prototyping: Not worth it (speed prioritized)
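To put the trade-off in numbers, the short sketch below takes the totals from the Time Cost table and derives throughput and overhead relative to ERNIE-Image Turbo alone. The helper names are hypothetical, and the timings are the approximate figures quoted in this article, not new measurements.

```python
# Approximate totals from the Time Cost table (seconds per image)
timings_sec = {
    "ERNIE-Image Turbo alone": 8,
    "ERNIE + Z-Image two-pass": 12,
    "ERNIE Base 50 steps": 30,
    "FLUX.2 Pro 20 steps": 25,
}


def images_per_minute(seconds_per_image: float) -> float:
    return 60.0 / seconds_per_image


def relative_overhead(seconds_per_image: float, baseline_seconds: float) -> float:
    """Extra time as a fraction of the baseline (0.5 means 50% slower)."""
    return seconds_per_image / baseline_seconds - 1.0


baseline = timings_sec["ERNIE-Image Turbo alone"]
for name, seconds in timings_sec.items():
    print(
        f"{name}: {seconds}s, "
        f"{images_per_minute(seconds):.1f} img/min, "
        f"{relative_overhead(seconds, baseline):+.0%} vs baseline"
    )
```

By these figures the two-pass workflow costs 50% more time per image than ERNIE-Image Turbo alone (5 images per minute instead of 7.5), which is why it suits final assets better than rapid iteration.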
FAQ
Q1: Will Stage 2 always improve quality?
Not necessarily. If Stage 1's composition and text are already satisfactory, setting the denoising strength too high in Stage 2 can degrade them. Start testing from 0.25.
Q2: Can I use FLUX.2 instead of Z-Image as the refinement model?
Yes. FLUX.2 Pro has higher image quality but requires more VRAM (~16 GB). Recommended for RTX 4090/5090 users.
Q3: Does the two-pass workflow affect text rendering?
With the denoising strength in the 0.3~0.4 range, text is not significantly affected. Values above 0.5 may distort it.
Q4: Can this be done in a browser?
Yes. HuggingFace Spaces and Replicate support multi-model pipelines, but local ComfyUI deployment remains the most flexible option.
Summary
Two-pass refinement workflows represent a new paradigm in AI image generation: instead of pursuing a single model's versatility, collaborate across models to leverage individual strengths.
ERNIE-Image Turbo's text rendering and composition, combined with the quality refinement of Z-Image Turbo or FLUX.2 Pro, can produce results that exceed what either model achieves alone. As the ComfyUI ecosystem matures and custom nodes multiply, this workflow is likely to become more popular.
For professional creators, mastering multi-model collaboration pipelines is already an essential skill in 2026.
Further Reading:
- Reddit: Ernie Image Turbo + Z-Image Turbo 2 Pass Workflow
- Pixaroma ComfyUI Nodes: ComfyUI-Pixaroma
- ComfyUI Official ERNIE-Image Tutorial: docs.comfy.org
- Related: EI-012 ComfyUI Workflow Setup, EI-016 IP-Adapter Guide