ERNIE-Image ControlNet Practical Guide: Precise Composition Control with Canny, Depth, and Pose

May 10, 2026


Publish Date: 2026-05-10
Keywords: ernie-image controlnet, ernie-image canny, ernie-image depth, ernie-image pose, ComfyUI controlnet tutorial


Introduction

ERNIE-Image's text rendering and structured layout capabilities have already set it apart in the open-source text-to-image landscape. But for professional creators, there's one more critical need: precise control over image composition.

ControlNet is the solution: it lets you guide AI generation through edge maps, depth maps, or pose maps, maintaining structural constraints while freely changing styles. However, as of May 2026, ERNIE-Image's ControlNet support is still in the community-development stage and lacks official documentation.

This article fills that gap, walking you through building an ERNIE-Image ControlNet workflow in ComfyUI, covering three core modes — Canny (edges), Depth (spatial layout), and Pose (body posture) — plus parameter tuning tips and troubleshooting.


1. What is ControlNet? Why Do You Need It?

1.1 Core Concept

ControlNet, proposed by Stanford researchers, injects structural priors into diffusion models:

  1. Input: A reference image (sketch, photo, lineart, etc.)
  2. Preprocessing: Extract structural information (edges, depth, pose)
  3. Guidance: Maintain structural constraints during diffusion
  4. Output: AI-generated image respecting structural constraints
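The four steps above can be made concrete with a toy sketch. In the original ControlNet design, a control branch produces features that are added to the base model's features, and the strength setting used later in this guide scales that addition. The code below illustrates only that idea on flat lists of floats; `base_block` and `control_branch` are hypothetical stand-ins, not ERNIE-Image code.

```python
from typing import List


def base_block(x: List[float]) -> List[float]:
    """Hypothetical stand-in for one frozen block of the base diffusion model."""
    return [0.5 * v for v in x]


def control_branch(x: List[float], control: List[float]) -> List[float]:
    """Hypothetical stand-in for the branch that sees the control map."""
    return [c - v for v, c in zip(x, control)]


def guided_block(x: List[float], control: List[float], strength: float) -> List[float]:
    """Add the control residual, scaled by strength, to the base output."""
    base = base_block(x)
    residual = control_branch(x, control)
    return [b + strength * r for b, r in zip(base, residual)]


features = [1.0, 2.0, 3.0]
control_map = [1.0, 0.0, 1.0]

# With strength 0.0 the control branch has no effect on the output.
print(guided_block(features, control_map, 0.0))
# A higher strength pulls the output further toward the control signal.
print(guided_block(features, control_map, 0.5))
```

A strength of 0.0 leaves the base model's output untouched, which is why very low strength values produce little visible structural guidance.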

1.2 Three Core Modes

| Mode | Input | Controls | Typical Use Case |
| --- | --- | --- | --- |
| Canny | Canny edge map | Outlines, lines, structural boundaries | Lineart coloring, sketch rendering, architecture reproduction |
| Depth | Depth map | Spatial layout, foreground-background relationships | Scene reconstruction, style transfer preserving composition |
| Pose | Pose map (skeleton) | Body posture, joint positions | Character pose control, character redesign |

2. ComfyUI Workflow Setup

2.1 Environment Preparation

# Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt

# Download ERNIE-Image model
# From HuggingFace baidu/ERNIE-Image, download diffusion_model.safetensors
# Place in ComfyUI/models/diffusion_models/

# Download VAE
# From HuggingFace, download VAE files
# Place in ComfyUI/models/vae/

# Download ControlNet models
# Recommended: DiffSynth ControlNet Patch or InstantX ControlNet Union
# Place in ComfyUI/models/controlnet/
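Before opening ComfyUI, it can help to confirm that each folder from the steps above actually contains a model file. The following is a minimal sketch, assuming every model ships as a `.safetensors` file (the guide only states this for the diffusion model) and that the script runs from the directory containing the `ComfyUI` folder. The function name `find_empty_model_dirs` is hypothetical.

```python
from pathlib import Path
from typing import List

# Sub-folders named in the setup steps, relative to the ComfyUI root.
EXPECTED_DIRS = [
    "models/diffusion_models",
    "models/vae",
    "models/controlnet",
]


def find_empty_model_dirs(comfy_root: str) -> List[str]:
    """Return expected sub-folders that are missing or hold no .safetensors file."""
    root = Path(comfy_root)
    problems = []
    for rel in EXPECTED_DIRS:
        folder = root / rel
        # Assumption: model files use the .safetensors extension.
        if not folder.is_dir() or not any(folder.glob("*.safetensors")):
            problems.append(rel)
    return problems


for rel in find_empty_model_dirs("ComfyUI"):
    print(f"No .safetensors file found in ComfyUI/{rel}")
```

If the script prints nothing, all three folders contain at least one candidate file.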

2.2 Core Node Connections

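A standard ERNIE-Image ControlNet workflow contains the node chains below. In stock ComfyUI, a file placed in models/diffusion_models/ is read by the Load Diffusion Model node rather than a checkpoint loader, so substitute that node if the model does not appear in the loader's list: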

[Checkpoint Loader] → [CLIP Text Encode (Positive)]
                     → [CLIP Text Encode (Negative)]
                     → [ControlNet Loader] ← [Load Image (control image)]
                     → [ControlNet Apply (DiT)]
                     → [KSampler] → [VAE Decode] → [Save Image]

Node Configuration Guide:

| Node | Key Parameter | Recommended Value |
| --- | --- | --- |
| Checkpoint Loader | model_name | baidu_ERNIE-Image |
| ControlNet Loader | control_net_name | ernie_image_canny / ernie_image_depth / ernie_image_pose |
| ControlNet Apply (DiT) | strength | 0.3 ~ 0.8 (adjust per mode) |
| KSampler | steps | 50 (Standard) / 8 (Turbo) |
| KSampler | cfg | 4.0 ~ 7.0 |
| KSampler | denoise | 0.7 ~ 1.0 |
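The recommended values above can be kept in one place and checked before a long render. This is a small sketch in plain Python; the names `RANGES`, `ALLOWED_STEPS`, and `check_settings` are hypothetical helpers and not part of ComfyUI.

```python
from typing import Dict, List, Tuple

# Recommended ranges copied from the node configuration guide (inclusive).
RANGES: Dict[str, Tuple[float, float]] = {
    "strength": (0.3, 0.8),
    "cfg": (4.0, 7.0),
    "denoise": (0.7, 1.0),
}
ALLOWED_STEPS = {50, 8}  # Standard / Turbo


def check_settings(settings: dict) -> List[str]:
    """Return one warning per value that falls outside the recommended range."""
    warnings = []
    for name, (low, high) in RANGES.items():
        value = settings[name]
        if not low <= value <= high:
            warnings.append(f"{name}={value} is outside the recommended {low}~{high}")
    if settings["steps"] not in ALLOWED_STEPS:
        warnings.append(f"steps={settings['steps']} is neither 50 (Standard) nor 8 (Turbo)")
    return warnings


print(check_settings({"strength": 0.9, "cfg": 5.0, "denoise": 1.0, "steps": 50}))
```

The ranges are starting points rather than hard limits, so the helper returns warnings instead of raising errors.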

2.3 Canny Mode: Lineart Coloring & Sketch Rendering

Use Case: You have a sketch or lineart and want ERNIE-Image to turn it into a fully rendered image.

Workflow Steps:

  1. Load your lineart/sketch into the Load Image node
  2. Connect Canny Edge Preprocessor node (low threshold 50, high threshold 200)
  3. Connect ControlNet Loader with Canny ControlNet model
  4. Connect ControlNet Apply (DiT) node
  5. Write a prompt describing your desired final effect
  6. Run KSampler
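The two thresholds in step 2 control which edges survive. The sketch below shows only the double-threshold (hysteresis) stage of Canny on a single row of gradient magnitudes. It is a simplification: a real Canny detector also smooths the image, computes gradients, and thins edges in 2-D. The function name `hysteresis_1d` and the sample values are illustrative.

```python
from typing import List


def hysteresis_1d(magnitudes: List[float], low: float, high: float) -> List[bool]:
    """Keep a pixel if it is >= high, or >= low and connected to a pixel >= high."""
    keep = [False] * len(magnitudes)
    i = 0
    while i < len(magnitudes):
        if magnitudes[i] < low:
            i += 1
            continue
        # Find the run of consecutive candidate pixels (values >= low).
        j = i
        while j < len(magnitudes) and magnitudes[j] >= low:
            j += 1
        # The whole run survives only if it contains at least one strong pixel.
        if any(m >= high for m in magnitudes[i:j]):
            for k in range(i, j):
                keep[k] = True
        i = j
    return keep


row = [10, 60, 220, 90, 20, 120, 70, 30]
print(hysteresis_1d(row, low=50, high=200))
# [False, True, True, True, False, False, False, False]
```

Lowering the high threshold keeps more faint lines from a sketch, while raising the low threshold removes weak strokes attached to strong ones.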

Recommended Prompts:

# Architectural lineart → photorealistic render
"A photorealistic rendering of a modern glass skyscraper at sunset,
golden hour lighting, reflections on glass facade, urban environment,
architectural photography, 8K quality"

# Anime lineart → coloring

"Colorful anime style character illustration, vibrant colors,
detailed shading, cel-shaded, studio-quality anime art"

Strength Tuning:

  • 0.3~0.5: Maintain rough outline, AI has significant creative freedom
  • 0.5~0.7: Closely follow lineart structure
  • 0.7~0.8: Strict adherence, but may limit AI creativity

2.4 Depth Mode: Scene Reconstruction & Style Transfer

Use Case: You want to maintain the spatial layout of a reference photo while changing the overall style.

Workflow Steps:

  1. Load reference photo into Load Image node
  2. Connect Depth Anything Preprocessor or Zoe Depth Preprocessor
  3. Connect ControlNet Loader with Depth ControlNet model
  4. Write a prompt describing the target style
  5. Run KSampler
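A depth preprocessor in step 2 ultimately hands the ControlNet a grayscale image. As a rough illustration of that last stage, the sketch below rescales raw depth values to the 0~255 range with min-max normalization. It assumes the raw values are already oriented the way the ControlNet expects, and the function name `normalize_depth` is hypothetical. The actual preprocessor nodes may normalize differently.

```python
from typing import List


def normalize_depth(raw: List[List[float]]) -> List[List[int]]:
    """Min-max rescale a 2-D grid of depth values to integers in 0..255."""
    flat = [v for row in raw for v in row]
    lowest, highest = min(flat), max(flat)
    if highest == lowest:
        # A perfectly flat scene carries no layout information.
        return [[0 for _ in row] for row in raw]
    scale = 255 / (highest - lowest)
    return [[round((v - lowest) * scale) for v in row] for row in raw]


raw_depth = [[0.0, 1.0], [3.0, 5.0]]
print(normalize_depth(raw_depth))
# [[0, 51], [153, 255]]
```

Because the map is relative, only the ordering and spacing of surfaces is preserved, which is what lets the prompt change the style freely.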

Recommended Prompts:

# Photo → oil painting
"Oil painting of a mountain landscape, Van Gogh style,
thick brushstrokes, vibrant colors, dramatic sky,
Impressionist masterpiece"

# Interior photo → cyberpunk

"Cyberpunk interior scene, neon lights, rain-streaked windows,
holographic displays, dark atmosphere, futuristic city view,
blade runner style"

Strength Tuning:

  • 0.4~0.6: Maintain spatial layout, allow significant style changes
  • 0.6~0.8: Closely follow original composition

2.5 Pose Mode: Body Posture Control

Use Case: You need precise control over character poses, for example in character redesign or pose replacement.

Workflow Steps:

  1. Load reference character photo into Load Image node
  2. Connect OpenPose Preprocessor node
  3. Connect ControlNet Loader with Pose ControlNet model
  4. Write a prompt describing the target character/style
  5. Run KSampler
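The pose map produced in step 2 is essentially a set of joint positions connected by limbs. The sketch below models that idea with a small, illustrative subset of joints in normalized coordinates and converts them to pixel positions for a target resolution. The joint names, the `LIMBS` list, and `to_pixels` are assumptions for illustration and do not reproduce the full OpenPose layout.

```python
from typing import Dict, Tuple

# Illustrative subset of joints with normalized (x, y) positions in 0..1.
pose: Dict[str, Tuple[float, float]] = {
    "neck": (0.5, 0.25),
    "right_shoulder": (0.375, 0.25),
    "right_elbow": (0.25, 0.5),
}

# Pairs of joints that are drawn as limbs in the skeleton image.
LIMBS = [("neck", "right_shoulder"), ("right_shoulder", "right_elbow")]


def to_pixels(keypoints: Dict[str, Tuple[float, float]], width: int, height: int) -> Dict[str, Tuple[int, int]]:
    """Scale normalized joint positions to pixel coordinates."""
    return {name: (round(x * width), round(y * height)) for name, (x, y) in keypoints.items()}


pixels = to_pixels(pose, 1024, 1024)
for start, end in LIMBS:
    print(f"draw limb {start} -> {end}: {pixels[start]} to {pixels[end]}")
```

Keeping joints in normalized coordinates makes it easy to reuse one pose at several output resolutions.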

Recommended Prompts:

# Pose replacement (change costume/style)
"A female warrior in elaborate fantasy armor,
dynamic battle pose, fantasy art style,
detailed metallic textures, dramatic lighting,
epic composition"

# Character design

"A cyberpunk street samurai, neon-lit rain-soaked alley,
detailed futuristic clothing, katana at side,
cinematic composition, movie poster style"

Strength Tuning:

  • 0.3~0.5: Roughly follow pose, AI adjusts details
  • 0.5~0.7: Precise joint position adherence

3. Multi-ControlNet Stacking

Advanced technique: Combine multiple ControlNet modes for finer control.

3.1 Canny + Depth Combination

Scenario: Maintain both edge structure and spatial depth.

[Load Image] → [Canny Preprocessor] → [ControlNet Apply (Canny)]
[Load Image] → [Depth Preprocessor] → [ControlNet Apply (Depth)]
                                                     ↓
                                            [ControlNet Merge] → [KSampler]

Parameter Suggestions:

  • Canny strength: 0.4
  • Depth strength: 0.5
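For readers who drive ComfyUI through its API-format workflow JSON, the sketch below builds the two apply steps with the strengths suggested above. Note the swap: instead of the merge node in the diagram, it chains two stock `ControlNetApplyAdvanced` nodes, which is how stock ComfyUI stacks ControlNets. Whether the stock node works with ERNIE-Image ControlNets is an assumption this guide does not confirm. The node ids, the file names, and the `apply_node` helper are hypothetical, and the control images are assumed to be preprocessed already.

```python
import json


def apply_node(prev_positive, prev_negative, loader_id, image_id, strength):
    """Build one ControlNetApplyAdvanced node in ComfyUI API format."""
    return {
        "class_type": "ControlNetApplyAdvanced",
        "inputs": {
            "positive": prev_positive,
            "negative": prev_negative,
            "control_net": [loader_id, 0],
            "image": [image_id, 0],
            "strength": strength,
            "start_percent": 0.0,
            "end_percent": 1.0,
        },
    }


graph = {
    # Hypothetical file names; use the names shown in your own loader lists.
    "10": {"class_type": "ControlNetLoader", "inputs": {"control_net_name": "ernie_image_canny.safetensors"}},
    "11": {"class_type": "LoadImage", "inputs": {"image": "canny_map.png"}},
    "20": {"class_type": "ControlNetLoader", "inputs": {"control_net_name": "ernie_image_depth.safetensors"}},
    "21": {"class_type": "LoadImage", "inputs": {"image": "depth_map.png"}},
}

# Ids "1" and "2" stand for the positive and negative text-encode nodes,
# which would be defined elsewhere in the full workflow.
graph["12"] = apply_node(["1", 0], ["2", 0], "10", "11", 0.4)   # Canny
# The Depth step consumes the conditioning that the Canny step produced.
graph["22"] = apply_node(["12", 0], ["12", 1], "20", "21", 0.5)  # Depth

print(json.dumps(graph["22"], indent=2))
```

In this arrangement the sampler would take its positive and negative conditioning from node "22", so both constraints reach it through a single chain.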

3.2 Pose + Canny Combination

Scenario: Precisely control both body posture and costume outline.

[Load Image (pose ref)] → [OpenPose Preprocessor] → [ControlNet Apply (Pose)]
[Load Image (lineart)] → [Canny Preprocessor] → [ControlNet Apply (Canny)]
                                                                  ↓
                                                         [ControlNet Merge] → [KSampler]

4. Troubleshooting

Q1: ControlNet effect is not obvious

Cause: Strength parameter too low or ControlNet model incompatible with base model.
Solution:

  • Gradually increase strength (test from 0.3 to 0.8)
  • Confirm ControlNet model is designed for ERNIE-Image architecture
  • Try DiffSynth ControlNet Patch instead of InstantX Union
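The strength test in the first bullet is easier to compare if the values are generated systematically. This is a minimal sketch that lists the strengths to try; `strength_sweep` is a hypothetical helper, and keeping the seed fixed between runs is a suggestion rather than something the workflow requires.

```python
from typing import List


def strength_sweep(start: float = 0.3, stop: float = 0.8, step: float = 0.1) -> List[float]:
    """Return evenly spaced strength values from start to stop, inclusive."""
    count = int(round((stop - start) / step)) + 1
    # Round each value to avoid floating-point noise such as 0.6000000000000001.
    return [round(start + i * step, 2) for i in range(count)]


print(strength_sweep())
# [0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
```

Rendering one image per value with the same seed and prompt makes it clear at which strength the control image starts to take effect.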

Q2: Generated images show artifacts or distortion

Cause: Strength too high or poor quality control image.
Solution:

  • Lower strength below 0.5
  • Use high-quality control images (clear, high resolution)
  • Increase CFG value to 6.0~7.0

Q3: Text rendering fails under ControlNet

Cause: Structural constraints from ControlNet compete with text rendering.
Solution:

  • Use lower strength (0.3~0.4)
  • Explicitly mark text content with quotes in prompt
  • Consider generating ControlNet base composition first, then add text via inpainting

Q4: Out of VRAM

Cause: The ControlNet model and the base model are loaded at the same time, which substantially increases VRAM requirements.
Solution:

  • Use INT8 quantized ERNIE-Image model (~10GB VRAM)
  • Use NVFP4 quantization (~5GB VRAM)
  • Disable PE (Prompt Enhancer) to save VRAM

5. Best Practices Summary

  1. Start with low strength: Begin at 0.3, gradually increase until satisfied
  2. Control image quality is critical: Clear edge/depth/pose maps produce better results
  3. Multi-mode combinations are advanced techniques: Canny+Depth or Pose+Canny enable fine-grained control
  4. Turbo mode compatibility: ControlNet effects slightly degrade in Turbo (8-step) mode; Standard (50-step) recommended
  5. Separate text and ControlNet: For text rendering, generate ControlNet base composition first, then add text via inpainting

This article is based on ComfyUI community practices and ERNIE-Image DiT architecture characteristics. The ControlNet ecosystem is rapidly evolving — check HuggingFace and ComfyUI communities regularly for the latest model updates.

ERNIE-Image Team
