ERNIE-Image ControlNet Practical Guide: Precise Composition Control with Canny, Depth, and Pose
Publish Date: 2026-05-10
Keywords: ernie-image controlnet, ernie-image canny, ernie-image depth, ernie-image pose, ComfyUI controlnet tutorial
Introduction
ERNIE-Image's text rendering and structured layout capabilities have already set it apart in the open-source text-to-image landscape. But for professional creators, there's one more critical need: precise control over image composition.
ControlNet addresses this need: it lets you guide generation with edge maps, depth maps, or pose maps, holding the structure fixed while you change the style freely. However, as of May 2026, ERNIE-Image's ControlNet support is still under community development and has no official documentation.
This article fills that gap, walking you through building an ERNIE-Image ControlNet workflow in ComfyUI, covering three core modes — Canny (edges), Depth (spatial layout), and Pose (body posture) — plus parameter tuning tips and troubleshooting.
1. What is ControlNet? Why Do You Need It?
1.1 Core Concept
ControlNet, proposed by Stanford researchers, injects structural priors into diffusion models:
- Input: A reference image (sketch, photo, lineart, etc.)
- Preprocessing: Extract structural information (edges, depth, pose)
- Guidance: Maintain structural constraints during diffusion
- Output: AI-generated image respecting structural constraints
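
The guidance stage above can be pictured with a toy calculation. This is only a sketch of the general ControlNet idea, in which a residual derived from the control image is scaled by a strength factor and added to the model's hidden features. It is not ERNIE-Image's implementation, and the function name and vector sizes are hypothetical.

```python
# Toy illustration of ControlNet-style guidance (hypothetical names, not ERNIE-Image code).
from typing import List


def apply_control(hidden: List[float], control_residual: List[float], strength: float) -> List[float]:
    """Add a structural residual to the hidden features, scaled by strength."""
    if len(hidden) != len(control_residual):
        raise ValueError("hidden and control_residual must have the same length")
    return [h + strength * r for h, r in zip(hidden, control_residual)]


hidden = [1.0, 2.0, 3.0]
residual = [0.5, -1.0, 0.0]

unguided = apply_control(hidden, residual, strength=0.0)  # strength 0 leaves the features unchanged
guided = apply_control(hidden, residual, strength=0.5)    # higher strength pulls harder toward the structure
print(unguided, guided)
```

With strength at 0 the control image has no influence, which is why a very low strength setting can make ControlNet appear to do nothing.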
1.2 Three Core Modes
| Mode | Input | Controls | Typical Use Case |
|---|---|---|---|
| Canny | Canny edge map | Outlines, lines, structural boundaries | Lineart coloring, sketch rendering, architecture reproduction |
| Depth | Depth map | Spatial layout, foreground-background relationships | Scene reconstruction, style transfer preserving composition |
| Pose | Pose map (skeleton) | Body posture, joint positions | Character pose control, character redesign |
2. ComfyUI Workflow Setup
2.1 Environment Preparation
# Install ComfyUI
git clone https://github.com/comfyanonymous/ComfyUI.git
cd ComfyUI
pip install -r requirements.txt
# Download ERNIE-Image model
# From HuggingFace baidu/ERNIE-Image, download diffusion_model.safetensors
# Place in ComfyUI/models/diffusion_models/
# Download VAE
# From HuggingFace, download VAE files
# Place in ComfyUI/models/vae/
# Download ControlNet models
# Recommended: DiffSynth ControlNet Patch or InstantX ControlNet Union
# Place in ComfyUI/models/controlnet/
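
Before opening ComfyUI, it can help to confirm that the downloads landed in the folders listed above. The following is a minimal sketch using only the Python standard library. The helper name is hypothetical, and it assumes every model file uses the `.safetensors` extension, which the steps above state only for the diffusion model.

```python
from pathlib import Path
from typing import Dict

# Sub-directories named in the setup steps above.
MODEL_DIRS = ("diffusion_models", "vae", "controlnet")


def count_model_files(comfy_root: str) -> Dict[str, int]:
    """Count .safetensors files per expected model directory (-1 if the directory is missing)."""
    models = Path(comfy_root) / "models"
    counts = {}
    for name in MODEL_DIRS:
        folder = models / name
        # Assumption: model files end in .safetensors; adjust the pattern if yours differ.
        counts[name] = len(list(folder.glob("*.safetensors"))) if folder.is_dir() else -1
    return counts


# Run from the directory that contains the ComfyUI checkout.
for name, n in count_model_files("ComfyUI").items():
    status = "missing directory" if n < 0 else f"{n} .safetensors file(s)"
    print(f"models/{name}: {status}")
```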
2.2 Core Node Connections
A standard ERNIE-Image ControlNet workflow contains these node chains:
[Checkpoint Loader] → [CLIP Text Encode (Positive)]
→ [CLIP Text Encode (Negative)]
→ [ControlNet Loader] ← [Load Image (control image)]
→ [ControlNet Apply (DiT)]
→ [KSampler] → [VAE Decode] → [Save Image]
Node Configuration Guide:
| Node | Key Parameter | Recommended Value |
|---|---|---|
| Checkpoint Loader | model_name | baidu_ERNIE-Image |
| ControlNet Loader | control_net_name | ernie_image_canny / ernie_image_depth / ernie_image_pose |
| ControlNet Apply (DiT) | strength | 0.3 ~ 0.8 (adjust per mode) |
| KSampler | steps | 50 (Standard) / 8 (Turbo) |
| KSampler | cfg | 4.0 ~ 7.0 |
| KSampler | denoise | 0.7 ~ 1.0 |
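
The recommended values in this table can be turned into a quick sanity check on a settings dictionary before queuing a run. This is a sketch under stated assumptions: the dictionary keys and the `check_settings` helper are illustrative and not part of any ComfyUI API.

```python
from typing import Dict, List, Tuple

# Recommended ranges copied from the table above; the dictionary keys are illustrative.
RECOMMENDED: Dict[str, Tuple[float, float]] = {
    "strength": (0.3, 0.8),
    "cfg": (4.0, 7.0),
    "denoise": (0.7, 1.0),
}
RECOMMENDED_STEPS = {"standard": 50, "turbo": 8}


def check_settings(settings: Dict[str, float], mode: str = "standard") -> List[str]:
    """Return human-readable warnings for values outside the recommended ranges."""
    warnings = []
    for key, (low, high) in RECOMMENDED.items():
        value = settings.get(key)
        if value is not None and not (low <= value <= high):
            warnings.append(f"{key}={value} is outside {low}~{high}")
    steps = settings.get("steps")
    if steps is not None and steps != RECOMMENDED_STEPS[mode]:
        warnings.append(f"steps={steps} differs from the {mode} recommendation of {RECOMMENDED_STEPS[mode]}")
    return warnings


print(check_settings({"strength": 0.9, "cfg": 5.0, "denoise": 1.0, "steps": 50}))
```

The warnings are advisory only. Values outside these ranges may still work for a given ControlNet model.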
2.3 Canny Mode: Lineart Coloring & Sketch Rendering
Use Case: You have a sketch or lineart and want ERNIE-Image to transform it into a beautiful rendered image.
Workflow Steps:
- Load your lineart/sketch into the Load Image node
- Connect a Canny Edge Preprocessor node (low threshold 50, high threshold 200)
- Connect a ControlNet Loader with the Canny ControlNet model
- Connect a ControlNet Apply (DiT) node
- Write a prompt describing your desired final effect
- Run KSampler
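
The low and high thresholds in the Canny step decide which gradient responses survive as edges. The sketch below illustrates only the final double-threshold (hysteresis) stage of Canny on a precomputed gradient-magnitude grid. Real preprocessors also smooth the image, compute gradients, and thin edges first, and their exact comparison rules may differ. The function name is hypothetical.

```python
from collections import deque
from typing import List


def hysteresis(magnitude: List[List[float]], low: float = 50, high: float = 200) -> List[List[int]]:
    """Keep strong pixels (>= high) plus weak pixels (>= low) connected to a strong one."""
    rows, cols = len(magnitude), len(magnitude[0])
    edges = [[0] * cols for _ in range(rows)]
    queue = deque()
    # Seed the search with every strong pixel.
    for r in range(rows):
        for c in range(cols):
            if magnitude[r][c] >= high:
                edges[r][c] = 255
                queue.append((r, c))
    # Grow edges outward through 8-connected weak pixels.
    while queue:
        r, c = queue.popleft()
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and edges[nr][nc] == 0 and magnitude[nr][nc] >= low:
                    edges[nr][nc] = 255
                    queue.append((nr, nc))
    return edges


grid = [
    [250, 120, 10, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 120],
]
result = hysteresis(grid)
for row in result:
    print(row)
```

In this grid the weak pixel next to the strong one is kept, while the isolated weak pixel in the corner is dropped. Lowering the high threshold keeps more faint lines from a sketch, and raising the low threshold removes stray marks.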
Recommended Prompts:
# Architectural lineart → photorealistic render
"A photorealistic rendering of a modern glass skyscraper at sunset,
golden hour lighting, reflections on glass facade, urban environment,
architectural photography, 8K quality"
# Anime lineart → coloring
"Colorful anime style character illustration, vibrant colors,
detailed shading, cel-shaded, studio-quality anime art"
Strength Tuning:
- 0.3~0.5: Maintain rough outline, AI has significant creative freedom
- 0.5~0.7: Closely follow lineart structure
- 0.7~0.8: Strict adherence, but may limit AI creativity
2.4 Depth Mode: Scene Reconstruction & Style Transfer
Use Case: You want to maintain the spatial layout of a reference photo while changing the overall style.
Workflow Steps:
- Load reference photo into the Load Image node
- Connect a Depth Anything Preprocessor or Zoe Depth Preprocessor
- Connect a ControlNet Loader with the Depth ControlNet model
- Write a prompt describing the target style
- Run KSampler
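
A depth preprocessor ultimately hands the ControlNet a grayscale image. As a rough sketch of that last normalization step, the helper below scales raw depth values to 0-255. It assumes that larger input values mean farther away and that the ControlNet expects nearer surfaces to be brighter, a common convention that you should verify for the model you download. The function name is hypothetical.

```python
from typing import List


def depth_to_control(depth: List[List[float]], near_is_bright: bool = True) -> List[List[int]]:
    """Scale raw depth values (larger = farther) to a 0-255 grayscale map."""
    flat = [v for row in depth for v in row]
    lo, hi = min(flat), max(flat)
    span = hi - lo
    out = []
    for row in depth:
        new_row = []
        for v in row:
            t = 0.0 if span == 0 else (v - lo) / span  # 0 = nearest, 1 = farthest
            if near_is_bright:
                t = 1.0 - t  # flip so that near surfaces become bright
            new_row.append(round(t * 255))
        out.append(new_row)
    return out


metres = [[1.0, 2.0], [4.0, 5.0]]
control_map = depth_to_control(metres)
print(control_map)
```

If the generated image inverts foreground and background, toggling `near_is_bright` on the control map is a cheap thing to try before changing strength.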
Recommended Prompts:
# Photo → oil painting
"Oil painting of a mountain landscape, Van Gogh style,
thick brushstrokes, vibrant colors, dramatic sky,
Impressionist masterpiece"
# Interior photo → cyberpunk
"Cyberpunk interior scene, neon lights, rain-streaked windows,
holographic displays, dark atmosphere, futuristic city view,
blade runner style"
Strength Tuning:
- 0.4~0.6: Maintain spatial layout, allow significant style changes
- 0.6~0.8: Closely follow original composition
2.5 Pose Mode: Body Posture Control
Use Case: You need precise control over character poses, like character redesign or pose replacement.
Workflow Steps:
- Load reference character photo into the Load Image node
- Connect an OpenPose Preprocessor node
- Connect a ControlNet Loader with the Pose ControlNet model
- Write a prompt describing the target character/style
- Run KSampler
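
A pose map is a set of joint coordinates drawn as a skeleton. When the reference photo and the generation resolution differ, the joints need to be rescaled so the skeleton lines up with the output canvas. The sketch below shows that idea with a hypothetical three-limb subset. Real OpenPose skeletons define more joints, and the joint names here are illustrative.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]

# A few limb connections for illustration; real OpenPose skeletons define more joints.
LIMBS: List[Tuple[str, str]] = [
    ("neck", "right_shoulder"),
    ("right_shoulder", "right_elbow"),
    ("right_elbow", "right_wrist"),
]


def rescale_pose(keypoints: Dict[str, Point], src_size: Tuple[int, int], dst_size: Tuple[int, int]) -> Dict[str, Point]:
    """Map keypoints from the reference image size to the generation size (width, height)."""
    sx = dst_size[0] / src_size[0]
    sy = dst_size[1] / src_size[1]
    return {name: (x * sx, y * sy) for name, (x, y) in keypoints.items()}


def limb_segments(keypoints: Dict[str, Point]) -> List[Tuple[Point, Point]]:
    """Return the line segments to draw, skipping limbs with an undetected joint."""
    return [(keypoints[a], keypoints[b]) for a, b in LIMBS if a in keypoints and b in keypoints]


# The right wrist is missing here, as happens when a joint is occluded in the photo.
pose = {"neck": (256.0, 100.0), "right_shoulder": (200.0, 110.0), "right_elbow": (180.0, 200.0)}
scaled = rescale_pose(pose, src_size=(512, 512), dst_size=(1024, 1024))
segments = limb_segments(scaled)
print(segments)
```

Missing joints simply produce fewer segments, which is one reason a lower strength helps when the reference photo hides part of the body.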
Recommended Prompts:
# Pose replacement (change costume/style)
"A female warrior in elaborate fantasy armor,
dynamic battle pose, fantasy art style,
detailed metallic textures, dramatic lighting,
epic composition"
# Character design
"A cyberpunk street samurai, neon-lit rain-soaked alley,
detailed futuristic clothing, katana at side,
cinematic composition, movie poster style"
Strength Tuning:
- 0.3~0.5: Roughly follow pose, AI adjusts details
- 0.5~0.7: Precise joint position adherence
3. Multi-ControlNet Stacking
Advanced technique: Combine multiple ControlNet modes for finer control.
3.1 Canny + Depth Combination
Scenario: Maintain both edge structure and spatial depth.
[Load Image] → [Canny Preprocessor] → [ControlNet Apply (Canny)]
[Load Image] → [Depth Preprocessor] → [ControlNet Apply (Depth)]
↓
[ControlNet Merge] → [KSampler]
Parameter Suggestions:
- Canny strength: 0.4
- Depth strength: 0.5
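
One way to reason about these two strengths is as weights on separate structural signals that both act on the same features. The toy sketch below assumes the combination is a simple weighted sum. How a given ComfyUI node actually merges control signals is not documented here, so treat this as intuition only. All names are hypothetical.

```python
from typing import Dict, List


def combine_controls(hidden: List[float], residuals: Dict[str, List[float]], strengths: Dict[str, float]) -> List[float]:
    """Add each mode's residual to the hidden features, weighted by that mode's strength."""
    out = list(hidden)
    for mode, residual in residuals.items():
        weight = strengths[mode]
        out = [o + weight * r for o, r in zip(out, residual)]
    return out


hidden = [0.0, 0.0]
residuals = {"canny": [1.0, 0.0], "depth": [0.0, 2.0]}
strengths = {"canny": 0.4, "depth": 0.5}  # values suggested above
combined = combine_controls(hidden, residuals, strengths)
print(combined)
```

Because the contributions add up, two controls at moderate strength can constrain the image as much as one control at high strength, which is a reason to keep each value below what you would use alone.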
3.2 Pose + Canny Combination
Scenario: Precisely control both body posture and costume outline.
[Load Image (pose ref)] → [OpenPose Preprocessor] → [ControlNet Apply (Pose)]
[Load Image (lineart)] → [Canny Preprocessor] → [ControlNet Apply (Canny)]
↓
[ControlNet Merge] → [KSampler]
4. Troubleshooting
Q1: ControlNet effect is not obvious
Cause: Strength parameter too low or ControlNet model incompatible with base model.
Solution:
- Gradually increase strength (test from 0.3 to 0.8)
- Confirm ControlNet model is designed for ERNIE-Image architecture
- Try DiffSynth ControlNet Patch instead of InstantX Union
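
Testing strength from 0.3 to 0.8 is easier to compare when the values are evenly spaced and everything else stays fixed. The helper below is a small sketch that produces such a list. The function name is hypothetical, and keeping the seed and prompt constant across runs is a suggestion rather than a requirement stated above.

```python
from typing import List


def strength_sweep(start: float = 0.3, stop: float = 0.8, step: float = 0.1) -> List[float]:
    """Return evenly spaced strength values from start to stop, inclusive."""
    count = int(round((stop - start) / step)) + 1
    # Round to two decimals to avoid floating-point noise such as 0.6000000000000001.
    return [round(start + i * step, 2) for i in range(count)]


for strength in strength_sweep():
    print(f"queue a run with strength={strength}, same seed and prompt")
```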
Q2: Generated images show artifacts or distortion
Cause: Strength too high or poor quality control image.
Solution:
- Lower strength below 0.5
- Use high-quality control images (clear, high resolution)
- Increase CFG value to 6.0~7.0
Q3: Text rendering fails under ControlNet
Cause: Structural constraints from ControlNet compete with text rendering.
Solution:
- Use lower strength (0.3~0.4)
- Explicitly mark text content with quotes in prompt
- Consider generating ControlNet base composition first, then add text via inpainting
Q4: Out of VRAM
Cause: Loading the ControlNet model alongside the base model substantially increases VRAM requirements.
Solution:
- Use INT8 quantized ERNIE-Image model (~10GB VRAM)
- Use NVFP4 quantization (~5GB VRAM)
- Disable PE (Prompt Enhancer) to save VRAM
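
Quantization helps because weight memory scales with bits per parameter. The sketch below estimates weight storage only and ignores activations, the VAE, the text encoder, and quantization scale factors, so real usage will be higher than these numbers. The parameter count is a placeholder chosen for illustration, not ERNIE-Image's published size.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory needed to hold the weights alone, in GiB."""
    return num_params * bits_per_param / 8 / (1024 ** 3)


HYPOTHETICAL_PARAMS = 8e9  # placeholder for illustration, not ERNIE-Image's published size

for label, bits in (("BF16", 16), ("INT8", 8), ("4-bit", 4)):
    print(f"{label}: {weight_memory_gib(HYPOTHETICAL_PARAMS, bits):.1f} GiB")
```

Halving the bits per parameter halves the weight footprint, which matches the roughly 2x gap between the INT8 and NVFP4 figures quoted above.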
5. Best Practices Summary
- Start with low strength: Begin at 0.3, gradually increase until satisfied
- Control image quality is critical: Clear edge/depth/pose maps produce better results
- Multi-mode combinations are advanced techniques: Canny+Depth or Pose+Canny enable fine-grained control
- Turbo mode compatibility: ControlNet effects slightly degrade in Turbo (8-step) mode; Standard (50-step) recommended
- Separate text and ControlNet: For text rendering, generate ControlNet base composition first, then add text via inpainting
This article is based on ComfyUI community practices and ERNIE-Image DiT architecture characteristics. The ControlNet ecosystem is rapidly evolving — check HuggingFace and ComfyUI communities regularly for the latest model updates.