ERNIE-Image Character LoRA Training Complete Guide: From Dataset Preparation to ComfyUI Deployment
Abstract: Character consistency is a core pain point in AI image generation. This article walks through the complete ERNIE-Image character LoRA workflow, from building a high-quality dataset through training parameter tuning to ComfyUI deployment, so that your AI character keeps its identity across different scenes, angles, and expressions.
Why Character LoRA Is the Watershed Moment for AI Image Generation
In AI image generation, character consistency has long been one of the hardest problems. You may have run into situations like these:
- The first character image looks perfect, but the second one looks completely different
- Different expressions and angles of the same character appear as entirely different people
- After training a LoRA, the character's features are "eaten away" and every output looks nearly identical
Early community reports suggest that ERNIE-Image's 8B DiT architecture takes well to LoRA training. A Reddit user noted: "Unlike Z-Image Turbo, ERNIE-Image seems to be really good for LoRA training", a sentiment echoed by others in the community.
Part 1: Core Concepts of Character LoRA Training
What is LoRA?
LoRA (Low-Rank Adaptation) is a parameter-efficient fine-tuning technique. It freezes the pre-trained weights and injects small trainable low-rank matrices into the attention layers, so the model adapts to a specific style or character while training only a small fraction of the original parameter count (typically a few percent or less, depending on the rank).
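To make the "low-rank" idea concrete, here is a minimal sketch of the arithmetic. For a frozen weight matrix W of shape d_out × d_in, LoRA learns two small matrices, B (d_out × r) and A (r × d_in), and uses W + (alpha / r) · B·A in place of W. The hidden size of 4096 below is an illustrative assumption, not ERNIE-Image's actual layer dimension.

```python
def lora_extra_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one d_out x d_in linear layer."""
    # A has shape (rank, d_in); B has shape (d_out, rank).
    return rank * d_in + d_out * rank


def lora_scale(alpha: float, rank: int) -> float:
    """Factor applied to the low-rank update B @ A before it is added to W."""
    return alpha / rank


if __name__ == "__main__":
    d = 4096  # hypothetical hidden size, chosen only for illustration
    full = d * d
    extra = lora_extra_params(d, d, rank=64)
    print(f"full layer: {full:,} parameters")
    print(f"LoRA at rank 64: {extra:,} parameters ({extra / full:.1%} of the layer)")
    print(f"update scale with alpha=32: {lora_scale(32, 64)}")
```

Because only some layers are adapted, the share of trainable parameters across the whole model is smaller than the per-layer figure.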
ERNIE-Image LoRA Training Advantages:
- Moderate parameter count: 8B DiT architecture balances training cost and output quality
- Text rendering preserved: Character LoRA training doesn't break ERNIE-Image's text rendering capability
- Layout compatibility: Character LoRA works alongside ERNIE-Image's poster/infographic generation
Character LoRA vs IP-Adapter vs Reference-Only
| Method | Identity Retention | Flexibility | Training Cost | Best Use Case |
|---|---|---|---|---|
| Character LoRA | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | Medium | Fixed character, multi-scene reuse |
| IP-Adapter | ⭐⭐⭐ | ⭐⭐⭐⭐⭐ | No training | Temporary character reference |
| Reference-Only | ⭐⭐ | ⭐⭐⭐⭐ | No training | Style transfer primarily |
Bottom line: If you need the same character across 10+ images in different scenes, a character LoRA is the most reliable of the three options.
Part 2: Dataset Preparation — 80% of LoRA Quality
The Golden Rule: 15-30 High-Quality Images
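A rule of thumb often repeated on Reddit: "25-30 images should give you good results for lora training." With too few images, the LoRA tends to memorize specific poses and backgrounds instead of learning the character; with too many, inconsistent or low-quality shots creep in and blur the identity.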
Dataset Construction Checklist
Required coverage (at least 2-3 images per category):
- Front portrait: clear face, neutral expression
- Side or 3/4 profile: shows the character's silhouette
- Full body: shows proportions and clothing style
- Different expressions: smiling, serious, surprised, etc.
- Different lighting: natural, indoor, backlit
Image quality standards:
- Resolution ≥ 512×512 (recommended 768×768 or higher)
- No watermarks, blur, or excessive filters
- Character occupies at least 50% of the frame
- Simple backgrounds to avoid training interference
❌ Images to avoid:
- Group shots (multiple characters)
- Extreme angles (bird's-eye, low-angle >45°)
- Distorted images (wide-angle distortion)
- Images of other characters mixed into the set
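The checklist above can be turned into a quick pre-training audit. The sketch below assumes you have already recorded each image's size and tagged it with one of five category labels; the label strings and the `ImageRecord` type are hypothetical names introduced for illustration, and the thresholds come from the checklist.

```python
from collections import Counter
from dataclasses import dataclass

# One label per checklist category; the label strings are hypothetical.
REQUIRED_CATEGORIES = ("front", "profile", "full_body", "expression", "lighting")


@dataclass
class ImageRecord:
    name: str
    width: int
    height: int
    category: str


def audit_dataset(records, min_side=512, min_per_category=2, size_range=(15, 30)):
    """Return a list of human-readable warnings for a candidate dataset."""
    warnings = []
    low, high = size_range
    if not low <= len(records) <= high:
        warnings.append(f"dataset has {len(records)} images, expected {low}-{high}")
    for rec in records:
        # The shorter side must meet the minimum resolution.
        if min(rec.width, rec.height) < min_side:
            warnings.append(f"{rec.name}: {rec.width}x{rec.height} is below {min_side}px")
    counts = Counter(rec.category for rec in records)
    for category in REQUIRED_CATEGORIES:
        if counts[category] < min_per_category:
            warnings.append(
                f"category '{category}' has {counts[category]} images, "
                f"need at least {min_per_category}"
            )
    return warnings


if __name__ == "__main__":
    sample = [
        ImageRecord(f"img_{i:02d}.png", 768, 768, REQUIRED_CATEGORIES[i % 5])
        for i in range(20)
    ]
    sample.append(ImageRecord("tiny.png", 400, 600, "front"))
    for warning in audit_dataset(sample):
        print(warning)
```

Checks such as watermarks, blur, or how much of the frame the character fills still need a visual pass; this only catches the mechanical problems.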
Caption Generation Strategy
Method 1: General Description + Trigger Word
Caption template (with {mycharacter} as the trigger word): {mycharacter} a young woman, [detailed clothing/features description], [scene description], [lighting description]
Example:
{mycharacter} a young Asian woman with long black hair, wearing a red qipao, standing in a traditional Chinese garden, soft morning light
Method 2: Layered Description (Recommended)
Trigger word + Fixed character traits + Variable descriptions (scene, expression, clothing)
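Because the fixed traits appear in every caption while scene, expression, and clothing vary, the model ties the character's identity to the trigger word and leaves the variable parts controllable from the prompt after training.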
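A minimal sketch of the layered approach: keep the trigger word and fixed traits in one place and vary only the per-image details. The function name and the trait strings are illustrative; the output format follows the example caption in Method 1.

```python
TRIGGER = "{mycharacter}"  # placeholder from this article; substitute your own token
FIXED_TRAITS = ["a young Asian woman with long black hair"]


def build_caption(trigger, fixed_traits, variable):
    """Combine trigger word, fixed identity traits, and per-image details."""
    # Skip empty fragments so a missing field does not leave a stray comma.
    details = [part.strip() for part in [*fixed_traits, *variable] if part.strip()]
    return f"{trigger} {', '.join(details)}"


if __name__ == "__main__":
    caption = build_caption(
        TRIGGER,
        FIXED_TRAITS,
        ["wearing a red qipao", "standing in a traditional Chinese garden", "soft morning light"],
    )
    print(caption)
```

Generating captions this way keeps the fixed portion byte-for-byte identical across the dataset, which is easy to get wrong when writing captions by hand.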
Part 3: Training Parameter Deep Dive
Base Configuration
| Parameter | Recommended | Notes |
|---|---|---|
| Learning Rate | 5e-5 ~ 1e-4 | Lower end for character LoRA |
| Network Rank | 32 ~ 64 | Characters need higher rank |
| Network Alpha | 16 ~ 32 | Typically half of rank |
| Epochs | 10 ~ 20 | Too many = overfitting |
| Batch Size | 1 ~ 4 | Depends on VRAM |
| Optimizer | AdamW8bit | VRAM-friendly |
| Dataset Repeats | 10 ~ 20 | Adjust based on dataset size |
| Resolution | 512 or 768 | Match training target |
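Epochs, repeats, and batch size interact, so it helps to work out the total step count before training. The sketch below assumes the common kohya-style convention in which one epoch shows every image `repeats` times; other trainers may count steps differently, so treat the formula as an assumption to verify against your tool.

```python
import math


def total_steps(num_images, repeats, epochs, batch_size, grad_accum=1):
    """Optimizer steps for a run, assuming each epoch sees every image `repeats` times."""
    samples_per_epoch = num_images * repeats
    steps_per_epoch = math.ceil(samples_per_epoch / (batch_size * grad_accum))
    return steps_per_epoch * epochs


if __name__ == "__main__":
    # 20 images at batch size 2, at both ends of the table's ranges.
    print(total_steps(20, repeats=10, epochs=10, batch_size=2))
    print(total_steps(20, repeats=20, epochs=20, batch_size=2))
```

For a 20-image dataset at batch size 2, the low and high ends of the table work out to 1,000 and 4,000 steps, which shows how far apart "recommended" settings can land. If your step count falls well outside that band, adjust repeats or epochs rather than both at once.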
ERNIE-Image Specific Parameters
# ERNIE-Image LoRA Training Configuration
model: baidu/ERNIE-Image
base_model_revision: main
network_dim: 64
network_alpha: 32
learning_rate: 1e-4
lr_scheduler: cosine
lr_warmup_steps: 100
max_train_steps: 2000
mixed_precision: bf16
train_batch_size: 2
resolution: 768
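To see what `lr_scheduler: cosine` with `lr_warmup_steps: 100` does to the learning rate over 2000 steps, here is a small sketch. It assumes the widely used form of linear warmup followed by a half-cosine decay to zero; the exact curve depends on the trainer, so this is an illustration rather than any tool's implementation.

```python
import math


def lr_at_step(step, base_lr=1e-4, warmup_steps=100, max_steps=2000):
    """Learning rate at a given step: linear warmup, then cosine decay to zero."""
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    progress = min(progress, 1.0)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    for step in (0, 50, 100, 1050, 2000):
        print(f"step {step:>4}: lr = {lr_at_step(step):.2e}")
```

The rate peaks at the configured `learning_rate` when warmup ends and is half that value midway through the decay phase, so most of the fine detail is learned at a lower rate than the headline number suggests.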
Overfitting Detection
Good training signals:
- Loss drops rapidly in first 1000 steps, then stabilizes
- Character features gradually become clearer
- Consistent character across different prompts
Overfitting signals:
- Loss continues dropping but quality degrades
- All generated images look identical
- Character features "consume" background details
Solutions: stop training early, reduce the number of epochs, or add regularization images.
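Per-step diffusion loss is noisy, so the "drops, then stabilizes" signal is easier to spot on a smoothed curve. The sketch below smooths the loss with an exponential moving average and reports the first step where it has stopped improving meaningfully. The window and threshold values are arbitrary assumptions. Since loss can keep falling while quality degrades, treat the result only as a hint for when to start comparing saved checkpoints by eye.

```python
def ema(values, beta=0.98):
    """Exponential moving average of a sequence."""
    smoothed, avg = [], None
    for value in values:
        avg = value if avg is None else beta * avg + (1 - beta) * value
        smoothed.append(avg)
    return smoothed


def plateau_step(losses, window=200, min_rel_drop=0.01, beta=0.98):
    """First step where the smoothed loss fell by less than `min_rel_drop`
    (relative) over the previous `window` steps, or None if it never did."""
    smoothed = ema(losses, beta)
    for i in range(window, len(smoothed)):
        previous, current = smoothed[i - window], smoothed[i]
        if previous > 0 and (previous - current) / previous < min_rel_drop:
            return i
    return None


if __name__ == "__main__":
    # Synthetic curve: fast drop for 1000 steps, then flat.
    losses = [max(0.1, 1.0 - 0.0009 * step) for step in range(2000)]
    print(plateau_step(losses))
```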
Part 4: ComfyUI Deployment Workflow
Basic ERNIE-Image + Character LoRA Workflow
[Load Checkpoint: ERNIE-Image Base]
↓
[Load LoRA: character_lora.safetensors] → weight: 0.8~1.0
↓
[CLIP Text Encode] → positive prompt
[CLIP Text Encode] → negative prompt
↓
[Empty Latent Image] → 768x768
↓
[Sampler] → DPM++ 2M Karras, 20-30 steps, CFG 5-7
↓
[VAE Decode]
↓
[Save Image]
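The same graph can be written in ComfyUI's API (JSON) format, which is handy for batch generation. The sketch below uses ComfyUI's stock node class names. Whether ERNIE-Image loads through these stock nodes or needs model-specific loader nodes is an assumption, and the checkpoint file name is hypothetical. Sampler settings are taken from the middle of the ranges in the diagram.

```python
import json


def build_workflow(prompt, negative="", lora_weight=0.8, seed=0):
    """Return a ComfyUI API-format graph mirroring the diagram above.

    Links are [source_node_id, output_index].
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              # Hypothetical file name; use whatever your install provides.
              "inputs": {"ckpt_name": "ernie_image_base.safetensors"}},
        "2": {"class_type": "LoraLoader",
              "inputs": {"lora_name": "character_lora.safetensors",
                         "strength_model": lora_weight,
                         "strength_clip": lora_weight,
                         "model": ["1", 0], "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["2", 1]}},
        "4": {"class_type": "CLIPTextEncode",
              "inputs": {"text": negative, "clip": ["2", 1]}},
        "5": {"class_type": "EmptyLatentImage",
              "inputs": {"width": 768, "height": 768, "batch_size": 1}},
        "6": {"class_type": "KSampler",
              "inputs": {"model": ["2", 0], "positive": ["3", 0],
                         "negative": ["4", 0], "latent_image": ["5", 0],
                         "seed": seed, "steps": 25, "cfg": 6.0,
                         "sampler_name": "dpmpp_2m", "scheduler": "karras",
                         "denoise": 1.0}},
        "7": {"class_type": "VAEDecode",
              "inputs": {"samples": ["6", 0], "vae": ["1", 2]}},
        "8": {"class_type": "SaveImage",
              "inputs": {"filename_prefix": "character", "images": ["7", 0]}},
    }


if __name__ == "__main__":
    workflow = build_workflow("{mycharacter} full body, standing in a magical forest")
    print(json.dumps({"prompt": workflow}, indent=2))
```

Looping over `lora_weight` values from 0.6 to 1.0 with a fixed seed is a quick way to find the weight at which the character holds without the image becoming rigid.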
Advanced: Character LoRA + ControlNet Combo
[Character LoRA] → Maintain character identity
[ControlNet: OpenPose] → Control character pose
[ControlNet: Canny] → Control composition
→ Stack all three for precise character control
Advanced: Character LoRA + IP-Adapter Style Transfer
[Character LoRA] → Character identity consistent
[IP-Adapter] → Reference specific art style
→ Character stays the same, style changes
Part 5: Common Pitfalls and Solutions
Pitfall 1: LoRA Weight Too High
Symptom: Character features too strong, rigid images
Solution: Lower LoRA weight to 0.6-0.8, or adjust CFG scale
Pitfall 2: Insufficient Dataset Diversity
Symptom: Character only generates in specific scenes/angles
Solution: Add multi-angle, multi-expression training data
Pitfall 3: Trigger Word Pollution
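Symptom: The character appears even when the trigger word is absent, which means the model has absorbed the character into its general output instead of binding it to the trigger
Solution: Make sure every training caption includes the trigger word, and never use a real character name as the trigger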
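A quick way to enforce the first half of that rule is to scan the captions before training. This sketch assumes captions have been read into a dict keyed by file name; the function name is illustrative.

```python
def captions_missing_trigger(captions: dict[str, str], trigger: str) -> list[str]:
    """Return the file names whose caption does not start with the trigger word."""
    return sorted(
        name for name, text in captions.items()
        if not text.lstrip().startswith(trigger)
    )


if __name__ == "__main__":
    captions = {
        "front_01.txt": "{mycharacter} a young woman, smiling, soft morning light",
        "side_02.txt": "a young woman, side view, sunset lighting",
    }
    print(captions_missing_trigger(captions, "{mycharacter}"))
```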
Pitfall 4: Overtraining Reduces Generalization
Symptom: All generated characters look identical
Solution: Reduce training steps and add regularization images (other characters of the same general type, captioned without the trigger word)
Part 6: Practical Case — Training an Anime Character LoRA
Dataset (20 images)
- Front portraits × 5 (different expressions)
- Side or 3/4 profiles × 4
- Full body × 5 (different angles)
- Half body × 3
- Close-ups (eyes/hands) × 3
Training Configuration
network_dim: 64
network_alpha: 32
learning_rate: 1e-4
max_train_steps: 2500
epochs: 15
Test Prompts
Front: {mycharacter} a beautiful anime girl, long silver hair, blue eyes, smiling, white dress, fantasy background, cinematic lighting
Side: {mycharacter} side view, looking over shoulder, sunset lighting, long silver hair flowing in wind
Full body: {mycharacter} full body, standing in a magical forest, glowing particles, dynamic pose, detailed fantasy outfit
Part 7: ERNIE-Image Character LoRA vs Competitors
| Dimension | ERNIE-Image | Flux.2 | SDXL |
|---|---|---|---|
| Character Consistency | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ |
| Text Rendering | ⭐⭐⭐⭐⭐ | ⭐⭐⭐ | ⭐⭐ |
| Training Cost | ⭐⭐⭐ | ⭐⭐ | ⭐⭐⭐⭐ |
| Community Resources | ⭐⭐ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ |
| Commercial License | Apache 2.0 ✅ | Varies by variant ([dev] is non-commercial) | CreativeML Open RAIL++-M |
ERNIE-Image's unique edge: After character LoRA training, text rendering capability is preserved. You can generate posters and infographics with the character's name — something other models struggle with.
Summary
Character consistency LoRA training is the key step that transforms AI image generation from "toy" to "tool." ERNIE-Image, with its 8B DiT architecture and Apache 2.0 open-source license, provides a highly cost-effective platform for character LoRA training.
Key takeaways:
- Dataset quality > training parameters — 15-30 high-quality, multi-angle images are fundamental
- Overfitting is the most common trap — use early stopping and regularization data
- ComfyUI workflows enable rapid deployment of training results
- ERNIE-Image's text rendering + character LoRA is a unique combination advantage
References: Reddit r/StableDiffusion, r/ComfyUI, HuggingFace baidu/ERNIE-Image, dev.to LoRA Training Guide, RunDiffusion Character Consistency Template