ERNIE-Image to AI Video ComfyUI Workflow: Complete Pipeline from Static Images to Animation
Abstract: AI video generation is moving from experimental to practical. This article details how to use ERNIE-Image as a high-quality static image generator, seamlessly connected with LTX 2.3, Wan 2.2 and other video generation models in ComfyUI, building a complete "text-to-image → image-to-video" automated pipeline.
Why ERNIE-Image Is the Best Starting Point for Image-to-Video
In AI video generation workflows, the quality of the first frame sets the ceiling for the final video. ERNIE-Image suits image-to-video work for three reasons:
- High-quality static image generation: 8B DiT architecture, ranking among top open-source models in character consistency, scene detail, and lighting
- Text rendering capability: Accurate text can be embedded in static images, remaining readable after video generation
- Native ComfyUI support: RunComfy and mimicpc communities already provide complete ERNIE-Image ComfyUI workflows
Part 1: AI Image-to-Video Ecosystem Overview
Major Image-to-Video Models in 2026
| Model | Availability | Best Use Case | Duration | Resolution |
|---|---|---|---|---|
| LTX 2.3 | Open Source | General I2V, lip sync | 5-10s | 720p+ |
| Wan 2.2 | Open Source | Cinematic motion, smooth transitions | 5-10s | 1080p |
| Higgsfield | Closed Source | Professional animation | 10-30s | 4K |
| Kling 1.5 | Closed API | High realism | 10s | 1080p |
ERNIE-Image + Video Model Advantage
ERNIE-Image (static image) → LTX 2.3 / Wan 2.2 (video generation) → Post-processing (upscaling/color grading)
Why not just use text input directly on video models?
- Video generation models' text understanding is generally weaker than dedicated text-to-image models
- ERNIE-Image's text rendering and structured layout are its main strengths, and video models working from a text prompt alone rarely match them
- Two-stage workflows allow for human review and modification between stages
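The review point between the two stages can be pictured as a gate in a small driver loop. The sketch below is illustrative only: `generate_image`, `approve`, and `animate` are hypothetical callables standing in for an ERNIE-Image run, a human or scripted check, and an LTX 2.3 or Wan 2.2 run. None of them is a real API.

```python
from typing import Callable, Iterable, List


def run_two_stage(
    prompts: Iterable[str],
    generate_image: Callable[[str], str],
    approve: Callable[[str], bool],
    animate: Callable[[str], str],
) -> List[str]:
    """Return video paths for every prompt whose still passed review."""
    videos = []
    for prompt in prompts:
        still_path = generate_image(prompt)   # stage 1: text-to-image
        if not approve(still_path):           # review gate between the stages
            continue                          # rejected stills never reach stage 2
        videos.append(animate(still_path))    # stage 2: image-to-video
    return videos


# Stub callables (hypothetical) so the sketch runs without any model installed.
def fake_generate(prompt: str) -> str:
    return prompt.replace(" ", "_") + ".png"


def fake_approve(path: str) -> bool:
    return "blurry" not in path


def fake_animate(path: str) -> str:
    return path.replace(".png", ".mp4")


result = run_two_stage(
    ["coffee cup", "blurry cup"], fake_generate, fake_approve, fake_animate
)
print(result)
```

Because the expensive video stage only runs on approved stills, a rejected first frame costs seconds rather than a full video render.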
Part 2: ComfyUI Workflow Setup
Basic Workflow: ERNIE-Image → LTX 2.3 Image-to-Video
┌─────────────────────┐
│ ERNIE-Image Section │
├─────────────────────┤
│ 1. Load Checkpoint │
│ → ERNIE-Image │
│ Base │
│ 2. CLIP Text Encode │
│ → Scene description│
│ 3. Sampler │
│ → 20-30 steps │
│ 4. VAE Decode │
│ → Static output │
└────────┬────────────┘
↓
┌────────┴────────────┐
│ LTX 2.3 Section │
├─────────────────────┤
│ 5. Load Checkpoint │
│ → LTX 2.3 │
│ 6. Image → Video │
│ → Input: ERNIE │
│ output image │
│ 7. Video Sampler │
│ → 25 frames │
│ 8. Video Decode │
│ → MP4 output │
└─────────────────────┘
Key Node Configurations
ERNIE-Image Section:
Checkpoint: baidu/ERNIE-Image Base
Sampler: DPM++ 2M Karras
Steps: 25
CFG: 6.0
Resolution: 768x768 (square) or 768x1024 (portrait)
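For scripted runs, the settings above can be written as a ComfyUI API-format graph. This is a sketch under assumptions: it uses ComfyUI's stock nodes (`CheckpointLoaderSimple`, `CLIPTextEncode`, `EmptyLatentImage`, `KSampler`, `VAEDecode`, `SaveImage`), and it assumes ERNIE-Image loads through the stock checkpoint loader, which may not hold; a published ERNIE-Image workflow may use dedicated loader nodes instead. The checkpoint file name is hypothetical.

```python
import json


def build_still_graph(prompt, ckpt_name="ernie-image-base.safetensors",
                      width=768, height=768, steps=25, cfg=6.0, seed=0):
    """Build an API-format graph mirroring the ERNIE-Image settings above.

    Links are [source_node_id, output_index]. CheckpointLoaderSimple
    outputs MODEL (0), CLIP (1), VAE (2).
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": ckpt_name}},  # hypothetical file name
        "2": {"class_type": "CLIPTextEncode",       # positive prompt
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",       # empty negative prompt
              "inputs": {"text": "", "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": "dpmpp_2m",  # DPM++ 2M
                         "scheduler": "karras",       # Karras schedule
                         "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "ernie_still"}},
    }


def to_payload(graph):
    """Serialize the graph as the JSON body a ComfyUI server accepts on POST /prompt."""
    return json.dumps({"prompt": graph})


graph = build_still_graph("White ceramic coffee cup on a weathered oak table")
payload = to_payload(graph)
```

The payload is only built here, not sent. A local ComfyUI server normally listens on port 8188, so the string would be posted to `/prompt` on that host.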
LTX 2.3 Section:
Model: LTX-Video 2.3
First Frame: ERNIE-Image output
Motion Strength: 0.6-0.8 (adjust per scene)
Frames: 25 (~5 seconds at 5 fps, but only ~2-3 seconds at the 8-12 fps below)
FPS: 8-12 (video generation frame rate)
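Clip length is simply frame count divided by frame rate, so the frame and FPS settings have to be chosen together. A minimal sketch of that arithmetic follows; the function names are illustrative and not part of any model's API.

```python
import math


def clip_seconds(frames: int, fps: float) -> float:
    """Duration of a clip with the given frame count and frame rate."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return frames / fps


def frames_for(seconds: float, fps: float) -> int:
    """Smallest whole number of frames that covers the requested duration."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return math.ceil(seconds * fps)


for fps in (5, 8, 12):
    print(f"25 frames at {fps} fps = {clip_seconds(25, fps):.2f} s")
for fps in (8, 12):
    print(f"5 s at {fps} fps needs {frames_for(5, fps)} frames")
```

Under this arithmetic, 25 frames last 5 seconds only at 5 fps. At 8-12 fps, a 5-second clip needs 40-60 frames.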
Part 3: Advanced Workflows — Multi-Model Combos
Workflow A: ERNIE-Image → LTX 2.3 → Topaz Upscale
ERNIE-Image (768x768) → LTX 2.3 (720p video) → Topaz Video AI (2K/4K)
Suitable for commercial scenarios requiring high-resolution output, such as product showcase videos and advertising materials.
Workflow B: ERNIE-Image → Wan 2.2 Cinematic Motion
ERNIE-Image (character/scene) → Wan 2.2 (60FPS smooth motion) → CapCut editing
Wan 2.2 excels in motion smoothness, ideal for character animation and product showcases.
Workflow C: ERNIE-Image → LTX 2.3 Lip Sync
ERNIE-Image (character portrait) → LTX 2.3 IC-LoRA (lip sync) → Audio alignment
LTX 2.3's IC-LoRA supports precise lip synchronization, suitable for AI avatars and virtual presenters.
Part 4: Practical Cases
Case 1: E-commerce Product Video
Requirement: Generate product showcase videos in different settings
Workflow:
ERNIE-Image generates product shot:
Prompt: White ceramic coffee cup, on weathered oak table, morning kitchen setting, soft window light from left, warm tones, shallow depth of field, 8K commercial photography
LTX 2.3 adds motion:
- Steam rising effect
- Slow light changes
- Subtle camera movement
Topaz upscale to 2K
Result: A 5-second 2K product video ready for e-commerce pages
Case 2: AI Anime Short Film
Requirement: Create a 30-second AI anime short
Workflow:
- ERNIE-Image generates keyframes (5 scenes × 2 images = 10 static images)
- LTX 2.3 generates 5-second transitions per frame
- CapCut assembly + voiceover + subtitles
Key Prompt Examples:
Scene 1: Anime style, young woman standing under cherry blossom tree,
pink petals falling, soft light, shallow depth of field,
cinematic composition
Scene 2: Anime style, same character, walking down school corridor,
sunlight streaming through windows, scattered books
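Each keyframe is generated independently, so a phrase like "same character" gives the model nothing to refer back to. One common workaround is to repeat a fixed style and character description in every scene prompt. The sketch below shows one way to template that; the character description is a made-up example, and this is not a feature of ERNIE-Image itself.

```python
def scene_prompts(style, character, scenes):
    """Prefix every scene description with the same style and character text."""
    return [", ".join([style, character, scene]) for scene in scenes]


STYLE = "Anime style"
# Hypothetical character sheet, repeated verbatim in every prompt.
CHARACTER = "young woman with long black hair and a navy school uniform"

prompts = scene_prompts(
    STYLE,
    CHARACTER,
    [
        "standing under cherry blossom tree, pink petals falling, soft light",
        "walking down school corridor, sunlight streaming through windows",
    ],
)
for p in prompts:
    print(p)
```

Keeping the shared text in one place also means a change to the character's look propagates to all scenes at once.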
Case 3: Social Media Short Video
Requirement: Instagram/TikTok style portrait short videos
Workflow:
ERNIE-Image portrait generation:
Resolution: 768x1024 (3:4 portrait)
Prompt: Cinematic night scene, city skyline, neon lights, cyberpunk style, portrait composition
Wan 2.2 adds dynamic effects:
- Flickering lights
- Moving clouds
- Slow camera push-in
CapCut adds music and captions
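Reels and TikTok expect 9:16, so it is worth checking what ratio a resolution really is before generating. The helper below reduces a resolution to its ratio and finds a width for a target ratio. Snapping to a multiple of 16 is an assumption about what the image model accepts, not a documented ERNIE-Image requirement.

```python
from fractions import Fraction


def aspect_ratio(width: int, height: int):
    """Reduce width:height to lowest terms, e.g. 1920x1080 -> (16, 9)."""
    ratio = Fraction(width, height)
    return ratio.numerator, ratio.denominator


def width_for(height: int, ratio_w: int, ratio_h: int, multiple: int = 16) -> int:
    """Width closest to the target ratio, snapped to the given multiple."""
    exact = height * ratio_w / ratio_h
    return int(round(exact / multiple)) * multiple


print(aspect_ratio(768, 1024))   # the portrait size used in this article
print(width_for(1024, 9, 16))    # width for a 9:16 frame at the same height
```

Here 768x1024 reduces to 3:4, while a 1024-pixel-tall 9:16 frame is 576 pixels wide. Generating at 3:4 therefore means cropping or padding before upload.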
Part 5: Common Issues and Optimization
Q1: Text is blurry in videos?
Cause: Video generation models tend to blur text
Solutions:
- Ensure text clarity in the ERNIE-Image stage (its strength)
- Add text overlay layers in CapCut post-processing
- Reduce motion strength to keep text areas stable
Q2: Character distortion in video?
Cause: Video models may alter character features during motion
Solutions:
- Use ERNIE-Image character LoRA for more stable character images
- Lower LTX/Wan motion strength parameters
- Use ControlNet to constrain character morphology
Q3: Video too short?
Solutions:
- Multi-clip assembly: Generate multiple 5-second clips, assemble in CapCut
- Loop technique: Design matching first/last frames for seamless looping
- Slow-motion: Reduce playback speed in post-production
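The multi-clip and slow-motion options reduce to simple arithmetic, and assembly can be scripted when a GUI editor is not wanted. The sketch below uses ffmpeg's concat demuxer as a stand-in for CapCut, which is a substitution and not the workflow described above. File names are hypothetical, and stream copy assumes all clips share codec, resolution, and frame rate.

```python
import math


def clips_needed(target_seconds: float, clip_seconds: float = 5.0) -> int:
    """How many fixed-length clips are needed to cover the target duration."""
    return math.ceil(target_seconds / clip_seconds)


def slowed_seconds(clip_seconds: float, speed: float) -> float:
    """Playback length after slowing a clip, e.g. speed=0.5 doubles it."""
    return clip_seconds / speed


def concat_list(paths) -> str:
    """Contents of the list file read by ffmpeg's concat demuxer."""
    return "".join(f"file '{p}'\n" for p in paths)


def concat_command(list_file: str, output: str):
    """ffmpeg invocation that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]


n = clips_needed(30)                                   # 30-second target
names = [f"clip_{i:02d}.mp4" for i in range(1, n + 1)]  # hypothetical names
print(concat_list(names), end="")
print(" ".join(concat_command("clips.txt", "short.mp4")))
```

The command is only printed here. Writing `concat_list(names)` to `clips.txt` and running the command would produce the joined file.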
Q4: Unnatural motion?
Solutions:
- Adjust Motion Strength: too high = twitching, too low = still
- Try different video models: Wan 2.2 for character motion, LTX for scene motion
- Generate more dynamic compositions in the ERNIE-Image stage (tilted angles, motion blur prompts)
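Finding the right motion strength is easier with a small sweep at a fixed seed, so that strength is the only thing changing between renders. The sketch below only builds the job list. The `motion_strength` and `seed` keys are hypothetical names, and how they map onto a given LTX or Wan node depends on the workflow used.

```python
def strength_sweep(low: float = 0.6, high: float = 0.8, steps: int = 5, seed: int = 42):
    """Evenly spaced motion-strength settings, all sharing one seed."""
    if steps < 2:
        raise ValueError("steps must be at least 2")
    step = (high - low) / (steps - 1)
    return [
        {"motion_strength": round(low + i * step, 2), "seed": seed}
        for i in range(steps)
    ]


jobs = strength_sweep()
for job in jobs:
    print(job)
```

Comparing the renders side by side shows where twitching starts and where the clip becomes nearly static.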
Part 6: Hardware Requirements and Performance
Recommended Setup
| Component | Minimum | Recommended |
|---|---|---|
| GPU | RTX 3060 12GB | RTX 4090 24GB |
| VRAM | 12GB | 24GB |
| RAM | 16GB | 32GB |
| Storage | 50GB SSD | 100GB NVMe |
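A quick way to see where a machine falls in this table is to read total VRAM from `nvidia-smi`. The sketch below parses the output of `nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits`, which reports MiB per GPU, and compares it with the 12 GB and 24 GB rows. Rounding to whole gigabytes is an assumption made because cards usually report slightly less than their nominal size.

```python
import subprocess


def parse_vram_mib(text: str):
    """Parse one MiB value per line, as printed with nounits."""
    return [int(line) for line in text.splitlines() if line.strip()]


def tier(vram_mib: int) -> str:
    """Classify a GPU against the minimum (12 GB) and recommended (24 GB) rows."""
    gb = round(vram_mib / 1024)  # nominal size; reported totals run slightly low
    if gb >= 24:
        return "recommended"
    if gb >= 12:
        return "minimum"
    return "below minimum"


def query_vram_mib():
    """Ask nvidia-smi for total memory per GPU (requires an NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_vram_mib(out)


# Example with sample values instead of a live query.
for mib in parse_vram_mib("24564\n12288\n8192\n"):
    print(mib, "MiB ->", tier(mib))
```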
Performance Reference (RTX 4090 24GB)
- ERNIE-Image generation (768×768, 25 steps): ~3-5 seconds
- LTX 2.3 image-to-video (25 frames): ~30-60 seconds
- Wan 2.2 image-to-video (25 frames): ~45-90 seconds
- Topaz upscale (720p → 2K, 125 frames): ~2-5 minutes
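These ranges add up to a rough per-clip budget. The sketch below sums the low and high ends for the ERNIE-Image → LTX 2.3 → Topaz path and scales by clip count. It assumes the stages run one after another and ignores model loading and manual review time, so treat the result as a lower bound.

```python
# (low, high) seconds per stage, taken from the reference figures above.
STAGES = {
    "ernie_image": (3, 5),
    "ltx_i2v": (30, 60),
    "topaz_upscale": (120, 300),
}


def per_clip_seconds(stages=STAGES):
    """Sum the low and high estimates across all stages."""
    low = sum(lo for lo, _ in stages.values())
    high = sum(hi for _, hi in stages.values())
    return low, high


def batch_minutes(clips: int, stages=STAGES):
    """Sequential processing time for a batch, in minutes."""
    low, high = per_clip_seconds(stages)
    return clips * low / 60, clips * high / 60


low, high = per_clip_seconds()
print(f"per clip: {low}-{high} s")
b_low, b_high = batch_minutes(6)
print(f"6 clips: {b_low:.1f}-{b_high:.1f} min")
```

On these figures the upscale step dominates the budget, so skipping it for draft renders saves most of the time.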
Part 7: Commercial Applications
1. E-commerce Product Videos
- Cost comparison: Traditional product video $500-5,000/clip vs AI generation $0-5/clip (self-hosted)
- Efficiency: From 3-5 day shooting cycle to 1-2 hours
- Use case: Small businesses, independent brands, rapid market testing
2. Social Media Content
- Platforms: Instagram Reels, TikTok, YouTube Shorts
- Advantage: Batch generation, rapid iteration, no shooting location needed
3. Education & Training
- Use case: Online course visualization, operation demos, concept explanations
- Advantage: Text rendering + video = complete teaching materials
4. Advertising & Marketing
- Use case: Social media ads, landing page videos, A/B test materials
- Advantage: Quick generation of multiple versions for testing
Summary
Pairing ERNIE-Image with a video generation model combines high-quality static image generation with dynamic video effects. Key takeaways:
- ERNIE-Image is the best first frame: High-quality static images + text rendering = video quality ceiling
- Two-stage workflow beats end-to-end: Mid-process review, flexible modification, multi-model combos
- ComfyUI is the core hub: Unified platform connecting ERNIE-Image, LTX, Wan, Topaz
- Post-processing matters: CapCut editing, Topaz upscaling, audio alignment are essential for professional output
As LTX 2.3 and Wan 2.2 continue to improve, AI video generation quality is rapidly increasing. ERNIE-Image, as a high-quality open-source text-to-image model, will be an indispensable starting point in this workflow.
References: RunComfy ERNIE-Image ComfyUI Workflow, mimicpc.com workflow library, YouTube "How to Make Professional AI Animations in 2026", HuggingFace baidu/ERNIE-Image