ERNIE-Image to AI Video ComfyUI Workflow: Complete Pipeline from Static Images to Animation

Abstract: AI video generation is moving from experimental to practical. This article shows how to use ERNIE-Image as a high-quality static image generator and connect it to LTX 2.3, Wan 2.2, and other video generation models in ComfyUI, building a complete "text-to-image → image-to-video" automated pipeline.

Why ERNIE-Image Is the Best Starting Point for Image-to-Video

In AI video generation workflows, the quality of the first frame sets the ceiling for the final video. ERNIE-Image suits image-to-video work for three reasons:

High-quality static image generation: an 8B DiT architecture that ranks among the top open-source models for character consistency, scene detail, and lighting
Text rendering capability: Accurate text can be embedded in static images, remaining readable after video generation
Native ComfyUI support: RunComfy and mimicpc communities already provide complete ERNIE-Image ComfyUI workflows

Part 1: AI Image-to-Video Ecosystem Overview

Major Image-to-Video Models in 2026

Model	Availability	Best Use Case	Duration	Resolution
LTX 2.3	Open Source	General I2V, lip sync	5-10s	720p+
Wan 2.2	Open Source	Cinematic motion, smooth transitions	5-10s	1080p
Higgsfield	Closed Source	Professional animation	10-30s	4K
Kling 1.5	Closed API	High realism	10s	1080p

ERNIE-Image + Video Model Advantage

ERNIE-Image (static image) → LTX 2.3 / Wan 2.2 (video generation) → Post-processing (upscaling/color grading)

Why not just use text input directly on video models?

Video generation models' text understanding is generally weaker than that of dedicated text-to-image models
ERNIE-Image's text rendering and structured layout are strengths that video models generally lack
Two-stage workflows allow for human review and modification between stages
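
The review step between the two stages can be sketched in Python as follows. This is a minimal illustration rather than a ComfyUI API: `generate_image`, `generate_video`, and `approve` are hypothetical callables you would supply yourself, for example thin wrappers around your own workflow runs and a manual yes/no check.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class StageResult:
    prompt: str
    image_path: str
    video_path: Optional[str] = None


def run_two_stage(
    prompt: str,
    generate_image: Callable[[str], str],
    generate_video: Callable[[str], str],
    approve: Callable[[str], bool],
    max_attempts: int = 3,
) -> Optional[StageResult]:
    """Generate a still, let a reviewer accept or reject it, then animate it.

    All three callables are hypothetical placeholders: generate_image maps a
    prompt to an image path, generate_video maps an image path to a video
    path, and approve is the human (or automated) review step.
    """
    for _ in range(max_attempts):
        image_path = generate_image(prompt)
        if approve(image_path):
            # Only approved stills reach the slower video stage.
            return StageResult(prompt, image_path, generate_video(image_path))
    return None  # No still was approved within max_attempts.
```

The point of the design is that a rejected still costs only a few seconds of image generation, while the much slower video stage runs only on frames that passed review.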

Part 2: ComfyUI Workflow Setup

Basic Workflow: ERNIE-Image → LTX 2.3 Image-to-Video

┌─────────────────────┐
│ ERNIE-Image Section  │
├─────────────────────┤
│ 1. Load Checkpoint  │
│    → ERNIE-Image    │
│       Base           │
│ 2. CLIP Text Encode │
│    → Scene description│
│ 3. Sampler          │
│    → 20-30 steps    │
│ 4. VAE Decode       │
│    → Static output  │
└────────┬────────────┘
         ↓
┌────────┴────────────┐
│ LTX 2.3 Section     │
├─────────────────────┤
│ 5. Load Checkpoint  │
│    → LTX 2.3        │
│ 6. Image → Video    │
│    → Input: ERNIE   │
│       output image   │
│ 7. Video Sampler    │
│    → 25 frames      │
│ 8. Video Decode     │
│    → MP4 output     │
└─────────────────────┘

Key Node Configurations

ERNIE-Image Section:

Checkpoint: baidu/ERNIE-Image Base
Sampler: DPM++ 2M Karras
Steps: 25
CFG: 6.0
Resolution: 768x768 (square) or 768x1024 (portrait)
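
For scripted runs, the image half of the workflow can be written as a ComfyUI API-format graph, where each node is a dictionary entry and each link is a `[source_node_id, output_index]` pair. The sketch below uses ComfyUI's standard core nodes with the settings listed above. Two things are assumptions: that ERNIE-Image loads through `CheckpointLoaderSimple` (taken from the "Load Checkpoint" box in the diagram), and the checkpoint file name, which is a placeholder. The LTX 2.3 half is left out because its node names depend on the custom nodes you have installed.

```python
import json


def build_image_stage(prompt: str, width: int = 768, height: int = 768,
                      steps: int = 25, cfg: float = 6.0, seed: int = 0) -> dict:
    """Return the image stage as a ComfyUI API-format graph.

    Each key is a node id; a link is written as [source_node_id, output_index].
    The checkpoint file name is a placeholder, and loading ERNIE-Image through
    CheckpointLoaderSimple is an assumption, not a confirmed loading path.
    """
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              "inputs": {"ckpt_name": "ernie-image-base.safetensors"}},
        "2": {"class_type": "CLIPTextEncode",
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",  # empty negative prompt
              "inputs": {"text": "", "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0],
                         "negative": ["3", 0], "latent_image": ["4", 0],
                         "seed": seed, "steps": steps, "cfg": cfg,
                         "sampler_name": "dpmpp_2m", "scheduler": "karras",
                         "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0],
                         "filename_prefix": "ernie_first_frame"}},
    }


def dangling_links(graph: dict) -> list:
    """List (node_id, input_name) pairs whose link points at a missing node."""
    missing = []
    for node_id, node in graph.items():
        for name, value in node["inputs"].items():
            is_link = isinstance(value, list) and len(value) == 2
            if is_link and value[0] not in graph:
                missing.append((node_id, name))
    return missing


graph = build_image_stage("White ceramic coffee cup on a weathered oak table")
print("dangling links:", dangling_links(graph))
# JSON body that a local ComfyUI server accepts on its POST /prompt route.
payload = json.dumps({"prompt": graph})
```

Checking for dangling links before submitting catches the most common hand-editing mistake, a node id that was renamed in one place but not in the inputs that refer to it.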

LTX 2.3 Section:

Model: LTX-Video 2.3
First Frame: ERNIE-Image output
Motion Strength: 0.6-0.8 (adjust per scene)
Frames: 25 (~5 seconds at 5 fps; roughly 2-3 seconds at the 8-12 fps listed below)
FPS: 8-12 (video generation frame rate)
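
Frame count and frame rate together fix the clip length: duration = frames / fps. The short sketch below makes that arithmetic explicit so the two settings can be chosen together. The function names are illustrative only and do not correspond to any ComfyUI node.

```python
import math


def clip_seconds(frames: int, fps: float) -> float:
    """Playback length in seconds of a clip with `frames` frames at `fps`."""
    if frames <= 0 or fps <= 0:
        raise ValueError("frames and fps must be positive")
    return frames / fps


def frames_needed(seconds: float, fps: float) -> int:
    """Smallest frame count that plays for at least `seconds` at `fps`."""
    if seconds <= 0 or fps <= 0:
        raise ValueError("seconds and fps must be positive")
    return math.ceil(seconds * fps)


for fps in (5, 8, 12):
    print(f"25 frames @ {fps} fps = {clip_seconds(25, fps):.2f} s")
print("frames for 5 s @ 12 fps:", frames_needed(5, 12))
```

If the target is a 5-second clip at 8-12 fps, the frame count has to rise to 40-60; keeping 25 frames at those rates yields a clip of roughly 2 to 3 seconds.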

Part 3: Advanced Workflows — Multi-Model Combos

Workflow A: ERNIE-Image → LTX 2.3 → Topaz Upscale

ERNIE-Image (768x768) → LTX 2.3 (720p video) → Topaz Video AI (2K/4K)

Suitable for commercial scenarios requiring high-resolution output, such as product showcase videos and advertising materials.

Workflow B: ERNIE-Image → Wan 2.2 Cinematic Motion

ERNIE-Image (character/scene) → Wan 2.2 (60FPS smooth motion) → CapCut editing

Wan 2.2 excels in motion smoothness, ideal for character animation and product showcases.

Workflow C: ERNIE-Image → LTX 2.3 Lip Sync

ERNIE-Image (character portrait) → LTX 2.3 IC-LoRA (lip sync) → Audio alignment

LTX 2.3's IC-LoRA supports precise lip synchronization, suitable for AI avatars and virtual presenters.

Part 4: Practical Cases

Case 1: E-commerce Product Video

Requirement: Generate product showcase videos in different settings

Workflow:

ERNIE-Image generates product shot:

Prompt: White ceramic coffee cup, on weathered oak table, 
morning kitchen setting, soft window light from left, 
warm tones, shallow depth of field, 8K commercial photography

LTX 2.3 adds motion:
- Steam rising effect
- Slow light changes
- Subtle camera movement
Topaz upscale to 2K

Result: A 5-second 2K product video ready for e-commerce pages
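
Since the requirement is the same product in different settings, the prompt can be assembled from reusable parts instead of being rewritten by hand. The sketch below is one possible way to do that. The subject, style, and first setting/lighting pair come from the prompt above; the second setting and second lighting option are hypothetical variants added for illustration.

```python
from itertools import product

SUBJECT = "White ceramic coffee cup"
STYLE = "warm tones, shallow depth of field, 8K commercial photography"

# The first entry of each list comes from the case above;
# the second entries are hypothetical variants.
SETTINGS = [
    "on weathered oak table, morning kitchen setting",
    "on marble counter, minimalist cafe setting",
]
LIGHTING = [
    "soft window light from left",
    "golden hour backlight",
]


def build_prompts(subject, settings, lighting, style):
    """Return one comma-separated prompt per (setting, lighting) combination."""
    return [
        ", ".join([subject, setting, light, style])
        for setting, light in product(settings, lighting)
    ]


prompts = build_prompts(SUBJECT, SETTINGS, LIGHTING, STYLE)
for p in prompts:
    print(p)
```

Keeping the subject and style fixed while varying only the setting and lighting makes the resulting stills easier to compare side by side before any of them is sent to the video stage.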

Case 2: AI Anime Short Film

Requirement: Create a 30-second AI anime short

Workflow:

ERNIE-Image generates keyframes (5 scenes × 2 images = 10 static images)
LTX 2.3 generates a 5-second clip from each keyframe
CapCut assembly + voiceover + subtitles

Key Prompt Examples:

Scene 1: Anime style, young woman standing under cherry blossom tree, pink petals falling, soft light, shallow depth of field, cinematic composition

Scene 2: Anime style, same character, walking down school corridor, sunlight streaming through windows, scattered books
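
A shot list helps keep the character description identical across scenes and shows how much raw footage the plan produces. The sketch below is an illustrative planning helper, not part of any tool named in this article; the `Shot` class and function names are hypothetical. Note that 5 scenes × 2 images at 5 seconds each give 50 seconds of raw footage, so a 30-second short leaves room to trim or drop the weaker takes.

```python
from dataclasses import dataclass

# Shared prefix so every keyframe describes the character the same way.
CHARACTER = "Anime style, young woman"


@dataclass
class Shot:
    scene: int
    variant: int
    prompt: str
    seconds: float = 5.0


def plan_shots(scene_actions, variants_per_scene=2, seconds=5.0):
    """Expand each scene description into `variants_per_scene` keyframe shots."""
    return [
        Shot(scene=i, variant=v, prompt=f"{CHARACTER}, {action}", seconds=seconds)
        for i, action in enumerate(scene_actions, start=1)
        for v in range(1, variants_per_scene + 1)
    ]


def total_seconds(shots):
    """Sum of clip lengths if every keyframe is animated."""
    return sum(s.seconds for s in shots)


actions = [
    "standing under cherry blossom tree, pink petals falling",
    "walking down school corridor, sunlight streaming through windows",
]
shots = plan_shots(actions)
print(len(shots), "keyframes,", total_seconds(shots), "seconds of raw footage")
```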

Case 3: Social Media Short Video

Requirement: Instagram/TikTok style portrait short videos

Workflow:

ERNIE-Image portrait generation:

Resolution: 768x1024 (3:4 portrait; crop to 9:16 for Reels/TikTok)
Prompt: Cinematic night scene, city skyline, neon lights, 
        cyberpunk style, portrait composition

Wan 2.2 adds dynamic effects:
- Flickering lights
- Moving clouds
- Slow camera push-in
CapCut adds music and captions
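
Because 768x1024 reduces to 3:4 rather than 9:16, a vertical-video export needs a crop. The sketch below computes the aspect ratio and the largest centered crop box for a target ratio. It is plain arithmetic with illustrative function names; the returned tuple uses the (left, top, right, bottom) order that Pillow's `Image.crop` expects, but no imaging library is required to run it.

```python
from fractions import Fraction


def aspect_ratio(width: int, height: int) -> Fraction:
    """Reduced width:height ratio, e.g. 768x1024 -> 3/4."""
    return Fraction(width, height)


def center_crop_box(width: int, height: int, target_w: int, target_h: int):
    """Largest centered (left, top, right, bottom) box with the target ratio."""
    target = Fraction(target_w, target_h)
    if Fraction(width, height) > target:
        # Source is too wide: keep full height, trim the sides.
        new_w, new_h = int(height * target), height
    else:
        # Source is too tall (or already matches): keep full width.
        new_w, new_h = width, int(width / target)
    left = (width - new_w) // 2
    top = (height - new_h) // 2
    return (left, top, left + new_w, top + new_h)


print(aspect_ratio(768, 1024))
print(center_crop_box(768, 1024, 9, 16))
```

For a 768x1024 frame the 9:16 crop keeps the full height and a 576-pixel-wide center strip, so it is worth composing the subject near the center of the still.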

Part 5: Common Issues and Optimization

Q1: Text is blurry in videos?

Cause: Video generation models tend to blur text

Solutions:

Ensure text clarity in the ERNIE-Image stage (its strength)
Add text overlay layers in CapCut post-processing
Reduce motion strength to keep text areas stable

Q2: Character distortion in video?

Cause: Video models may alter character features during motion

Solutions:

Use ERNIE-Image character LoRA for more stable character images
Lower LTX/Wan motion strength parameters
Use ControlNet to constrain character morphology

Q3: Video too short?

Solutions:

Multi-clip assembly: Generate multiple 5-second clips, assemble in CapCut
Loop technique: Design matching first/last frames for seamless looping
Slow-motion: Reduce playback speed in post-production
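
If you prefer a scriptable alternative to CapCut for the assembly and slow-motion steps, ffmpeg can do both. The sketch below swaps in ffmpeg's concat demuxer and `setpts` filter and only builds the list file text and command lines; it does not run them. The file names are hypothetical, and the concat list assumes paths without single quotes and clips that share the same codec and resolution.

```python
import math


def clips_needed(target_seconds: float, clip_seconds: float = 5.0) -> int:
    """Number of fixed-length clips required to cover the target duration."""
    return math.ceil(target_seconds / clip_seconds)


def concat_list(paths) -> str:
    """Text of an ffmpeg concat-demuxer list file, one `file '...'` line per clip."""
    return "".join(f"file '{p}'\n" for p in paths)


def concat_command(list_file: str, output: str) -> list:
    """ffmpeg call that joins the listed clips without re-encoding."""
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", list_file, "-c", "copy", output]


def slow_motion_command(src: str, output: str, factor: float = 2.0) -> list:
    """ffmpeg call that stretches video timestamps by `factor` (2.0 = half speed).

    Audio is dropped with -an because setpts only retimes the video stream.
    """
    return ["ffmpeg", "-i", src, "-filter:v", f"setpts={factor}*PTS",
            "-an", output]


n = clips_needed(30)
paths = [f"clip_{i:02d}.mp4" for i in range(1, n + 1)]  # hypothetical names
print(concat_list(paths), end="")
print(" ".join(concat_command("clips.txt", "short.mp4")))
```

Slowing a low-frame-rate clip without interpolation makes individual frames more visible, so the slow-motion option works best on footage with gentle motion.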

Q4: Unnatural motion?

Solutions:

Adjust Motion Strength: too high = twitching, too low = still
Try different video models: Wan 2.2 for character motion, LTX for scene motion
Generate more dynamic compositions in the ERNIE-Image stage (tilted angles, motion blur prompts)
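
One systematic way to find a usable Motion Strength is a small sweep: render the same first frame at evenly spaced values with a fixed seed and compare the results. The sketch below only generates the list of settings to try. The `motion_strength` and `seed` keys are hypothetical names standing in for whatever your video node calls these parameters.

```python
def strength_sweep(low: float = 0.6, high: float = 0.8,
                   steps: int = 5, seed: int = 42) -> list:
    """Return `steps` evenly spaced motion-strength settings from low to high.

    The seed is held constant so that differences between renders come from
    the motion strength alone.
    """
    if steps < 2:
        raise ValueError("steps must be at least 2")
    if low > high:
        raise ValueError("low must not exceed high")
    values = [round(low + (high - low) * i / (steps - 1), 3) for i in range(steps)]
    return [{"motion_strength": v, "seed": seed} for v in values]


for setting in strength_sweep():
    print(setting)
```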

Part 6: Hardware Requirements and Performance

Recommended Setup

Component	Minimum	Recommended
GPU	RTX 3060 12GB	RTX 4090 24GB
VRAM	12GB	24GB
RAM	16GB	32GB
Storage	50GB SSD	100GB NVMe

Performance Reference (RTX 4090 24GB)

ERNIE-Image generation (768×768, 25 steps): ~3-5 seconds
LTX 2.3 image-to-video (25 frames): ~30-60 seconds
Wan 2.2 image-to-video (25 frames): ~45-90 seconds
Topaz upscale (720p → 2K, 125 frames): ~2-5 minutes
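
These per-step ranges can be added up to budget a whole job. The sketch below is a rough estimator that assumes the stages run one after another, once per clip, and ignores model loading and swapping time. The dictionary keys are illustrative labels; the numbers are the ranges listed above converted to seconds.

```python
# (low, high) seconds per item, copied from the RTX 4090 figures above.
STAGE_SECONDS = {
    "ernie_image": (3, 5),
    "ltx_i2v": (30, 60),
    "wan_i2v": (45, 90),
    "topaz_upscale": (120, 300),
}


def estimate(stages, clips: int):
    """Return (low, high) total seconds for running `stages` once per clip."""
    low = sum(STAGE_SECONDS[s][0] for s in stages) * clips
    high = sum(STAGE_SECONDS[s][1] for s in stages) * clips
    return low, high


low, high = estimate(["ernie_image", "ltx_i2v", "topaz_upscale"], clips=6)
print(f"6 clips: {low / 60:.1f} to {high / 60:.1f} minutes")
```

For six clips through the ERNIE-Image, LTX 2.3, and Topaz stages, this gives 918 to 2,190 seconds, or roughly 15 to 37 minutes, with upscaling taking most of the time.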

Part 7: Commercial Applications

1. E-commerce Product Videos

Cost comparison: Traditional product video $500-5,000/clip vs AI generation $0-5/clip (self-hosted)
Efficiency: From 3-5 day shooting cycle to 1-2 hours
Use case: Small businesses, independent brands, rapid market testing

2. Social Media Content

Platforms: Instagram Reels, TikTok, YouTube Shorts
Advantage: Batch generation, rapid iteration, no shooting location needed

3. Education & Training

Use case: Online course visualization, operation demos, concept explanations
Advantage: Text rendering + video = complete teaching materials

4. Advertising & Marketing

Use case: Social media ads, landing page videos, A/B test materials
Advantage: Quick generation of multiple versions for testing

Summary

Pairing ERNIE-Image with a video generation model combines high-quality static image generation with dynamic video effects. Key takeaways:

ERNIE-Image is the best first frame: High-quality static images + text rendering = video quality ceiling
Two-stage workflow beats end-to-end: Mid-process review, flexible modification, multi-model combos
ComfyUI is the core hub: Unified platform connecting ERNIE-Image, LTX, Wan, Topaz
Post-processing matters: CapCut editing, Topaz upscaling, audio alignment are essential for professional output

As LTX 2.3 and Wan 2.2 continue to improve, AI video generation quality is rapidly increasing. ERNIE-Image, as a high-quality open-source text-to-image model, will be an indispensable starting point in this workflow.

References: RunComfy ERNIE-Image ComfyUI Workflow, mimicpc.com workflow library, YouTube "How to Make Professional AI Animations in 2026", HuggingFace baidu/ERNIE-Image