ERNIE-Image to AI Video ComfyUI Workflow: Complete Pipeline from Static Images to Animation

May 26, 2026

Abstract: AI video generation is moving from experimental to practical. This article details how to use ERNIE-Image as a high-quality static image generator, seamlessly connected with LTX 2.3, Wan 2.2 and other video generation models in ComfyUI, building a complete "text-to-image → image-to-video" automated pipeline.


Why ERNIE-Image Is the Best Starting Point for Image-to-Video

In AI video generation workflows, the quality of the first frame sets the ceiling for the final video. ERNIE-Image suits image-to-video work for three reasons:

  1. High-quality static image generation: 8B DiT architecture, ranking among top open-source models in character consistency, scene detail, and lighting
  2. Text rendering capability: Accurate text can be embedded in static images, remaining readable after video generation
  3. Native ComfyUI support: RunComfy and mimicpc communities already provide complete ERNIE-Image ComfyUI workflows

Part 1: AI Image-to-Video Ecosystem Overview

Major Image-to-Video Models in 2026

Model        Availability    Best Use Case                           Duration   Resolution
LTX 2.3      Open source     General I2V, lip sync                   5-10s      720p+
Wan 2.2      Open source     Cinematic motion, smooth transitions    5-10s      1080p
Higgsfield   Closed source   Professional animation                  10-30s     4K
Kling 1.5    Closed API      High realism                            10s        1080p

ERNIE-Image + Video Model Advantage

ERNIE-Image (static image) → LTX 2.3 / Wan 2.2 (video generation) → Post-processing (upscaling/color grading)

Why not just use text input directly on video models?

  • Video generation models' text understanding is generally weaker than dedicated text-to-image models
  • Text rendering and structured layout are ERNIE-Image's main strengths, and they carry over into the first frame
  • Two-stage workflows allow for human review and modification between stages
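The review point between the two stages can be sketched as a small orchestration function. This is only an illustration and not part of any ComfyUI API: `generate_image`, `review`, and `animate` are hypothetical callables standing in for the ERNIE-Image stage, a human approval step, and the video stage.

```python
from typing import Callable, Optional


def two_stage_pipeline(
    prompt: str,
    generate_image: Callable[[str], str],  # hypothetical: prompt -> image path
    review: Callable[[str], bool],         # hypothetical: image path -> approved?
    animate: Callable[[str], str],         # hypothetical: image path -> video path
    max_attempts: int = 3,
) -> Optional[str]:
    """Generate a still, let a reviewer accept or reject it, then animate it."""
    for _ in range(max_attempts):
        image_path = generate_image(prompt)
        if review(image_path):
            return animate(image_path)
    return None  # no still was approved, so no video is rendered


# Usage with stand-in functions (no models are called here).
counter = {"n": 0}


def fake_generate(prompt: str) -> str:
    counter["n"] += 1
    return f"still_{counter['n']}.png"


result = two_stage_pipeline(
    "white ceramic coffee cup",
    generate_image=fake_generate,
    review=lambda path: path == "still_2.png",  # reject the first still
    animate=lambda path: path.replace(".png", ".mp4"),
)
print(result)  # still_2.mp4
```

The expensive video stage only runs once a still has been approved, which is the practical benefit of splitting the pipeline.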

Part 2: ComfyUI Workflow Setup

Basic Workflow: ERNIE-Image → LTX 2.3 Image-to-Video

┌────────────────────────┐
│ ERNIE-Image Section    │
├────────────────────────┤
│ 1. Load Checkpoint     │
│    → ERNIE-Image       │
│       Base             │
│ 2. CLIP Text Encode    │
│    → Scene description │
│ 3. Sampler             │
│    → 20-30 steps       │
│ 4. VAE Decode          │
│    → Static output     │
└────────┬───────────────┘
         ↓
┌────────┴───────────────┐
│ LTX 2.3 Section        │
├────────────────────────┤
│ 5. Load Checkpoint     │
│    → LTX 2.3           │
│ 6. Image → Video       │
│    → Input: ERNIE      │
│       output image     │
│ 7. Video Sampler       │
│    → 25 frames         │
│ 8. Video Decode        │
│    → MP4 output        │
└────────────────────────┘

Key Node Configurations

ERNIE-Image Section:

Checkpoint: baidu/ERNIE-Image Base
Sampler: DPM++ 2M Karras
Steps: 25
CFG: 6.0
Resolution: 768x768 (square) or 768x1024 (portrait)
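For scripted runs, the still-image settings above can be expressed as a ComfyUI API-format graph and posted to a running server. The sketch below uses ComfyUI's core node classes and the `/prompt` endpoint. It assumes ERNIE-Image loads through the standard checkpoint loader and text encoder, as the diagram's "Load Checkpoint" and "CLIP Text Encode" labels suggest; published ERNIE-Image workflows may use dedicated loader nodes instead. The checkpoint filename and output prefix are hypothetical.

```python
import json
import urllib.request


def build_still_graph(prompt: str, width: int = 768, height: int = 768, seed: int = 0) -> dict:
    """Return a ComfyUI API-format graph for the still-image stage."""
    return {
        "1": {"class_type": "CheckpointLoaderSimple",
              # Hypothetical filename; use the name found in your models folder.
              "inputs": {"ckpt_name": "ernie-image-base.safetensors"}},
        "2": {"class_type": "CLIPTextEncode",  # positive prompt
              "inputs": {"text": prompt, "clip": ["1", 1]}},
        "3": {"class_type": "CLIPTextEncode",  # empty negative prompt
              "inputs": {"text": "", "clip": ["1", 1]}},
        "4": {"class_type": "EmptyLatentImage",
              "inputs": {"width": width, "height": height, "batch_size": 1}},
        "5": {"class_type": "KSampler",
              "inputs": {"model": ["1", 0], "positive": ["2", 0], "negative": ["3", 0],
                         "latent_image": ["4", 0], "seed": seed, "steps": 25, "cfg": 6.0,
                         # "DPM++ 2M Karras" is sampler dpmpp_2m with the karras scheduler.
                         "sampler_name": "dpmpp_2m", "scheduler": "karras", "denoise": 1.0}},
        "6": {"class_type": "VAEDecode",
              "inputs": {"samples": ["5", 0], "vae": ["1", 2]}},
        "7": {"class_type": "SaveImage",
              "inputs": {"images": ["6", 0], "filename_prefix": "ernie_first_frame"}},
    }


def queue_graph(graph: dict, server: str = "http://127.0.0.1:8188") -> bytes:
    """POST the graph to a running ComfyUI server's /prompt endpoint."""
    body = json.dumps({"prompt": graph}).encode("utf-8")
    request = urllib.request.Request(f"{server}/prompt", data=body,
                                     headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(request) as response:
        return response.read()


graph = build_still_graph("White ceramic coffee cup, on weathered oak table")
print(json.dumps(graph["5"]["inputs"], indent=2))
```

Calling `queue_graph(graph)` requires a ComfyUI instance listening on the given address; building the graph does not.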

LTX 2.3 Section:

Model: LTX-Video 2.3
First Frame: ERNIE-Image output
Motion Strength: 0.6-0.8 (adjust per scene)
Frames: 25 (~5 seconds at 5 fps; ~2-3 seconds at 8-12 fps)
FPS: 8-12 (video generation frame rate)
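Clip length follows directly from the frame count and frame rate, so it is worth checking the two settings together. A minimal sketch of the arithmetic (the helper names are illustrative):

```python
import math


def clip_seconds(frames: int, fps: float) -> float:
    """Duration of a clip with the given frame count and frame rate."""
    return frames / fps


def frames_needed(seconds: float, fps: float) -> int:
    """Smallest frame count that covers the requested duration."""
    return math.ceil(seconds * fps)


for fps in (5, 8, 12):
    print(f"25 frames at {fps} fps lasts {clip_seconds(25, fps):.1f} s")

print(frames_needed(5, 8), frames_needed(5, 12))  # 40 60
```

At 8-12 fps, a 5-second clip therefore needs 40-60 frames rather than 25.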

Part 3: Advanced Workflows — Multi-Model Combos

Workflow A: ERNIE-Image → LTX 2.3 → Topaz Upscale

ERNIE-Image (768x768) → LTX 2.3 (720p video) → Topaz Video AI (2K/4K)

Suitable for commercial scenarios requiring high-resolution output, such as product showcase videos and advertising materials.

Workflow B: ERNIE-Image → Wan 2.2 Cinematic Motion

ERNIE-Image (character/scene) → Wan 2.2 (60FPS smooth motion) → CapCut editing

Wan 2.2 excels in motion smoothness, ideal for character animation and product showcases.

Workflow C: ERNIE-Image → LTX 2.3 Lip Sync

ERNIE-Image (character portrait) → LTX 2.3 IC-LoRA (lip sync) → Audio alignment

LTX 2.3's IC-LoRA supports precise lip synchronization, suitable for AI avatars and virtual presenters.


Part 4: Practical Cases

Case 1: E-commerce Product Video

Requirement: Generate product showcase videos in different settings

Workflow:

  1. ERNIE-Image generates product shot:

    Prompt: White ceramic coffee cup, on weathered oak table, 
    morning kitchen setting, soft window light from left, 
    warm tones, shallow depth of field, 8K commercial photography
    
  2. LTX 2.3 adds motion:

    • Steam rising effect
    • Slow light changes
    • Subtle camera movement
  3. Topaz upscale to 2K

Result: A 5-second 2K product video ready for e-commerce pages
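Because the requirement is the same product in different settings, the prompt can be split into fixed and variable parts. A minimal sketch, assuming a simple comma-joined template; only the first setting comes from the prompt in step 1, and the other two are illustrative placeholders.

```python
PRODUCT = "White ceramic coffee cup"
STYLE = "warm tones, shallow depth of field, 8K commercial photography"

# Only the first entry is taken from the case above; the rest are examples.
SETTINGS = [
    "on weathered oak table, morning kitchen setting, soft window light from left",
    "on marble counter, minimalist cafe setting, soft overhead light",
    "on wooden desk, home office setting, soft lamp light from right",
]


def build_prompts(product: str, settings: list[str], style: str) -> list[str]:
    """Combine one product and one style with each setting."""
    return [f"{product}, {setting}, {style}" for setting in settings]


prompts = build_prompts(PRODUCT, SETTINGS, STYLE)
for prompt in prompts:
    print(prompt)
```

Keeping the product and style strings fixed helps the generated stills stay visually consistent across settings.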

Case 2: AI Anime Short Film

Requirement: Create a 30-second AI anime short

Workflow:

  1. ERNIE-Image generates keyframes (5 scenes × 2 images = 10 static images)
  2. LTX 2.3 generates a 5-second clip from each keyframe
  3. CapCut assembly + voiceover + subtitles

Key Prompt Examples:

Scene 1: Anime style, young woman standing under cherry blossom tree, 
         pink petals falling, soft light, shallow depth of field, 
         cinematic composition

Scene 2: Anime style, same character, walking down school corridor,
         sunlight streaming through windows, scattered books
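The shot count and running time for this case can be checked with a short plan. The sketch below assumes one 5-second clip per keyframe, which is one reading of the workflow above; the `Shot` structure is illustrative.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    scene: int
    variant: int
    clip_seconds: float = 5.0  # assumed: one 5-second clip per keyframe


def plan_shots(scenes: int = 5, images_per_scene: int = 2) -> list[Shot]:
    """List every keyframe as (scene, variant) in playback order."""
    return [Shot(scene=s, variant=v)
            for s in range(1, scenes + 1)
            for v in range(1, images_per_scene + 1)]


shots = plan_shots()
raw_seconds = sum(shot.clip_seconds for shot in shots)
target_seconds = 30.0
print(len(shots), raw_seconds, raw_seconds - target_seconds)  # 10 50.0 20.0
```

Under that assumption, 10 keyframes give 50 seconds of raw footage, leaving about 20 seconds to trim during assembly for a 30-second short.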

Case 3: Social Media Short Video

Requirement: Instagram/TikTok style portrait short videos

Workflow:

  1. ERNIE-Image portrait generation:

    Resolution: 768x1024 (3:4 portrait; crop to 9:16 for full-screen Reels/TikTok)
    Prompt: Cinematic night scene, city skyline, neon lights, 
            cyberpunk style, portrait composition
    
  2. Wan 2.2 adds dynamic effects:

    • Flickering lights
    • Moving clouds
    • Slow camera push-in
  3. CapCut adds music and captions


Part 5: Common Issues and Optimization

Q1: Text is blurry in videos?

Cause: Video generation models tend to blur text

Solutions:

  • Ensure text clarity in the ERNIE-Image stage (its strength)
  • Add text overlay layers in CapCut post-processing
  • Reduce motion strength to keep text areas stable

Q2: Character distortion in video?

Cause: Video models may alter character features during motion

Solutions:

  • Use ERNIE-Image character LoRA for more stable character images
  • Lower LTX/Wan motion strength parameters
  • Use ControlNet to constrain character morphology

Q3: Video too short?

Solutions:

  • Multi-clip assembly: Generate multiple 5-second clips, assemble in CapCut
  • Loop technique: Design matching first/last frames for seamless looping
  • Slow-motion: Reduce playback speed in post-production
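Multi-clip assembly can also be scripted instead of done by hand in CapCut. The sketch below swaps in ffmpeg's concat demuxer as an alternative: it writes the list file and builds the command without running it. The clip and output filenames are hypothetical, and `-c copy` assumes all clips share the same codec, resolution, and frame rate. The second helper shows the slow-motion arithmetic.

```python
import tempfile
from pathlib import Path


def write_concat_list(clips: list[str], list_path: Path,
                      output: str = "assembled.mp4") -> list[str]:
    """Write an ffmpeg concat-demuxer list file and return the command to run."""
    lines = [f"file '{clip}'" for clip in clips]
    list_path.write_text("\n".join(lines) + "\n", encoding="utf-8")
    # Relative clip paths are resolved against the list file's directory.
    return ["ffmpeg", "-f", "concat", "-safe", "0", "-i", str(list_path),
            "-c", "copy", output]


def slowed_duration(seconds: float, speed: float) -> float:
    """Playback length after changing speed (0.5 = half speed)."""
    return seconds / speed


list_path = Path(tempfile.mkdtemp()) / "clips.txt"
command = write_concat_list(["clip_01.mp4", "clip_02.mp4", "clip_03.mp4"], list_path)
print(" ".join(command))
print(slowed_duration(5.0, 0.5))  # 10.0
```

Slowing a clip stretches existing frames, so half speed on an 8-12 fps clip will look choppy unless frames are interpolated in post-production.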

Q4: Unnatural motion?

Solutions:

  • Adjust Motion Strength: too high = twitching, too low = still
  • Try different video models: Wan 2.2 for character motion, LTX for scene motion
  • Generate more dynamic compositions in the ERNIE-Image stage (tilted angles, motion blur prompts)
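Finding the right Motion Strength is easier with a small sweep across the 0.6-0.8 range from Part 2 than with one-off guesses. A minimal sketch that only produces the values to try; how each value is passed to the video node depends on the workflow in use.

```python
def motion_sweep(low: float = 0.6, high: float = 0.8, steps: int = 5) -> list[float]:
    """Evenly spaced motion-strength values from low to high, inclusive."""
    if steps < 2:
        return [low]
    stride = (high - low) / (steps - 1)
    return [round(low + i * stride, 2) for i in range(steps)]


print(motion_sweep())  # [0.6, 0.65, 0.7, 0.75, 0.8]
```

Rendering one short clip per value from the same first frame and seed makes the twitching and frozen extremes easy to spot side by side.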

Part 6: Hardware Requirements and Performance

Recommended Setup

Component   Minimum          Recommended
GPU         RTX 3060 12GB    RTX 4090 24GB
VRAM        12GB             24GB
RAM         16GB             32GB
Storage     50GB SSD         100GB NVMe

Performance Reference (RTX 4090 24GB)

  • ERNIE-Image generation (768×768, 25 steps): ~3-5 seconds
  • LTX 2.3 image-to-video (25 frames): ~30-60 seconds
  • Wan 2.2 image-to-video (25 frames): ~45-90 seconds
  • Topaz upscale (720p → 2K, 125 frames): ~2-5 minutes
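These figures can be added up for a rough per-clip budget on the ERNIE-Image → LTX 2.3 → Topaz path. The sketch treats each range as the cost of one clip, which is an assumption because the Topaz figure is quoted for 125 frames while the LTX figure is for 25.

```python
# (min_seconds, max_seconds) per stage, taken from the reference figures above.
STAGES = {
    "ERNIE-Image still": (3, 5),
    "LTX 2.3 image-to-video": (30, 60),
    "Topaz upscale": (120, 300),
}


def total_range(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Sum the best-case and worst-case times across all stages."""
    fastest = sum(low for low, _ in stages.values())
    slowest = sum(high for _, high in stages.values())
    return fastest, slowest


fastest, slowest = total_range(STAGES)
print(fastest, slowest)                  # 153 365
print(3600 // slowest, 3600 // fastest)  # 9 23
```

On these numbers one GPU finishes roughly 9 to 23 upscaled clips per hour, with upscaling taking most of the time.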

Part 7: Commercial Applications

1. E-commerce Product Videos

  • Cost comparison: Traditional product video $500-5,000/clip vs AI generation $0-5/clip (self-hosted)
  • Efficiency: From 3-5 day shooting cycle to 1-2 hours
  • Use case: Small businesses, independent brands, rapid market testing

2. Social Media Content

  • Platforms: Instagram Reels, TikTok, YouTube Shorts
  • Advantage: Batch generation, rapid iteration, no shooting location needed

3. Education & Training

  • Use case: Online course visualization, operation demos, concept explanations
  • Advantage: Text rendering + video = complete teaching materials

4. Advertising & Marketing

  • Use case: Social media ads, landing page videos, A/B test materials
  • Advantage: Quick generation of multiple versions for testing

Summary

The ERNIE-Image + video generation workflow combines high-quality static image generation with dynamic video effects. Key takeaways:

  1. ERNIE-Image is the best first frame: High-quality static images + text rendering = video quality ceiling
  2. Two-stage workflow beats end-to-end: Mid-process review, flexible modification, multi-model combos
  3. ComfyUI is the core hub: Unified platform connecting ERNIE-Image, LTX, Wan, Topaz
  4. Post-processing matters: CapCut editing, Topaz upscaling, audio alignment are essential for professional output

As LTX 2.3 and Wan 2.2 continue to improve, AI video generation quality is rapidly increasing. ERNIE-Image, as a high-quality open-source text-to-image model, will be an indispensable starting point in this workflow.


References: RunComfy ERNIE-Image ComfyUI Workflow, mimicpc.com workflow library, YouTube "How to Make Professional AI Animations in 2026", HuggingFace baidu/ERNIE-Image

ERNIE-Image Team