ERNIE-Image Technical Report Deep Dive: DPO Alignment, Aesthetic Evaluation, and Data Pipeline

May 30, 2026

The Baidu ERNIE-Image team published a comprehensive technical report on arXiv (arXiv:2605.25347) in May 2026, describing the training strategies, aesthetic evaluation system, and data pipeline design behind this 8B-parameter open-source text-to-image model. This article analyzes the core technical details to help developers and researchers understand how ERNIE-Image reaches performance close to closed-source flagship models with just 8B parameters.

Why This Technical Report Matters

In the AI image generation landscape, detailed technical reports are among the most valuable resources. Unlike closed systems such as Midjourney or DALL-E, which disclose few training details, ERNIE-Image documents its full pipeline, from data construction and pre-training strategy to alignment optimization, and releases the model as open source.

The report's core contributions can be summarized in three areas:

  1. DPO for Flow Matching: First adaptation of Direct Preference Optimization to the Flow Matching framework for diffusion model alignment
  2. ERNIE-Image-Aes aesthetic evaluation model: SRCC 0.7445, far exceeding traditional solutions like LAION AES (0.2944)
  3. Swiss-Tournament human annotation system: Using Swiss-system tournament format instead of Likert scoring to eliminate score drift

Architecture Overview: An Elegant 8B-Parameter Design

┌─────────────────────────────────────────┐
│              ERNIE-Image                 │
├────────────┬────────────┬───────────────┤
│  Ministral │   DiT      │  FLUX.2 VAE   │
│     -3     │   8B       │              │
│  (3B) Text │ Transformer│  Latent      │
│  Encoder   │            │  Autoencoder │
└─────┬──────┴─────┬──────┴──────┬───────┘
      │            │             │
┌─────▼──────┐ ┌──▼──────┐ ┌────▼───────┐
│ Ministral3 │ │ Flow    │ │ Autoencoder │
│ ForCausalLM│ │ Match   │ │ KL Flux 2   │
│ Prompt     │ │ Euler   │ │ (161 MB)    │
│ Enhancer   │ │ Scheduler│             │
│ (PE)       │ │         │              │
└────────────┘ └─────────┘ └────────────┘

Core components of ERNIE-Image:

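| Component       | Model                        | Size        | Role                        |
|-----------------|------------------------------|-------------|-----------------------------|
| DiT Transformer | ErnieImageTransformer2DModel | ~15 GB      | Diffusion backbone          |
| Text Encoder    | Ministral-3                  | 3B / 7.2 GB | Text encoding               |
| Prompt Enhancer | Ministral3ForCausalLM        | 3B / 7.2 GB | Automatic prompt enrichment |
| VAE             | AutoencoderKLFlux2           | 161 MB      | Latent autoencoder          |
| Total           | ~8B                          | ~29.5 GB    |                             |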

Design highlight: using Ministral-3 (3B) as the text encoder instead of a larger LLM reduces inference memory while keeping the ability to understand long and complex instructions. This reflects a "smaller model + better data" design philosophy.

Pre-training Data Pipeline: Bottom-Up Fine-Grained Construction

The pre-training data pipeline is the core innovation of the ERNIE-Image project. The report reveals four key stages:

Fine-Grained Classification System

Traditional diffusion models typically use coarse category labels (e.g., "landscape", "portrait"). ERNIE-Image built a classification system of 10,000 fine-grained visual categories:

├── Photography
│   ├── Portrait
│   │   ├── Studio Portrait
│   │   ├── Environmental Portrait
│   │   └── Candid Portrait
│   ├── Landscape
│   │   ├── Mountain Landscape
│   │   └── Coastal Landscape
│   └── ...
├── Illustration
│   ├── Anime/Manga
│   ├── Watercolor
│   └── ...
└── Graphic Design
    ├── Poster
    ├── Infographic
    └── ...

This fine-grained classification serves two critical purposes:

  1. Preserves long-tail concepts: Prevents dominant categories (e.g., "portrait") from overwhelming training
  2. Enables hierarchical sampling: Weighted sampling based on category quality and quantity

VLM Automatic Annotation

The team fine-tuned a VLM (Qwen3) as an annotator that extracts structured descriptions and any text content from images:

Input image: [Product poster with text "Summer Sale 50% OFF"]
VLM output: "A summer sale poster with bold red text reading 'Summer Sale
            50% OFF', featuring a minimalist design with product images and
            price tags."

This step is crucial for text rendering capability: the model needs to "see" what text appears in the training images in order to learn to generate correct text.

Aesthetic Scoring Filtering

Every image is scored by the ERNIE-Image-Aes model for quality. The sampling strategy is:

$$\text{Category Weight} = \text{Number of Images} \times \text{Average Aesthetic Score}$$

This means high-quality categories receive more training samples, while low-quality categories are naturally diluted.
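The weighting rule above can be sketched in a few lines. This is a minimal illustration that assumes the weights are normalized and then used for sampling with replacement; the function names and the per-category numbers are hypothetical, not values from the report.

```python
import random

def category_weights(stats):
    """Normalize (image count x average aesthetic score) per category.

    stats maps category name -> (num_images, avg_aesthetic_score).
    """
    raw = {name: count * score for name, (count, score) in stats.items()}
    total = sum(raw.values())
    return {name: weight / total for name, weight in raw.items()}

def sample_categories(stats, k, seed=0):
    """Draw k category names with probability proportional to their weight."""
    weights = category_weights(stats)
    names = list(weights)
    rng = random.Random(seed)
    return rng.choices(names, weights=[weights[n] for n in names], k=k)

# Hypothetical statistics for three fine-grained categories.
stats = {
    "Studio Portrait": (1000, 0.5),
    "Coastal Landscape": (500, 0.8),
    "Infographic": (100, 1.0),
}
print(category_weights(stats))
print(sample_categories(stats, k=5))
```

With these made-up numbers, the smaller but higher-scoring "Coastal Landscape" category ends up with a weight close to that of the category twice its size.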

Resolution Curriculum Learning

Pre-training uses three-stage progressive resolution scaling:

Stage 1: 256×256 → Learn basic composition and color
Stage 2: 512×512 → Learn details and textures
Stage 3: 1024×1024 → Learn fine features and text rendering

Key detail: training uses diverse aspect ratios, not just squares, significantly improving generation quality for non-square scenarios like posters and banners.
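One common way to combine a resolution curriculum with mixed aspect ratios is aspect-ratio bucketing: each stage keeps roughly the same pixel budget as its square resolution and varies width and height. The sketch below assumes that approach; the aspect-ratio list and the snapping multiple of 16 are hypothetical choices, not details from the report.

```python
STAGES = [256, 512, 1024]                              # base resolutions from the text
ASPECT_RATIOS = [1.0, 4 / 3, 3 / 4, 16 / 9, 9 / 16]    # hypothetical width/height ratios

def bucket_size(base, aspect_ratio, multiple=16):
    """Return (width, height) with area close to base*base, snapped to a multiple."""
    area = base * base
    width = (area * aspect_ratio) ** 0.5
    height = width / aspect_ratio

    def snap(value):
        return max(multiple, int(round(value / multiple)) * multiple)

    return snap(width), snap(height)

def curriculum():
    """Map each stage's base resolution to its list of (width, height) buckets."""
    return {base: [bucket_size(base, ar) for ar in ASPECT_RATIOS] for base in STAGES}

for base, buckets in curriculum().items():
    print(base, buckets)
```

Keeping the pixel count roughly constant within a stage keeps the per-batch compute comparable across buckets, which is the usual motivation for this scheme.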

DPO for Flow Matching: Aligning Diffusion Models

This is one of the most important technical innovations in this report. Traditional DPO (Direct Preference Optimization) is used for LLM alignment; the ERNIE-Image team is the first to extend it to the Flow Matching framework.

Why DPO?

Pre-training makes the model "able to generate," but not necessarily "generate well." DPO aims to make model outputs more aligned with human aesthetic preferences:

  • Reduce malformed fingers and extra limbs
  • Improve color harmony
  • Enhance compositional balance

Technical Implementation

In the Flow Matching framework, the log-likelihood terms of standard DPO are replaced with L2 velocity reconstruction errors:

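# Simplified DPO for Flow Matching (illustrative sketch, not the report's exact loss)
import torch
import torch.nn.functional as F

def dpo_loss(v_chosen, v_rejected, v_ref, beta=1.0):
    """
    v_chosen: velocity field from human-preferred generation, shape (B, ...)
    v_rejected: velocity field from human-dispreferred generation, shape (B, ...)
    v_ref: velocity field from reference model (prevents reward hacking)
    beta: preference temperature; 1.0 is a placeholder, not a value from the report
    """
    # Per-sample L2 velocity error, averaged over all non-batch dimensions
    chosen_error = (v_chosen - v_ref).pow(2).flatten(1).mean(dim=1)
    rejected_error = (v_rejected - v_ref).pow(2).flatten(1).mean(dim=1)
    # The loss shrinks as the chosen error drops below the rejected error
    loss = -F.logsigmoid(beta * (rejected_error - chosen_error))
    return loss.mean()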

The report specifically mentions Anchor Losses to prevent reward hacking and representation collapse — classic problems in large-scale DPO training.
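The simplified snippet above measures velocities directly against the reference model. For comparison, the diffusion DPO literature (Diffusion-DPO) usually measures both the trained model and the frozen reference against a ground-truth target, then rewards the trained model for improving more on the preferred sample than on the rejected one. The sketch below applies that form to velocity targets. Whether the report uses exactly this form is an assumption, and the function names and the default beta are hypothetical. Anchor losses are not shown.

```python
import math

def mse(pred, target):
    """Mean squared error between two equal-length velocity vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def flow_dpo_loss(policy_w, policy_l, ref_w, ref_l, target_w, target_l, beta=1.0):
    """Diffusion-DPO-style preference loss on velocity errors.

    *_w: velocities for the preferred (winning) sample
    *_l: velocities for the rejected (losing) sample
    target_*: ground-truth flow-matching velocity for that sample
    """
    # How much the trained model improves on the frozen reference for each sample.
    gain_w = mse(ref_w, target_w) - mse(policy_w, target_w)
    gain_l = mse(ref_l, target_l) - mse(policy_l, target_l)
    # The loss decreases when the improvement is larger on the preferred sample.
    return -log_sigmoid(beta * (gain_w - gain_l))

# Toy example: the policy matches the preferred target better than the reference does.
loss = flow_dpo_loss(
    policy_w=[1.0, 0.0], policy_l=[0.0, 0.0],
    ref_w=[0.0, 0.0], ref_l=[0.0, 0.0],
    target_w=[1.0, 0.0], target_l=[0.0, 1.0],
)
print(loss)
```

When the trained model equals the reference, both gains are zero and the loss is log 2, so training starts from a neutral point and only moves as preferences are learned.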

SFT Stage: K2.5 VLM Rewrites Prompts

Before DPO, ERNIE-Image undergoes SFT (Supervised Fine-Tuning), in which the K2.5 VLM rewrites raw captions into diverse user-style prompts:

Original caption: "A red sports car parked on a mountain road at sunset"

Rewritten as:

  • Keyword style: "red sports car, mountain road, sunset, dynamic angle"
  • Natural language: "A sleek red sports car parked on a winding mountain
    road during golden hour, with dramatic clouds in the background"
  • Instruction style: "Generate a photorealistic image of a red sports car
    on a mountain road at sunset, cinematic lighting, 8K quality"

Training on these diverse prompt formats significantly improves the model's robustness to real-world user inputs.

MT-DMD: Multi-Teacher Distillation for 8-Step Turbo

ERNIE-Image-Turbo's 8-step generation capability comes from Multi-Teacher Distillation (MT-DMD) technology.

Limitations of Single-Teacher Distillation

Traditional knowledge distillation uses a single "teacher model" to guide a "student model." But in diffusion models, different diffusion steps require different capabilities:

  • Early steps (t=50→30): Overall composition and layout
  • Mid steps (t=30→15): Texture and details
  • Late steps (t=15→1): Text rendering and edge sharpening

A single teacher model struggles to excel at all stages.

MT-DMD Solution

MT-DMD uses a committee of domain experts with dynamic routing at different diffusion steps:

┌──────────────────────────────────────┐
│         Student Model (Turbo)        │
│           8 Inference Steps          │
└──────────┬────────────┬──────────────┘
           │            │
    ┌──────▼──────┐ ┌───▼────────────┐
    │ Teacher A   │ │ Teacher B      │
    │ (Layout)    │ │ (Texture)      │
    │ Steps 50-30 │ │ Steps 30-15    │
    └─────────────┘ └────────────────┘
           │
    ┌──────▼────────────┐
    │ Teacher C         │
    │ (Text/Edge)       │
    │ Steps 15-1        │
    └───────────────────┘

This design allows ERNIE-Image-Turbo to maintain aesthetic quality comparable to the 50-step Base version while using only 8 steps.
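The routing idea can be illustrated with a small lookup from timestep to teacher. This is a sketch under stated assumptions: the step ranges come from the diagram above, boundary steps (30 and 15) are assigned to the later teacher, the student's 8 steps are spread evenly over the 50-step timeline, and routing is a hard switch. The names are hypothetical and the distillation loss itself is not shown.

```python
# (teacher name, first step, last step) on the 50-step timeline, high noise to low.
TEACHERS = [
    ("layout", 50, 31),
    ("texture", 30, 16),
    ("text_edge", 15, 1),
]

def route_teacher(step):
    """Return the name of the teacher responsible for a given timestep."""
    for name, first, last in TEACHERS:
        if last <= step <= first:
            return name
    raise ValueError(f"step {step} is outside the 1..50 range")

def student_schedule(num_steps=8, teacher_steps=50):
    """Evenly spaced teacher timesteps visited by the student, from high to low."""
    stride = (teacher_steps - 1) / (num_steps - 1)
    return [round(teacher_steps - i * stride) for i in range(num_steps)]

for step in student_schedule():
    print(step, route_teacher(step))
```

Under these assumptions each student step is supervised by exactly one teacher, so every teacher only has to be good within its own noise range.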

ERNIE-Image-Aes: Breakthrough in Aesthetic Evaluation

Why Build a Custom Aesthetic Model?

Existing aesthetic scoring models (LAION AES, ArtiMuse, UniPercept) have significant biases:

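| Model           | SRCC   | PLCC   | Main Bias                             |
|-----------------|--------|--------|---------------------------------------|
| LAION AES       | 0.2944 | 0.3138 | Over-prefers AI-generated content     |
| ArtiMuse        | 0.4277 | 0.4704 | Over-prefers B&W and casual snapshots |
| UniPercept      | 0.4533 | 0.4748 | Same as ArtiMuse                      |
| ERNIE-Image-Aes | 0.7445 | 0.7598 | Minimal bias                          |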

ERNIE-Image-Aes raises SRCC from about 0.45 for the best baseline (UniPercept) to 0.7445, which means its score rankings agree with human preference rankings far more closely.
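For readers unfamiliar with the metrics in the table: SRCC is the Spearman rank correlation coefficient, which is the Pearson correlation (PLCC) computed on ranks instead of raw scores. The sketch below shows the standard definition with average ranks for ties; the example scores are made up and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def plcc(x, y):
    """Pearson linear correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5

def srcc(predicted, human):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    return plcc(average_ranks(predicted), average_ranks(human))

# Made-up example: model scores versus human aesthetic levels for five images.
model_scores = [1, 2, 3, 4, 5]
human_levels = [2, 1, 4, 3, 5]
print(srcc(model_scores, human_levels))  # 0.8
```

Because SRCC only depends on ordering, it rewards a model for ranking images the way humans do even if its absolute score scale differs.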

Swiss-Tournament Annotation System

Traditional aesthetic annotation uses Likert scoring (1-5 points), which suffers from severe score drift: rating standards vary across annotators and over time. ERNIE-Image adopted a Swiss-system tournament instead:

Round 1: Image A vs Image B → Winner scores a point
Round 2: Images with similar win counts are paired → Winner scores again
...
Final ranking: Aesthetic level 1-10 based on number of wins

Advantages:

  • Relative comparison is more reliable: Humans are better at comparing than absolute scoring
  • Score drift auto-eliminated: Same image maintains consistency across different round comparisons
  • Computational efficiency: A Swiss-system tournament requires fewer comparisons than Elo-based ranking
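A minimal sketch of Swiss-system pairing, under stated assumptions: images with similar win counts are paired each round, nobody is eliminated, rematches are avoided where possible, and the number of rounds defaults to about log2(N). The `prefer` callback stands in for a human annotator's choice. All names are hypothetical, and the mapping from wins to aesthetic levels 1-10 is not shown.

```python
import math

def swiss_tournament(items, prefer, rounds=None):
    """Run a Swiss-system tournament and return {item: number_of_wins}.

    items: distinct hashable image IDs
    prefer: callable (a, b) -> the preferred item (the annotator's choice)
    """
    wins = {item: 0 for item in items}
    if len(items) < 2:
        return wins
    if rounds is None:
        rounds = math.ceil(math.log2(len(items)))
    played = set()
    for _ in range(rounds):
        # Sort by current wins so that neighbours have similar scores.
        unpaired = sorted(items, key=lambda item: -wins[item])
        while len(unpaired) >= 2:
            a = unpaired.pop(0)
            # Pick the closest-ranked opponent not faced yet; fall back to a rematch.
            idx = next(
                (i for i, b in enumerate(unpaired) if frozenset((a, b)) not in played),
                0,
            )
            b = unpaired.pop(idx)
            played.add(frozenset((a, b)))
            wins[prefer(a, b)] += 1
        # With an odd item count, the leftover item sits this round out.
    return wins

# Toy run: image IDs 0..7, where a higher ID is always preferred.
result = swiss_tournament(list(range(8)), prefer=max)
print(result)
```

In this toy run, 8 images need 3 rounds of 4 comparisons, 12 in total, whereas comparing every pair once would take 28.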

ERIA-1K Benchmark

ERIA-1K is the team's human-annotated benchmark with 1,000 images reflecting real-world distributions:

| Category            | Percentage |
|---------------------|------------|
| Photography         | 49.28%     |
| Illustration/Anime  | 23.16%     |
| Graphic Design      | 11.14%     |
| Mixed Web           | 10.44%     |
| Film Photography    | 5.42%      |
| Product/Collectible | 0.56%      |

Key design choice: the benchmark is not limited to professional photography; it covers common real-world image types, which makes the evaluation relevant to production use.

Performance Benchmarks: 8B Parameters in Practice

Human Evaluation (Internal Test Set)

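| Model                    | Total | Spatial | World | Physics | Aesthetic | Style | Creativity | Knowledge |
|--------------------------|-------|---------|-------|---------|-----------|-------|------------|-----------|
| Nano Banana 2.0 (Closed) | 5.39  | 95.54   | 98.51 | 95.24   | 91.37     | 90.77 | 67.86      | 99.40     |
| ERNIE-Image (Open)       | 5.07  | 89.88   | 94.05 | 92.56   | 83.04     | 84.82 | 62.80      | 95.24     |
| Seedream 5.0 (Closed)    | 5.03  | 90.48   | 97.32 | 91.96   | 80.65     | 81.55 | 61.01      | 97.02     |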

Key conclusion: ERNIE-Image is currently the closest open-source model to top closed-source systems, with a total score only ~0.32 points behind Nano Banana 2.0.

Quantitative Benchmarks

  • GenEval (General Synthesis): 0.89 overall (highest among open-source)
  • LongText-Bench (Text Rendering): 0.973 (w/ PE), EN: 0.980, ZH: 0.966
  • OneIG-Bench (Semantic/Style Alignment): 0.575 (EN), 0.554 (ZH), leading open-source

Summary: Technical Insights from ERNIE-Image

The ERNIE-Image technical report conveys a clear message: data quality and training strategy matter more than raw parameter count.

  1. 8B parameters are sufficient: Through refined data pipelines and alignment training, an 8B model can approach larger closed-source systems
  2. DPO for Diffusion is viable: First proof of DPO effectiveness in the Flow Matching framework
  3. Custom aesthetic evaluation is necessary: General-purpose aesthetic models have severe biases; building your own evaluator is key to improving generation quality
  4. Multi-teacher distillation beats single-teacher: Different diffusion steps need different capabilities; dynamic routing significantly improves distillation efficiency

For developers and researchers, ERNIE-Image's open-source code and detailed technical report provide a complete reference implementation for building high-quality diffusion models.

