ERNIE-Image Technical Report Deep Dive: DPO Alignment, Aesthetic Evaluation, and Data Pipeline

The Baidu ERNIE-Image team published a comprehensive technical report on arXiv (arXiv:2605.25347) in May 2026, revealing the training strategies, aesthetic evaluation systems, and data pipeline design behind this 8B-parameter open-source text-to-image model. This article analyzes the core technical details to help developers and researchers understand how ERNIE-Image approaches the performance of closed-source flagship models with just 8B parameters.

Why This Technical Report Matters

In AI image generation, detailed technical reports are among the most valuable resources. Unlike Midjourney or DALL-E, whose training details remain undisclosed, ERNIE-Image chose full transparency: data construction, pre-training strategies, and alignment optimization are all documented alongside the open-source release.

The report's core contributions can be summarized in three areas:

DPO for Flow Matching: First adaptation of Direct Preference Optimization to the Flow Matching framework for diffusion model alignment
ERNIE-Image-Aes aesthetic evaluation model: SRCC 0.7445, far exceeding traditional solutions like LAION AES (0.2944)
Swiss-Tournament human annotation system: Using Swiss-system tournament format instead of Likert scoring to eliminate score drift

Architecture Overview: An Elegant 8B-Parameter Design

┌────────────────────────────────────────────┐
│                ERNIE-Image                 │
├──────────────┬──────────────┬──────────────┤
│ Ministral-3  │ DiT          │ FLUX.2 VAE   │
│ (3B) Text    │ 8B           │ Latent       │
│ Encoder      │ Transformer  │ Autoencoder  │
└──────┬───────┴──────┬───────┴──────┬───────┘
       │              │              │
┌──────▼─────┐ ┌──────▼─────┐ ┌──────▼─────┐
│ Ministral3 │ │ Flow       │ │ Autoencoder│
│ ForCausalLM│ │ Match      │ │ KL Flux 2  │
│ Prompt     │ │ Euler      │ │ (161 MB)   │
│ Enhancer   │ │ Scheduler  │ │            │
│ (PE)       │ │            │ │            │
└────────────┘ └────────────┘ └────────────┘

Core components of ERNIE-Image:

Component	Model	Size	Role
DiT Transformer	ErnieImageTransformer2DModel	~15 GB	Diffusion backbone
Text Encoder	Ministral-3	3B / 7.2 GB	Text encoding
Prompt Enhancer	Ministral3ForCausalLM	3B / 7.2 GB	Automatic prompt enrichment
VAE	AutoencoderKLFlux2	161 MB	Latent autoencoder
Total		~8B (DiT) / ~29.5 GB	All components

Design highlight: using Ministral-3 (3B) as the text encoder instead of a larger LLM significantly reduces inference memory while preserving the ability to understand long and complex instructions. This reflects a "smaller model + better data" design philosophy.

Pre-training Data Pipeline: Bottom-Up Fine-Grained Construction

The pre-training data pipeline is the core innovation of the ERNIE-Image project. The report reveals four key stages:

Fine-Grained Classification System

Traditional diffusion models typically use coarse category labels (e.g., "landscape", "portrait"). ERNIE-Image built a classification system of 10,000 fine-grained visual categories:

├── Photography
│   ├── Portrait
│   │   ├── Studio Portrait
│   │   ├── Environmental Portrait
│   │   └── Candid Portrait
│   ├── Landscape
│   │   ├── Mountain Landscape
│   │   └── Coastal Landscape
│   └── ...
├── Illustration
│   ├── Anime/Manga
│   ├── Watercolor
│   └── ...
└── Graphic Design
    ├── Poster
    ├── Infographic
    └── ...

This fine-grained classification serves two critical purposes:

Preserves long-tail concepts: Prevents dominant categories (e.g., "portrait") from overwhelming training
Enables hierarchical sampling: Weighted sampling based on category quality and quantity

VLM Automatic Annotation

The team fine-tuned a powerful VLM (Qwen3) as an annotator, specifically extracting structured descriptions and text content from images:

Input image: [Product poster with text "Summer Sale 50% OFF"]
VLM output: "A summer sale poster with bold red text reading 'Summer Sale
            50% OFF', featuring a minimalist design with product images and
            price tags."

This step is crucial for text rendering: the model must "see" which text appears in each training image in order to learn to render text correctly.
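As a rough illustration of what a structured annotation might look like, the sketch below pairs a caption with the text strings found in the image and checks that the caption quotes each of them. The `ImageAnnotation` class, its field names, and the `caption_covers_text` helper are hypothetical; the report's actual annotation schema is not described in this article.

```python
from dataclasses import dataclass, field


@dataclass
class ImageAnnotation:
    """Hypothetical record for one VLM-annotated image (the schema is an assumption)."""
    caption: str
    rendered_text: list[str] = field(default_factory=list)


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so line wraps do not affect matching."""
    return " ".join(text.split()).lower()


def caption_covers_text(ann: ImageAnnotation) -> bool:
    """Return True if every text string found in the image appears in the caption."""
    caption = _normalize(ann.caption)
    return all(_normalize(t) in caption for t in ann.rendered_text)


poster = ImageAnnotation(
    caption=(
        "A summer sale poster with bold red text reading 'Summer Sale 50% OFF', "
        "featuring a minimalist design with product images and price tags."
    ),
    rendered_text=["Summer Sale 50% OFF"],
)
print(caption_covers_text(poster))
```

A check of this kind could flag captions that omit on-image text before they reach training, though whether the team applies such a filter is not stated.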

Aesthetic Scoring Filtering

Every image is scored by the ERNIE-Image-Aes model for quality. The sampling strategy is:

$$\text{Category Weight} = \text{Number of Images} \times \text{Average Aesthetic Score}$$

This means high-quality categories receive more training samples, while low-quality categories are naturally diluted.
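The weighting formula above can be sketched in a few lines. The category statistics below are made-up numbers for illustration, and the function names are hypothetical; only the rule "weight = image count × average aesthetic score" comes from the text.

```python
import random


def category_weights(stats):
    """stats maps category name -> (image_count, mean_aesthetic_score)."""
    return {name: count * score for name, (count, score) in stats.items()}


def sampling_probabilities(weights):
    """Normalize raw weights so they sum to 1."""
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}


# Illustrative numbers only, not taken from the report
stats = {
    "Studio Portrait": (1000, 6.0),
    "Coastal Landscape": (500, 8.0),
    "Infographic": (200, 5.0),
}
weights = category_weights(stats)
probs = sampling_probabilities(weights)

# Draw a small batch of categories according to the normalized weights
rng = random.Random(0)
batch = rng.choices(list(probs), weights=list(probs.values()), k=8)
print(weights)
print(batch)
```

In this toy setup, "Coastal Landscape" has half as many images as "Studio Portrait" but two thirds of its weight, because its higher average score partly offsets the smaller count.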

Resolution Curriculum Learning

Pre-training uses a three-stage progressive resolution scaling:

Stage 1: 256×256 → Learn basic composition and color
Stage 2: 512×512 → Learn details and textures
Stage 3: 1024×1024 → Learn fine features and text rendering

Key detail: training uses diverse aspect ratios, not just squares, significantly improving generation quality for non-square scenarios like posters and banners.
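One common way to combine a resolution curriculum with mixed aspect ratios is to keep the pixel budget of each stage roughly constant and vary width and height. The sketch below assumes that approach; the aspect-ratio list, the rounding to multiples of 16, and the function name are illustrative assumptions, not details from the report.

```python
import math

STAGES = [256, 512, 1024]  # base side length per stage, from the three stages above
ASPECT_RATIOS = [(1, 1), (4, 3), (3, 4), (16, 9), (9, 16)]  # assumed set of ratios


def bucket_size(base, ratio, multiple=16):
    """Return (width, height) with about base*base pixels at the given aspect ratio."""
    ratio_w, ratio_h = ratio
    area = base * base
    width = math.sqrt(area * ratio_w / ratio_h)
    height = area / width

    def snap(value):
        # Round to the nearest multiple so sizes stay compatible with latent patching
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


for base in STAGES:
    print(base, [bucket_size(base, r) for r in ASPECT_RATIOS])
```

Each stage then trains on every bucket at its resolution, so a 16:9 banner is seen at a comparable pixel count to a square image rather than being cropped to fit.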

DPO for Flow Matching: Aligning Diffusion Models

This is one of the most important technical innovations in this report. Traditional DPO (Direct Preference Optimization) is used for LLM alignment; the ERNIE-Image team is the first to extend it to the Flow Matching framework.

Why DPO?

Pre-training makes the model "able to generate," but not necessarily "generate well." DPO aims to make model outputs more aligned with human aesthetic preferences:

Reduce malformed fingers and extra limbs
Improve color harmony
Enhance compositional balance

Technical Implementation

In the Flow Matching framework, DPO keeps its pairwise preference objective but replaces the log-likelihood terms of the original LLM formulation with L2 velocity reconstruction errors:

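# Simplified DPO for Flow Matching (illustrative sketch, not the report's exact loss)
import torch.nn.functional as F

def dpo_loss(v_chosen, v_rejected, v_ref, beta=1.0):
    """
    v_chosen: velocity field from human-preferred generation
    v_rejected: velocity field from human-dispreferred generation
    v_ref: velocity field from reference model (prevents reward hacking)
    beta: preference temperature (the default value is a placeholder)
    """
    # Mean squared (L2) error of each velocity field against the reference
    chosen_error = F.mse_loss(v_chosen, v_ref)
    rejected_error = F.mse_loss(v_rejected, v_ref)
    # The loss shrinks as the chosen error falls below the rejected error
    return -F.logsigmoid(beta * (rejected_error - chosen_error))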

The report specifically mentions Anchor Losses to prevent reward hacking and representation collapse — classic problems in large-scale DPO training.
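The simplified function above compresses the roles of the trainable model and the reference model into a single comparison. A slightly fuller sketch, following the general shape of Diffusion-DPO-style objectives rather than the report's exact formulation, compares how well the trainable policy and the frozen reference each reconstruct the target velocity for the preferred and dispreferred samples. All names here are hypothetical, `beta=1.0` is a placeholder, and anchor losses are not included.

```python
import math


def sq_error(pred, target):
    """Squared L2 distance between two velocity vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))


def flow_dpo_loss(policy_w, ref_w, target_w, policy_l, ref_l, target_l, beta=1.0):
    """Preference loss for one (preferred w, dispreferred l) pair at one timestep.

    policy_* / ref_*: velocities predicted by the trainable and frozen reference models
    target_*: flow-matching target velocities for the noised preferred/dispreferred images
    """
    # Negative values mean the policy reconstructs the target better than the reference
    gain_w = sq_error(policy_w, target_w) - sq_error(ref_w, target_w)
    gain_l = sq_error(policy_l, target_l) - sq_error(ref_l, target_l)
    # Reward improving on the preferred sample more than on the dispreferred one
    return -log_sigmoid(-beta * (gain_w - gain_l))


target = [1.0, 0.0]
ref = [0.5, 0.0]
baseline = flow_dpo_loss(ref, ref, target, ref, ref, target)      # policy equals reference
improved = flow_dpo_loss(target, ref, target, ref, ref, target)   # policy fits preferred sample
print(baseline, improved)
```

When the policy matches the reference, the loss equals log 2; it drops below that once the policy fits the preferred sample better than the reference does.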

SFT Stage: K2.5 VLM Rewrites Prompts

Before DPO, ERNIE-Image undergoes SFT (Supervised Fine-Tuning), using K2.5 VLM to rewrite raw captions into diverse user-style prompts:

Original caption: "A red sports car parked on a mountain road at sunset"
Rewritten as:

Keyword style: "red sports car, mountain road, sunset, dynamic angle"
Natural language: "A sleek red sports car parked on a winding mountain road during golden hour, with dramatic clouds in the background"
Instruction style: "Generate a photorealistic image of a red sports car on a mountain road at sunset, cinematic lighting, 8K quality"

Training on these diverse prompt formats significantly improves the model's robustness to real-world user inputs.
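A data pipeline for this kind of rewriting might pick a target style per caption and build the request sent to the rewriting model. The sketch below only constructs that request; the instruction wording, the style keys, and the function name are hypothetical, and the call to the rewriting model itself is left out.

```python
import random

# Hypothetical rewrite instructions, one per prompt style named above
STYLE_INSTRUCTIONS = {
    "keyword": "Rewrite the caption as a comma-separated list of short keywords.",
    "natural": "Rewrite the caption as one fluent descriptive sentence.",
    "instruction": "Rewrite the caption as a direct request to an image generator.",
}


def build_rewrite_request(caption, rng):
    """Pick a style at random and return (style, text to send to the rewriting model)."""
    style = rng.choice(sorted(STYLE_INSTRUCTIONS))
    return style, f"{STYLE_INSTRUCTIONS[style]}\nCaption: {caption}"


rng = random.Random(0)
caption = "A red sports car parked on a mountain road at sunset"
style, request = build_rewrite_request(caption, rng)
print(style)
print(request)
```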

MT-DMD: Multi-Teacher Distillation for 8-Step Turbo

ERNIE-Image-Turbo's 8-step generation capability comes from Multi-Teacher Distillation (MT-DMD) technology.

Limitations of Single-Teacher Distillation

Traditional knowledge distillation uses a single "teacher model" to guide a "student model." But in diffusion models, different diffusion steps require different capabilities:

Early steps (t=50→30): Overall composition and layout
Mid steps (t=30→15): Texture and details
Late steps (t=15→1): Text rendering and edge sharpening

A single teacher model struggles to excel at all stages.

MT-DMD Solution

MT-DMD uses a committee of domain experts with dynamic routing at different diffusion steps:

┌──────────────────────────────────────┐
│         Student Model (Turbo)        │
│           8 Inference Steps          │
└──────────┬────────────┬──────────────┘
           │            │
    ┌──────▼──────┐ ┌───▼────────────┐
    │ Teacher A   │ │ Teacher B      │
    │ (Layout)    │ │ (Texture)      │
    │ Steps 50-30 │ │ Steps 30-15    │
    └─────────────┘ └────────────────┘
           │
    ┌──────▼────────────┐
    │ Teacher C         │
    │ (Text/Edge)       │
    │ Steps 15-1        │
    └───────────────────┘

This design allows ERNIE-Image-Turbo to maintain aesthetic quality comparable to the 50-step Base version while using only 8 steps.
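The routing idea can be sketched as a lookup from timestep to teacher. The step ranges come from the diagram above; everything else is an assumption made for illustration: the teacher names, the rule that a boundary step (30 or 15) goes to the earlier teacher, the evenly spaced student schedule, and the use of fixed ranges at all, since the report's routing may be learned or weighted rather than hard-coded.

```python
def route_teacher(t):
    """Return the teacher assumed responsible for teacher-timestep t (1..50)."""
    if not 1 <= t <= 50:
        raise ValueError(f"timestep out of range: {t}")
    if t >= 30:
        return "layout"      # Teacher A, steps 50-30
    if t >= 15:
        return "texture"     # Teacher B, steps 30-15
    return "text_edge"       # Teacher C, steps 15-1


def student_timesteps(num_steps=8, t_max=50, t_min=1):
    """Spread the student's steps evenly over the teacher's 50-step schedule."""
    span = t_max - t_min
    return [round(t_max - i * span / (num_steps - 1)) for i in range(num_steps)]


for t in student_timesteps():
    print(t, route_teacher(t))
```

Under these assumptions the 8 student steps land on timesteps 50, 43, 36, 29, 22, 15, 8, and 1, so three steps are supervised by the layout teacher, three by the texture teacher, and two by the text/edge teacher.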

ERNIE-Image-Aes: Breakthrough in Aesthetic Evaluation

Why Build a Custom Aesthetic Model?

Existing aesthetic scoring models (LAION AES, ArtiMuse, UniPercept) have significant biases:

Model	SRCC	PLCC	Main Bias
LAION AES	0.2944	0.3138	Over-prefers AI-generated content
ArtiMuse	0.4277	0.4704	Over-prefers B&W and casual snapshots
UniPercept	0.4533	0.4748	Same as above
ERNIE-Image-Aes	0.7445	0.7598	Minimal bias

ERNIE-Image-Aes raises SRCC (the Spearman rank correlation between model scores and human judgments) from roughly 0.45 for the strongest baseline to 0.7445, meaning its rankings agree with human preferences far more closely.
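For readers unfamiliar with the two metrics in the table, PLCC is the Pearson linear correlation between model scores and human scores, and SRCC is the same correlation computed on ranks. The sketch below is a minimal pure-Python version with made-up scores; in practice a library routine such as SciPy's would normally be used.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


human = [1, 2, 3, 4, 5]               # illustrative human aesthetic levels
model = [1.0, 2.0, 3.0, 4.0, 50.0]    # same ordering, one exaggerated score
print(spearman(human, model), pearson(human, model))
```

The toy data shows why both are reported: the model orders the images exactly as humans do, so SRCC is 1, while the single exaggerated score pulls PLCC well below 1.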

Swiss-Tournament Annotation System

Traditional aesthetic annotation uses Likert scoring (1-5 points), which suffers from severe score drift — different annotators and time periods have inconsistent rating standards. ERNIE-Image adopted a Swiss-system tournament:

Round 1: Images are paired (Image A vs Image B) → the preferred image scores a win
Round 2: Images with similar win counts are paired → the preferred image scores another win
...
Final ranking: Aesthetic level 1-10 based on number of wins

Advantages:

Relative comparison is more reliable: Humans are better at comparing than absolute scoring
Score drift is avoided: Each judgment is relative to another image, so there is no absolute scale that can shift between annotators or sessions
Computational efficiency: Swiss-system pairing requires fewer comparisons than Elo ranking
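A minimal sketch of Swiss-style pairing is shown below. It is not the team's annotation tool: the function name is hypothetical, `prefer` stands in for a human judgment, and the sketch assumes an even number of images with no rematch avoidance or tie-breaking.

```python
import math
import random


def swiss_tournament(items, prefer, rounds=None, seed=0):
    """Rank items by win count using Swiss-style pairing.

    prefer(a, b) returns the preferred item; in annotation this is a human choice.
    Simplifications: even number of items, no rematch avoidance, no tie-breaks.
    """
    rng = random.Random(seed)
    wins = {item: 0 for item in items}
    rounds = rounds or math.ceil(math.log2(len(items)))
    comparisons = 0
    for _ in range(rounds):
        # Shuffle first so items with equal wins are ordered randomly, then sort by wins
        order = list(items)
        rng.shuffle(order)
        order.sort(key=lambda item: wins[item], reverse=True)
        # Pair neighbors, so each image meets one with a similar record
        for a, b in zip(order[0::2], order[1::2]):
            wins[prefer(a, b)] += 1
            comparisons += 1
    return wins, comparisons


items = list(range(16))                      # 16 images; a larger id means "better"
wins, comparisons = swiss_tournament(items, prefer=max)
round_robin = len(items) * (len(items) - 1) // 2
print(comparisons, round_robin)
```

With 16 images, four rounds of eight comparisons give 32 judgments, against 120 if every image were compared with every other. The resulting win counts could then be binned into aesthetic levels, as the final-ranking step above describes.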

ERIA-1K Benchmark

ERIA-1K is the team's human-annotated benchmark with 1,000 images reflecting real-world distributions:

Category	Percentage
Photography	49.28%
Illustration/Anime	23.16%
Graphic Design	11.14%
Mixed Web	10.44%
Film Photography	5.42%
Product/Collectible	0.56%

Key design: the benchmark is not limited to professional photography; it covers the image types commonly seen in real-world use, which makes the evaluation more relevant to industrial applications.

Performance Benchmarks: 8B Parameters in Practice

Human Evaluation (Internal Test Set)

Model	Total	Spatial	World	Physics	Aesthetic	Style	Creativity	Knowledge
Nano Banana 2.0 (Closed)	5.39	95.54	98.51	95.24	91.37	90.77	67.86	99.40
ERNIE-Image (Open)	5.07	89.88	94.05	92.56	83.04	84.82	62.80	95.24
Seedream 5.0 (Closed)	5.03	90.48	97.32	91.96	80.65	81.55	61.01	97.02

Key conclusion: ERNIE-Image is currently the closest open-source model to top closed-source systems, with a total score only ~0.32 points behind Nano Banana 2.0.

Quantitative Benchmarks

GenEval (General Synthesis): 0.89 overall (highest among open-source)
LongText-Bench (Text Rendering): 0.973 (w/ PE), EN: 0.980, ZH: 0.966
OneIG-Bench (Semantic/Style Alignment): 0.575 (EN), 0.554 (ZH), leading open-source

Summary: Technical Insights from ERNIE-Image

The ERNIE-Image technical report conveys a clear message: data quality and training strategy matter more than raw parameter count.

8B parameters are sufficient: Through refined data pipelines and alignment training, an 8B model can approach larger closed-source systems
DPO for Diffusion is viable: First proof of DPO effectiveness in the Flow Matching framework
Custom aesthetic evaluation is necessary: General-purpose aesthetic models have severe biases; building your own evaluator is key to improving generation quality
Multi-teacher distillation beats single-teacher: Different diffusion steps need different capabilities; dynamic routing significantly improves distillation efficiency

For developers and researchers, ERNIE-Image's open-source code and detailed technical report provide a complete reference implementation for building high-quality diffusion models.

Further Reading:

Original technical report: arXiv:2605.25347

HuggingFace model: baidu/ERNIE-Image

GitHub repository: baidu/ernie-image

Related: EI-034 SGLang Production Deployment, EI-028 NVFP4 Quantization Guide